This post lists the latest papers retrieved from arXiv.org on 2026-06-23. It is updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: the daily paper data is fetched from arXiv.org, and a scheduled automatic update runs at around 12:30 every morning.

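Tip: if the list is not updated on a given day, either arXiv published no new papers that day or the script failed. Any failure will be fixed the same day where possible.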

Contents

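Overview (2026-06-23)

1790 papers were updated today, including:

  • Natural language processing: 246 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 572 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 361 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 513 papers (Machine Learning (cs.LG))
  • Multiagent systems: 32 papers (Multiagent Systems (cs.MA))
  • Information retrieval: 39 papers (Information Retrieval (cs.IR))
  • Human-computer interaction: 46 papers (Human-Computer Interaction (cs.HC))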

Multiagent Systems

[MA-0] MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?

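[Quick Read]: This paper examines how effective system-prompt optimization is in multi-agent systems (MAS), that is, whether overall system performance can be improved by optimizing only each agent's system prompt, without any model fine-tuning. The core challenge is that prompt optimization in MAS faces an exponentially growing search space, and the success of existing methods in single-agent settings does not transfer directly to multi-agent settings. The key approach is a systematic evaluation of two prompt optimizers, each a natural extension of a state-of-the-art single-agent method, measuring their effect on MAS performance across setups that vary in task, workflow, communication protocol, and team size. The results show that prompt optimization can yield significant gains, but the size of the gain depends heavily on the system configuration, which exposes the boundary conditions of the approach and the open challenges that remain.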

Link: https://arxiv.org/abs/2606.23664
Authors: Juyang Bai, Laixi Shi
Affiliation: Johns Hopkins University
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Project page: this https URL ; Code: this https URL

Abstract:Multi-agent systems (MAS) offer a scalable path forward for agentic AI, comprising multiple LLM-based agents, each assigned a system prompt and a position within a workflow that governs inter-agent coordination and output aggregation. System prompts thus form a critical and accessible optimization surface: they specify agents’ roles and behaviors, enabling system-level improvements without model finetuning. Although prompt optimization has shown substantial potential for single LLMs, extending it to MAS poses distinct challenges, notably an exponentially growing search space. It remains unclear whether, when, and by how much prompt optimization improves MAS performance, and how sensitive such gains are to system configuration. In this work, we systematically study system-prompt optimization across a broad range of MAS setups varying in task, workflow, communication protocol, and team size, benchmarking two prompt optimizers that naturally extend state-of-the-art single-agent methods. The results reveal its potential to unlock significant gains while exposing open challenges, characterizing when and how much prompt optimization helps across diverse MAS settings.

[MA-1] Decentralized Autonomous Traffic Management through Corridor Networks

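[Quick Read]: This paper addresses the problem that, as autonomous aircraft are deployed at scale and traffic density rises, centralized traffic management does not scale well enough to coordinate large numbers of crewed and uncrewed aircraft. It proposes a decentralized traffic-flow management framework based on multi-agent reinforcement learning (MARL) that gives aircraft scalable flexibility in trajectory planning within Advanced Air Mobility (AAM) corridor networks. The key idea is that MARL policies trained in a single-corridor setting are applied zero-shot to multi-corridor networks with merges and splits. The policies rely only on locally coordinated entry, traversal, and exit behaviors, with no central coordination or retraining, and are evaluated on system-level metrics: conformance to corridor boundaries, completion rate, average speed, distance traveled, and inter-aircraft separation. Experiments show that the learned behaviors transfer well across traffic densities, network geometries, and heterogeneous vehicle performance, and that they collectively produce desirable traffic flows under a decentralized architecture.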

Link: https://arxiv.org/abs/2606.23585
Authors: Jasmine Jerry Aloor, Aadarsh Govada, Hamsa Balakrishnan
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Presented at the Second US-Europe Air Transportation Research and Development Symposium (ATRDS2026)

Abstract:As autonomous aircraft are introduced at scale and traffic density increases, centralized management becomes insufficient to coordinate the large numbers of crewed and uncrewed aircraft. Dedicated Advanced Air Mobility (AAM) corridors have therefore been proposed for organizing high-density autonomous traffic flows. The desire to scalably provide autonomous aircraft flexibility in trajectory planning motivates the development of decentralized approaches to traffic management in AAM corridors. In this work, we extend a multi-agent reinforcement learning (MARL) approach to address the challenge of decentralized traffic flow management in air corridor networks. We test policies trained in a single-corridor setting on increasingly complex multi-corridor networks with combinations of merges and splits in a zero-shot manner. Experimental results demonstrate that learned behaviors transfer well to scenarios with varying traffic density, network geometry, and heterogeneous vehicle performance, without needing centralized coordination or model retraining. We evaluate system-level performance in terms of conformance to corridor boundaries, completion rates, average speeds, distance traveled, and maintenance of inter-aircraft separation. We find that although our policies require only locally coordinated entry, traversal, and exit behaviors, they collectively produce desirable traffic flows through the corridor network.

[MA-2] Decomposing Financial Market Dynamics via Mechanism Analysis in an Evolutionary Multi-Agent Simulation

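[Quick Read]: This paper addresses the causal ambiguity that arises when several mechanisms are coupled in evolutionary agent-based market models (ABMs). Reproduction strategy, price formation, agent bias, and consensus propagation are usually fixed by convention, so it is hard to tell which mechanism drives which emergent property (market diversity, realism, fragility). The key to the solution is a pluggable, coevolving, endogenous-price simulator in which four core mechanisms (selection, microstructure, behavioral bias, and consensus network topology) are independently adjustable and are analyzed through single-mechanism sweeps with matched 3×20-seed interventions. The findings are: (1) a Quality-Diversity (QD/MAP-Elites) selection operator significantly raises strategy-mix entropy and strategy cycling compared with truncation top-k selection; (2) a per-agent realism reward does not raise 5-fact realism, so selection alone does not make the model more realistic; (3) enabling reflexive price feedback does raise realism, with significant effects in the crisis and bull regimes; (4) amplifying behavioral bias significantly raises a genomic fragility proxy while leaving realism unchanged. Consensus network topology shows no robust effect. The overall contribution is a decomposition: under single-mechanism perturbations, the mechanisms act as approximately independent control knobs over market diversity, realism, and fragility.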

Link: https://arxiv.org/abs/2606.23158
Authors: Zhibao Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: 6 pages, 2 figures

Abstract:Evolutionary agent-based markets (ABMs) couple several mechanisms – who reproduces, how price forms, how biased the agents are, how consensus propagates – yet these are usually fixed by convention, so it is unclear which mechanism controls which emergent property. In a coevolving, endogenous-price simulator with 120 heterogeneous behavioral agents, we make four mechanisms pluggable and run matched 3x20-seed interventions. We find the levers are largely separable. (1) Selection - diversity: a Quality-Diversity (QD/MAP-Elites) operator robustly raises strategy-mix entropy over truncation top-k (paired Delta entropy +0.27 to +1.12 bits; sign-test p<0.001; CIs exclude 0) and sustains more strategy cycling (strongest in crisis: Delta=+0.070, p=0.0004). (2) Selection does not improve realism: even a per-agent realism reward that provably steers selection does not raise 5-fact realism (Delta_5=-0.11,-0.08,+0.03; not significant). (3) Microstructure - realism: enabling reflexive price feedback does raise realism (Delta_5=+0.13,+0.20,+0.20; crisis/bull p<0.05, all CIs positive). (4) Behavior - fragility: amplifying behavioral bias raises a genomic fragility proxy (Delta=+10.5,+11.1,+14.4; bull p<0.001, all CIs positive) while leaving realism flat. The remaining mechanism – consensus network topology – shows no robust effect (honest null). The contribution is a decomposition: in these single-mechanism sweeps the mechanisms behave as approximately distinct control knobs over diversity, realism, and fragility.
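
The abstract reports diversity as strategy-mix entropy in bits. As a minimal sketch of what such a quantity could look like, assuming it is the Shannon entropy of the share of agents using each strategy (the paper's exact definition is not given here), the snippet below compares a concentrated and a diverse population of 120 agents. The strategy labels and population splits are hypothetical.

```python
import math
from collections import Counter


def strategy_mix_entropy_bits(strategies):
    """Shannon entropy (in bits) of the empirical strategy distribution."""
    counts = Counter(strategies)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical populations of 120 agents; the labels are made up for illustration.
concentrated = ["momentum"] * 90 + ["value"] * 30
diverse = ["momentum", "value", "contrarian", "noise"] * 30

h_concentrated = strategy_mix_entropy_bits(concentrated)
h_diverse = strategy_mix_entropy_bits(diverse)

# A paired entropy difference between two selection operators would be
# computed per seed in the same way.
delta_entropy = h_diverse - h_concentrated
print(f"{h_concentrated:.3f} bits vs {h_diverse:.3f} bits, delta = {delta_entropy:.3f}")
```

Four equally used strategies give the maximum of 2 bits, so a gain of about 1 bit corresponds roughly to doubling the effective number of strategies in use.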

[MA-3] RaMem: Contextual Reinstatement for Long-term Agentic Memory

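[Quick Read]: This paper addresses "context collapse" in LLM agents over long-term interactions, which is caused by memory compression. When past experiences are compressed into reusable memory fragments, memories from different situations can be confused because they involve the same entities or user states, and retrieval alone cannot guarantee that a memory is valid evidence for the current query. The key to the solution is a framework called Contextual Reinstatement for Agentic Memory (RaMem), which works in four coordinated stages: (i) evidence anchoring ties each memory to its original episodic conditions (event time, mention time, session span, participants); (ii) recall condition induction derives the evidence conditions implied by the current query; (iii) validity-aware retrieval uses those conditions to prioritize context-compatible memories while keeping content-relevant candidates as a fallback; (iv) context-preserved synthesis keeps the structured context of the selected memories available to the generator. Experiments on long-term memory benchmarks show that RaMem consistently outperforms strong memory baselines, with average F1 gains of more than 10% across several backbones.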

Link: https://arxiv.org/abs/2606.22844
Authors: Wei Yang, Bryce Kan, Shixuan Li, Li Li, Yuehan Qin, Jiate Li, Paul Bogdan, Jesse Thomason
Affiliation: University of Southern California
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Long-term memory has become increasingly important for LLM agents that operate across extended interactions and evolving task contexts. Recent memory systems have made past experiences more persistent, compact, and retrievable, but retrieval alone does not ensure that a memory provides valid evidence for the current query. When experiences are compressed into reusable fragments, memories from different situations may appear equally relevant if they involve recurring entities or user states. We refer to this failure as context collapse: memories lose the surrounding context needed to judge whether they provide valid evidence for the current query. To address this problem, we propose Contextual Reinstatement for Agentic Memory (RaMem), a framework that turns retrieved memory fragments into contextually verifiable evidence. RaMem operates through four coordinated stages: (i) evidence anchoring grounds each memory in its original episodic conditions, especially event time, mention time, session span, and participants; (ii) recall condition induction derives the evidence conditions implied by the query; (iii) validity-aware retrieval uses these conditions to prioritize context-compatible memories while retaining content-relevant candidates as fallback evidence; and (iv) context-preserved synthesis keeps the selected memories’ structured context available to the generator. Experiments on long-term memory benchmarks show that RaMem consistently improves performance over strong memory baselines, with average F1 gains of more than 10% across several backbones.

[MA-4] HERCULES: An Open-Source Simulation Framework for Heterogeneous Multi-Robot SLAM Collaborative Perception and Exploration

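[Quick Read]: This paper addresses the limitations of existing simulation frameworks for collaborative perception and navigation by heterogeneous multi-robot systems in large-scale, high-fidelity, dynamic environments. Earlier simulators fall short in supporting concurrent operation of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), a unified cross-platform navigation architecture, physical modeling of complex environments, and time synchronization across sensors. The key to the solution is HERCULES, an open-source simulation and data-collection pipeline based on UE5. It introduces a waypoint-tracking UGV controller that mirrors existing UAV control interfaces, which gives heterogeneous platforms a shared navigation stack (mapping, traversability analysis, planning, and control), and it adds physics-based long-wave infrared (LWIR) cameras and configurable night-vision modes for degraded visual environments. The system also provides lightweight APIs, ROS 2 wrappers, and rigorous time synchronization across sensors and platforms, and it combines high-fidelity dynamic phenomena (fire, flooding, crop disease spread) with intelligent agents (pedestrians, traffic, wildlife). It runs in two modes: offline trajectory replay and online closed-loop active planning. Experiments in heterogeneous multi-robot simultaneous localization and mapping (SLAM), collaborative perception, and exploration demonstrate its utility, and the source code, datasets, and benchmark are publicly released.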

Link: https://arxiv.org/abs/2606.22756
Authors: Sandilya Sai Garimella, Daniel Chase Butterfield, Sean Wilson, Lu Gan
Affiliation: Georgia Institute of Technology; Georgia Tech Research Institute
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 19 pages, 14 figures, and 12 tables

Abstract:We present HERCULES, an open-source simulator and data-collection pipeline for heterogeneous multi-robot autonomy. Built upon the Unreal Engine 5 (UE5)-based simulators AirSim and Cosys-AirSim, HERCULES resolves key architectural limitations of prior frameworks to enable concurrent unmanned aerial and ground vehicle (UAV-UGV) operation in large-scale, photorealistic, dynamic environments. It introduces a new waypoint-tracking UGV controller that mirrors existing UAV control interfaces, and provides a shared navigation stack for mapping, traversability analysis, planning, and control across heterogeneous platforms. Expanding inherited sensor suites, it adds physics-based long-wave infrared (LWIR) cameras and configurable night-vision modes for degraded visual environments. HERCULES provides lightweight APIs, ROS 2 wrappers, and rigorous time synchronization across sensors and platforms, and brings state-of-the-art game-engine capabilities into robotics simulation, integrating intelligent agents such as pedestrians, traffic, and wildlife with high-fidelity dynamic phenomena, including fire, flooding, and crop disease spread. HERCULES runs in two modes: passively, replaying offline-designed trajectories to generate reproducible multi-modal datasets, and actively, running an online planner in closed loop from live observations. Our experiments in heterogeneous multi-robot SLAM, collaborative perception, and exploration, using both HERCULES-generated data and active closed-loop execution, demonstrate its utility for advancing heterogeneous multi-robot autonomy. We publicly release our source code, experiment code, documentation, and datasets, including a heterogeneous multi-robot SLAM benchmark collected with two UAVs and two UGVs across kilometer-scale desert, forest, and city environments, at this https URL.

[MA-5] Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

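[Quick Read]: This paper asks whether the model and feature improvements found by a Closed-loop Auto Research framework for molecular property prediction generalize beyond the validation set used to select them. The core challenge is that automated machine learning (AutoML) usually selects configurations by validation signal, and gains optimized on validation performance may not carry over to an unseen test set, which risks overfitting the validation set. The key to the solution is a layered, attributable experimental design: a file-level ablation lock decouples the Auto Research action space into three independent axes (features, models, and external evidence) and attributes each gain to one axis across 36 molecular property prediction endpoints. The results show that some gains that look large on validation shrink sharply on the held-out test (the largest model-search gain falls from 0.041 to 0.003), a sign of non-transferable behavior, whereas curated external data admitted through a contamination filter yields clear held-out improvements (0.17 for CYP2C9-substrate prediction and 0.08 for half-life). A matched-trial AutoML control did not reproduce the agent's code-level model intervention (0.006 against 0.042), and the routed pipeline stays competitive with an 84M-parameter pretrained 3D model on the shared training split. The central lesson is that separating discovery from held-out certification applies to any closed-loop system that optimizes a proxy metric.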

Link: https://arxiv.org/abs/2606.22731
Authors: Jingjie Ning, Xiaochuan Li, Ji Zeng, Chenyan Xiong, Guolin Ke
Affiliation: Carnegie Mellon University; DP Technology
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Closed-loop Auto Research extends automated machine learning from fixed-dataset fitting to changing the research workflow, with language-model agents editing representations and model code and acquiring external evidence. Molecular property prediction spans many small endpoints. We ask whether this action space yields improvements generalizing beyond the validation signal selecting them. We isolate three Auto Research axes, features, models, and external evidence, under a file-level ablation lock attributing each gain to one axis over a strong baseline. Across 36 endpoints in three benchmark suites we score each selected configuration once on a held-out test whose labels the search never read. A routed pipeline taking each endpoint’s best validation axis reaches positive held-out gains of 0.013, 0.011, and 0.042, the transferable axis differing by suite, data on TDC, model on Polaris, feature and model on MoleculeNet. The largest model-search gain falls from 0.041 on validation to 0.003 on test, while curated data reaches 0.022 but negative 0.019 on test, two non-transfer signatures. Curated external data raises held-out CYP2C9-substrate performance by 0.17 and half-life by 0.08, admitted through a contamination filter rejecting same-source files overlapping 64 to 89 percent of test structures, necessary but not sufficient for transfer. A matched-trial automated machine learning control did not reproduce the agent’s code-level model intervention, reaching 0.006 against 0.042, and the pipeline stays competitive with an 84M-parameter pretrained 3D model on the shared training split. The experiments stay within molecular property prediction, but separating discovery from held-out certification is a domain-agnostic lesson for any closed-loop system optimising a proxy for a held-out quantity. 

[MA-6] GARIP: A Running-Average Moving Reference for Last-Iterate Self-Play in Two-Player Zero-Sum Games

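[Quick Read]: This paper addresses the failure of last-iterate convergence in two-player zero-sum games: under self-play with naive gradient ascent, the last iterate orbits the equilibrium instead of converging. The key to the solution is GARIP (Gradient Ascent with Running Average Reference and Projection), a regularization method that anchors to a running-average reference policy rather than to a fixed reference or a periodic snapshot. The central result is a mechanism: collapse tracks the peak lag of the reference policy. Among causal convex averages with the same mean lag, the running average has a flat lag profile whose peak equals the mean, whereas a snapshot reference has a sawtooth profile whose peak is twice the mean (a one-line theorem). Two consequences follow. Convergence: at constant anchor strength, GARIP achieves local last-iterate convergence, because the anchor scales the base map's rotation by $1-\beta$, crossing the stability boundary and turning a recurrent base into a contraction (global convergence is conjectured for small $\beta$, and a consensus failure is characterized for large $\beta$). Robustness: on matrix games, the Coin Game, and the board games Connect Four and Othello, GARIP matches R-NaD's peak performance, both moving references are far more robust than fixed-reference and reference-free baselines, and GARIP is the better hyperparameter default. Over the full grid the collapse rates are statistically indistinguishable, but at conventional parameterizations the matched-mean-lag running-average setting collapses in 0/40 seeds versus 10/40 for the snapshot, which matches it only when $K$ is deliberately shortened. The paper also finds that an anticipatory (negative-weight) reference does better still on the stale side, and that the advantage appears only where naive self-play cycles (five deep self-play loops). All experiments are implemented in pure JAX and are reproducible.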

Link: https://arxiv.org/abs/2606.22688
Authors: Can Savcı
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 14 pages, 9 figures

Abstract:Self-play with naive gradient ascent cycles in two-player zero-sum games: the last iterate orbits the equilibrium. Modern methods restore last-iterate convergence by regularizing toward a reference policy – MMD a fixed one (reaching only the regularized equilibrium), R-NaD a periodic snapshot (the engine of DeepNash). We study GARIP, which anchors to the running average, and isolate what the choice of reference controls. Our central result is a mechanism: collapse tracks the peak lag of the reference, and among causal convex averages of a fixed mean lag the running average (flat profile, peak = mean) uniquely minimizes that peak, while a snapshot’s sawtooth has peak = $2\times$ mean (a one-line theorem). Two consequences follow. Convergence: we prove local last-iterate convergence at constant anchor strength – the anchor scales the base map’s rotation by $1-\beta$, crossing the stability boundary and turning a recurrent base into a contraction (global convergence is conjectured at small $\beta$; we characterize a large-$\beta$ consensus failure). Robustness: GARIP matches R-NaD’s peak performance – on matrix games, the Coin Game, and the board games Connect Four/Othello, both moving references are far more robust than fixed-magnet and magnet-free baselines – but is the better hyperparameter default; we report it both ways: over the full grid collapse rates are statistically indistinguishable, yet at conventional parameterizations a matched-mean-lag setting collapses in 0/40 vs 10/40 seeds (a snapshot matches it only by knowing to shorten $K$). The boundaries: an anticipatory (negative-weight) reference does better still on the stale side, and the advantage appears only where naive self-play cycles (five deep self-play loops). All experiments are pure JAX and reproducible.
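
The stabilizing effect of a running-average anchor can be seen on a toy problem. The sketch below is not the paper's GARIP implementation: it assumes the bilinear game f(x, y) = x * y (x minimizes, y maximizes, equilibrium at the origin), simultaneous gradient steps as the base map, and an anchor of the assumed form "new iterate = (1 - beta) * base step + beta * running average". The step size, anchor strength, and function names are illustrative choices, and plain Python is used here rather than JAX to keep the example dependency-free.

```python
import math


def base_step(x, y, eta):
    """Simultaneous gradient step on f(x, y) = x * y (x descends, y ascends)."""
    return x - eta * y, y + eta * x


def run_naive(steps, eta, x0=1.0, y0=1.0):
    """Unanchored self-play; returns the distance of the last iterate from (0, 0)."""
    x, y = x0, y0
    for _ in range(steps):
        x, y = base_step(x, y, eta)
    return math.hypot(x, y)


def run_anchored(steps, eta, beta, x0=1.0, y0=1.0):
    """Self-play anchored to the running average of all past iterates."""
    x, y = x0, y0
    avg_x, avg_y = x0, y0  # running average over iterates 0..t
    for t in range(1, steps + 1):
        gx, gy = base_step(x, y, eta)
        # Assumed anchor form: convex mix of the base step and the reference.
        x = (1 - beta) * gx + beta * avg_x
        y = (1 - beta) * gy + beta * avg_y
        # Incremental update of the running average with the new iterate.
        avg_x += (x - avg_x) / (t + 1)
        avg_y += (y - avg_y) / (t + 1)
    return math.hypot(x, y)


naive_norm = run_naive(steps=2000, eta=0.2)
anchored_norm = run_anchored(steps=2000, eta=0.2, beta=0.1)
print(f"naive: {naive_norm:.3e}, anchored: {anchored_norm:.3e}")
```

In this discrete-time toy the unanchored iterate spirals outward, because each step multiplies the distance from equilibrium by sqrt(1 + eta^2). With the anchor, the base map is scaled by 1 - beta, which brings its spectral radius below 1 for these parameter values, and the last iterate shrinks toward the equilibrium.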

[MA-7] SHACR: A Graph-Augmented Semi-Autonomous Framework for Multi-Class Conflict Resolution in Smart Home IoT Automation

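[Quick Read]: This paper addresses hidden cross-rule conflicts in smart home automation that arise when user-defined rules run concurrently across heterogeneous IoT devices. The conflicts stem from shared devices, environmental variables, and physical topology, and they lead to unsafe, wasteful, or privacy-threatening behaviors that text-only analysis cannot see. Existing conflict detectors mostly catch either static syntactic conflicts or specific environment-mediated interactions, without unifying the two or offering actionable repairs for non-expert users. The paper proposes SHACR, whose core idea is to constrain LLM reasoning with a formal, directed knowledge graph: devices, capabilities, physical states, and Trigger-Condition-Action (TCA) rules are encoded as typed, traversable entities, and physical cause-effect relationships are modeled explicitly as graph edges, which unifies the detection of logical, semantic, and physical conflicts. The framework drives a closed-loop Scan-Explain-Repair-Validate workflow that uses the graph to bound the LLM's action space. On a testbed of 203 rules deployed across 70 apartments, adding the knowledge graph while holding the LLM fixed cuts classification errors by 36.7% and raises F1 from 0.59 to 0.79; few-shot calibration lifts F1 further to 0.95, whereas the same calibration barely helps a graph-free LLM. The paper concludes that structured knowledge representation matters more for dependable IoT automation management than prompt engineering or the underlying model architecture.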

Link: https://arxiv.org/abs/2606.22312
Authors: Leena Marghalani, Walid Aljoby, Suayb S. Arslan
Affiliation: King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia; Boğaziçi University, Istanbul, Turkey; Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
Categories: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments:

Abstract:Smart home automation increasingly relies on user-defined rules across heterogeneous IoT devices. While these rules appear harmless in isolation, their concurrent execution creates hidden, cross-rule interactions via shared devices, environmental variables, and physical topology. These interactions result in unsafe, wasteful, or privacy-threatening behaviors that are completely invisible to text-only analysis. Existing conflict detectors remain siloed, catching either static syntactic conflicts or specific environment-mediated interactions without unifying the two or providing actionable repairs for non-expert users. This paper presents SHACR, a smart home conflict resolution framework that anchors Large Language Model (LLM) unpredictability by grounding its reasoning in a formal, directed knowledge graph. SHACR encodes devices, capabilities, physical states, and Trigger-Condition-Action rules as typed, traversable entities. By elevating physical cause-effect relationships to first-class graph edges, SHACR transforms conflict detection from fragile text inference into deterministic multi-hop graph traversal, unifying logical, semantic, and physical conflict classes. It drives a closed-loop Scan-Explain-Repair-Validate workflow that uses the graph to bound the LLM’s action space. We evaluated SHACR on a testbed of 203 rules deployed across 70 apartments within a smart building. By holding the underlying LLM fixed and introducing SHACR’s knowledge graph, classification errors drop by 36.7%, F1 rises from 0.59 to 0.79, and few-shot calibration further lifts F1 to 0.95, whereas the same calibration barely helps a graph-free LLM. Ultimately, this work challenges the current AI paradigm, establishing that structured knowledge representation is a far more critical factor for dependable IoT automation management than prompt engineering or underlying model architecture. 
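
To make the idea of detecting physical conflicts by multi-hop graph traversal concrete, here is a minimal sketch. It is not SHACR's schema or algorithm: the node names, the edge types (`actuates`, `increases`, `decreases`), and the helper functions are hypothetical. It only illustrates how two rules that look unrelated in text can be flagged once each is followed through rule -> device -> physical state edges.

```python
from collections import defaultdict

# Typed, directed edges: (source, relation, target). All names are hypothetical.
EDGES = [
    ("rule:night_heat", "actuates", "device:heater"),
    ("rule:fresh_air", "actuates", "device:window"),
    ("rule:movie_mode", "actuates", "device:blinds"),
    ("device:heater", "increases", "state:temperature"),
    ("device:window", "decreases", "state:temperature"),
    ("device:blinds", "decreases", "state:illuminance"),
]


def build_graph(edges):
    """Adjacency list mapping each node to its outgoing (relation, target) pairs."""
    graph = defaultdict(list)
    for src, rel, dst in edges:
        graph[src].append((rel, dst))
    return graph


def physical_effects(graph, rule):
    """Follow rule -> device -> state edges and collect (state, direction) pairs."""
    effects = set()
    for rel, device in graph[rule]:
        if rel != "actuates":
            continue
        for direction, state in graph[device]:
            if direction in ("increases", "decreases"):
                effects.add((state, direction))
    return effects


def find_physical_conflicts(graph, rules):
    """Report rule pairs that push the same physical state in opposite directions."""
    conflicts = []
    for i, rule_a in enumerate(rules):
        for rule_b in rules[i + 1:]:
            effects_b = physical_effects(graph, rule_b)
            for state, direction in physical_effects(graph, rule_a):
                opposite = "decreases" if direction == "increases" else "increases"
                if (state, opposite) in effects_b:
                    conflicts.append((rule_a, rule_b, state))
    return conflicts


graph = build_graph(EDGES)
rules = ["rule:night_heat", "rule:fresh_air", "rule:movie_mode"]
conflicts = find_physical_conflicts(graph, rules)
print(conflicts)
```

The heater rule and the window rule never mention each other, yet the traversal links them through the shared temperature state. In a design along these lines, a list of this kind could then be handed to an LLM for explanation and repair, so that the model works from graph-derived facts rather than inferring conflicts from rule text.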

[MA-8] Revelio: Cost-Efficient Agentic Memory Safety Vulnerability Detection For Repository-Scale Codebases

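[Quick Read]: This paper addresses two obstacles to LLM-based memory-safety vulnerability detection: unreliability caused by hallucination, and the difficulty of scaling to repository-size codebases. The core solution is Revelio, a cost-efficient end-to-end agentic framework that generates an executable Proof-of-Vulnerability and checks it with a deterministic sanitizer, which addresses hallucination. It uses inexpensive LLMs together with lightweight static analysis to generate and rank vulnerability hypotheses, and it reports a vulnerability only when it can be reproduced and confirmed by the sanitizer, which keeps results trustworthy while reducing cost. Revelio was evaluated on seven production-quality projects that had been continuously fuzzed for years and on 100 Arvo projects from the CyberGym benchmark, and it discovered 19 previously unknown memory-safety vulnerabilities. On benchmarks it outperformed frontier coding agents across diverse backbone models at comparable token costs.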

Link: https://arxiv.org/abs/2606.22263
Authors: Yiwei Hou, Hao Wang, Muxi Lyu, Marius Momeu, Eric Nguyen, Taige Yang, Koushik Sen, Dawn Song, David Wagner
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:

Abstract:Memory safety vulnerabilities remain a significant threat even for projects with extensive fuzzing and manual auditing. Recent results suggest that large language models hold great promise for detecting such vulnerabilities, but they are unreliable, at risk of hallucination, and challenging to scale to repository-size codebases. This paper presents Revelio, a cost-efficient end-to-end agentic framework for memory-safety vulnerability discovery. Revelio addresses the problem of hallucination by generating an executable Proof-of-Vulnerability, which is checked with a deterministic sanitizer. It reduces cost using inexpensive LLMs and lightweight static analysis to help generate and rank vulnerability hypotheses, reporting vulnerabilities only when they can be reproduced and confirmed by a sanitizer. We evaluated Revelio on seven production-quality projects that had been continuously fuzzed for five to eight years, as well as on 100 randomly selected Arvo projects from the CyberGym benchmark. With around one hour per project and a total cost of 300, Revelio discovered 19 previously unknown memory-safety vulnerabilities. On benchmarks, Revelio outperformed frontier coding agents across diverse backbone models at comparable token costs. Our results suggest that Revelio enables scalable and trustworthy end-to-end LLM-based memory-safety vulnerability detection.

[MA-9] When Is Emergent Consensus Real? A Measured Coupling Gain and a Validity Diagnostic for LLM Agent Societies

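[Quick Read]: This paper addresses three gaps in current research on LLM "agent societies": there is no measurable control parameter, no way to predict when consensus or polarization appears, and no way to tell a genuine social dynamic from a model artifact. The core of the solution is the coupling gain (gamma), measured per agent by counterfactually perturbing a neighbour's stated opinion. The findings are: (1) gamma is stable and model-distinguishing, spanning 0.15-0.43 across five frontier models, and it is paraphrase-invariant; social-neighbour gamma roughly equals numeric-anchor gamma, so gamma reflects evidence-coupling rather than something uniquely social; (2) classical dynamics with measured coefficients organize the regimes (Friedkin-Johnsen for consensus/pluralism, signed-Laplacian/structural balance for polarization); (3) frontier LLMs do not spontaneously backfire (beta = 0), so default societies do not self-polarize and polarization is always induced; (4) a randomized-initial-condition diagnostic, the (slope, bias) of final versus initial opinion, separates genuine averaging from model-prior artifacts such as the prior artifact seen on settled facts; (5) coupling is context-dependent: pairwise gamma does not predict multi-neighbour outcomes and can even order them backwards, whereas a modality-matched group coupling does predict them (Pearson r = -0.70, permutation p = 0.008). The regime laws should therefore take the matched coupling in the target interaction rather than the single-neighbour gamma. The paper contributes a measurement protocol and a validity instrument, not new theory.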

Link: https://arxiv.org/abs/2606.22203
Authors: Dongxu Yang
Affiliation: DeepLethe
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: 13 pages (incl. appendix with proofs), 7 figures. Code and per-run logs released

Abstract:LLM “agent societies” are studied via demonstrations of emergent consensus or polarization – with no measurable control parameter, no theory of when each regime appears, and no test of whether an outcome is a genuine social dynamic or a model artifact. We introduce the coupling gain gamma, measured per-agent by counterfactually perturbing a neighbour’s stated opinion. (i) gamma is stable and model-distinguishing – across five frontier models it spans 0.15-0.43 (n=20, 95% CIs = 0.025), paraphrase-invariant; social-neighbour gamma roughly equals numeric-anchor gamma, so gamma is evidence-coupling, not uniquely social. (ii) Classical dynamics with measured (not assumed) coefficients organise the regime: Friedkin-Johnsen for consensus/pluralism, signed-Laplacian/structural-balance for polarization. (iii) Frontier LLMs do not spontaneously backfire (beta = 0), so default societies do not self-polarize – polarization is always induced; the beta<0 branch arises only in the FJ surrogate, never in the agents. (iv) A randomized-initial-condition diagnostic – the (slope, bias) of final vs. initial opinion – separates genuine averaging from model-prior artifacts (boundary-censoring ruled out by construction via interior-valued facts); applied to a published “emergent consensus” result (Chuang et al. 2023) it reveals a model-specific conflation: averaging on debatable claims, prior-artifact on settled facts. (v) Coupling is context-dependent: pairwise gamma does not predict multi-neighbour outcomes – it can order them backwards – whereas a modality-matched group coupling does (sixteen closed+open models, Pearson r=-0.70, permutation p=0.008). The regime laws take this matched coupling, not the single-neighbour gamma: emergent consensus must be read from coupling in the target interaction. We contribute a measurement protocol and a validity instrument, not new theory.
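
The abstract names Friedkin-Johnsen (FJ) dynamics and a (slope, bias) diagnostic of final versus initial opinion. The sketch below shows the textbook FJ update with a single susceptibility parameter and a least-squares slope/bias fit. Treating the measured coupling gain gamma as the FJ susceptibility is an assumption made here for illustration, as are the three-agent network and the function names; the paper's actual protocol uses randomized initial conditions across runs.

```python
def friedkin_johnsen(initial, weights, gamma, steps=200):
    """Iterate x_i <- gamma * sum_j w_ij * x_j + (1 - gamma) * x_i(0)."""
    x = list(initial)
    n = len(x)
    for _ in range(steps):
        x = [
            gamma * sum(weights[i][j] * x[j] for j in range(n))
            + (1 - gamma) * initial[i]
            for i in range(n)
        ]
    return x


def slope_and_bias(initial, final):
    """Least-squares fit of final = slope * initial + bias."""
    n = len(initial)
    mean_x = sum(initial) / n
    mean_y = sum(final) / n
    var_x = sum((v - mean_x) ** 2 for v in initial)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(initial, final))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Three agents; each one averages the other two (row-stochastic weights).
initial = [0.0, 0.5, 1.0]
weights = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
final = friedkin_johnsen(initial, weights, gamma=0.3)
slope, bias = slope_and_bias(initial, final)
print(final, slope, bias)
```

In this toy, a slope between 0 and 1 means opinions moved toward each other while still depending on where they started, which is what genuine averaging looks like. A slope near 0 with a bias that does not depend on the initial opinions would instead suggest that every agent is being pulled to the same model prior.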

[MA-10] Cohort Organized Learning: Clustering Through Agreement

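[Quick Read]: This paper addresses the limitations of traditional clustering methods that depend on explicit distance or similarity computations, particularly their limited efficiency and flexibility on high-dimensional, unstructured data. The core solution is Cohort Organized Learning (CoOL), a neural-network-based clustering framework that clusters data without computing distances or similarities between samples directly. CoOL uses expectation maximization (EM) to derive the gradients for training the networks, monitors convergence during training, and evaluates the resulting clusters after training. Because the clusters are estimated by neural networks, the method applies to any data that can be made compatible with them, and it is illustrated on vector data and images.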

Link: https://arxiv.org/abs/2606.21743
Authors: Finn Henry O’Shea, Maria Elena Monzani
Affiliation: SLAC National Accelerator Laboratory; Kavli Institute for Particle Astrophysics and Cosmology, Stanford University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 20 pages with 14 figures and 4 tables

Abstract:In this article we describe Cohort Organized Learning (CoOL), a method for clustering data without explicit distance or similarity computations. Herein, we will describe CoOL, derive the gradients determined by expectation maximization to train the networks, show how to monitor convergence during training and evaluate the clusters after training, and discuss a series of examples and use cases. We also discuss CoOL’s limitations and future prospects on related tasks. Because CoOL uses neural networks to estimate the clusters, it can be used to cluster any data that can be made compatible and we illustrate this on vector data and images.

[MA-11] Hallucination as Context Drift: Synchronization Protocols for Multi-Agent LLM Systems

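[Quick Read]: This paper addresses hallucination caused by context drift in multi-agent LLM systems. Hallucination is usually attributed to model deficiencies, but the paper argues that a significant class of hallucinations comes from context drift, the divergence of knowledge states about the shared world between concurrent agents. When agents start a collaborative task with mismatched or stale views of the shared world state, their joint reasoning produces contradictions that surface as hallucinated output. The paper proposes two things: the Context Divergence Score (CDS), which quantifies the knowledge-state discrepancy between agents across spatial, temporal, and task dimensions, and the Shared State Verification Protocol (SSVP), in which agents periodically exchange compressed state summaries and flag high-divergence conditions before joint reasoning. In the experiments, naive full-broadcast synchronization raises the hallucination rate by 34% over the no-sync baseline (HR: 0.658 vs. 0.492, p=0.0022) because it propagates erroneous agent states. SSVP avoids this contamination effect and shows a modest, consistent reduction (HR: 0.463, d=0.30) while using 58% fewer API calls than full-broadcast synchronization. In the software project planning domain all conditions converge to a low hallucination rate (HR: 0.2), which indicates that the contamination effect is specific to tasks where one erroneous shared belief cascades across evaluation dimensions. The paper reframes hallucination mitigation as a distributed systems problem and treats context synchronization as a first-class primitive in multi-agent LLM design.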

Link: https://arxiv.org/abs/2606.21666
Authors: Carson Rodrigues
Affiliation: Celabe; Anthropic
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 11 pages, 1 figure

Abstract:Multi-agent LLM systems routinely produce hallucinated outputs that cannot be explained by model deficiencies alone. A significant class of these failures arises not from model incapacity but from context drift: the divergence of internal knowledge states between concurrent agents. When agents enter a collaborative task with mismatched or stale representations of shared world state, their joint reasoning produces contradictions that manifest as hallucination. We define the Context Divergence Score (CDS), a lightweight scalar metric quantifying knowledge-state discrepancy between agent pairs across spatial, temporal, and task dimensions, and propose the Shared State Verification Protocol (SSVP), which lets agents periodically exchange compressed state summaries and flag high-divergence conditions before joint reasoning. We evaluate SSVP across two domains (multi-agent travel and software project planning) using Claude Haiku. In controlled experiments (n=30 per condition, travel; n=10, software) across 8 scenarios, naive full-broadcast synchronization increases hallucination rate by 34% above the no-sync baseline (HR: 0.658 vs. 0.492, p=0.0022, d=1.18), a contamination effect from propagating erroneous agent states. SSVP avoids this failure mode while showing modest, consistent reduction (HR: 0.463, d=0.30) and achieves significantly lower hallucination than full-broadcast (p=0.0005, d=1.47) using 58% fewer API calls. The contamination effect does not replicate in the software domain, where all conditions converge to low HR (0.2), confirming it is specific to tasks where one erroneous shared belief cascades across evaluation dimensions. Our results reframe hallucination mitigation as a distributed systems problem and establish context synchronization as a first-class primitive in multi-agent LLM design.
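
The 34% figure is a relative increase, not a difference in percentage points. A quick check using only the hallucination rates quoted in the abstract (the variable names are chosen here for illustration):

```python
# Hallucination rates (HR) reported in the abstract for the travel domain.
hr_no_sync = 0.492
hr_full_broadcast = 0.658
hr_ssvp = 0.463

# Relative change of full-broadcast synchronization versus the no-sync baseline.
relative_increase = (hr_full_broadcast - hr_no_sync) / hr_no_sync

# Relative reductions achieved by SSVP against each comparison condition.
ssvp_vs_baseline = (hr_no_sync - hr_ssvp) / hr_no_sync
ssvp_vs_broadcast = (hr_full_broadcast - hr_ssvp) / hr_full_broadcast

print(f"full-broadcast vs no-sync: +{relative_increase:.1%}")
print(f"SSVP vs no-sync: -{ssvp_vs_baseline:.1%}")
print(f"SSVP vs full-broadcast: -{ssvp_vs_broadcast:.1%}")
```

In absolute terms the full-broadcast condition is about 17 percentage points worse than no synchronization, and SSVP's improvement over the baseline is about 3 percentage points, which matches the abstract's description of a modest reduction.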

[MA-12] Monitoring Diameters of Causal Communication Graph with Spatio-Temporal Logic

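[Quick Read]: This paper addresses the challenge of checking precise topological properties when verifying the dynamic behavior of multi-agent systems whose agents move through space in continuous time, in particular distance bounds on communication chains, reachability, and tracking the length or cost of communication paths. The existing μTGL logic can express entangled space-time properties, but it lacks the expressiveness to capture reachability within specific distance bounds or the cost of communication chains, which limits its use for decentralized monitoring and graph-theoretic analysis of distributed protocols. The paper extends μTGL with a new "space horizon" operator that bounds the distance of communication chains and thereby increases the logic's expressiveness. The extension can encode modalities from other logics, such as reachability and escaping, that plain μTGL cannot express, and it allows a deeper entanglement of spatial and temporal properties. The paper also gives a centralized offline monitoring algorithm and illustrates it on simulations of Consensus-Based Bundle Algorithms, which are distributed task-allocation protocols.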

Link: https://arxiv.org/abs/2606.21558
Authors: Lydia Bakiri, Jérémy Dubut, Sergio Mover
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Logic in Computer Science (cs.LO)
Comments:

点击查看摘要

Abstract:Verification of multi-agent systems requires checking fine-grained topological properties when agents move through space in continuous time. This demands a logic expressive enough to capture these dynamics. The logic muTGL has interesting properties for expressing entangled space-time properties. However, it lacks the expressivity needed to analyse reachability within specific distance bounds, or to track the length or the cost of communication chains. These capabilities are fundamental for decentralized monitoring and for graph-theoretic analysis of distributed protocols, where algorithmic complexity often relates to the diameter of the system's communication graph. We therefore introduce an extension of muTGL with a new operator called the space horizon. This addition allows us to bound the distance of communication chains, hence enhancing the logic's expressiveness. We show that this operator allows one to encode modalities from other logics, such as reachability or escaping, which were not available in vanilla muTGL, while allowing a deeper entanglement of spatial and temporal properties. We provide a centralized offline monitoring algorithm for this logic and illustrate it on several examples from simulations of Consensus-Based Bundle Algorithms, which are distributed protocols for task allocation.
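
To make the graph quantity in the title concrete, here is a minimal Python sketch of a communication-graph diameter and a bounded-length reachability check at a single time instant. It is not the muTGL semantics or the authors' monitoring algorithm. The names `build_graph`, `comm_range`, and `max_hops` are hypothetical, and measuring chain length in hops (rather than metric distance) is an assumption made for illustration.

```python
from collections import deque
from math import dist


def build_graph(positions, comm_range):
    """Link every pair of agents whose Euclidean distance is within comm_range."""
    names = list(positions)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist(positions[a], positions[b]) <= comm_range:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable agent."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def diameter(graph):
    """Largest hop distance between any two agents, or None if disconnected."""
    worst = 0
    for source in graph:
        hops = hop_distances(graph, source)
        if len(hops) < len(graph):
            return None
        worst = max(worst, max(hops.values()))
    return worst


def reachable_within(graph, source, target, max_hops):
    """True if a chain of at most max_hops links joins source and target."""
    return hop_distances(graph, source).get(target, max_hops + 1) <= max_hops


# Hypothetical snapshot: four agents on a line, one unit apart.
positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (3.0, 0.0)}
graph = build_graph(positions, comm_range=1.5)
```

A monitor over a trajectory would repeat such checks as positions change over time; the sketch deliberately leaves out the temporal part.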

[MA-13] Simultaneously Efficient Allocation of Indivisible Items Across Multiple Dimensions

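[Quick Read]: This paper studies efficiency guarantees for allocating indivisible items with multiple attributes, that is, the theoretical limits of achieving efficient allocations in several evaluation dimensions at once. Conventional approaches optimize a single aggregate objective, which can hide severe losses in some dimensions, especially when items contribute differently across dimensions. The paper introduces the multidimensional efficient allocation (MDEA) model, in which each agent has an additive valuation in each dimension, and studies simultaneous efficiency under utilitarian social welfare (USW) and egalitarian social welfare (ESW). For exact efficiency, maximizing the number of dimensions that attain the USW optimum admits only a $c/\ell$-approximation, where $\ell$ is the number of dimensions, and this dependence on $\ell$ is essentially unavoidable; for ESW, deciding whether two dimensions can be optimal simultaneously is already NP-hard, even with binary valuations. For approximate simultaneous efficiency, the paper identifies a tight threshold of $\Theta(1/\ell)$: for both USW and ESW, a guarantee of order $1/\ell$ that holds in every dimension always exists, while any asymptotically better dependence on $\ell$ is impossible, even with binary valuations. Finally, the paper introduces three natural multidimensional Pareto notions and characterizes their relationships and computational complexity, giving a theoretical framework for multidimensional allocation.

Link: https://arxiv.org/abs/2606.21346
Authors: Yasushi Kawase,Bodhayan Roy,Mohammad Azharuddin Sanpui
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: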

Abstract:Many allocation problems are intrinsically multidimensional, since an item may contribute differently to several criteria, and optimizing a single aggregate objective can hide severe losses in other dimensions. We study how much efficiency can be guaranteed simultaneously when indivisible items have multiple attributes. To this end, we introduce the multidimensional efficient allocation (MDEA) model, where each agent has an additive valuation in each dimension, and investigate simultaneous efficiency under utilitarian social welfare (USW) and egalitarian social welfare (ESW). Our results reveal a sharp worst-case frontier. For exact efficiency, maximizing the number of dimensions attaining the USW optimum admits a $c/\ell$-approximation for every fixed constant $c$, and this dependence on the number $\ell$ of dimensions is essentially unavoidable; for ESW, even deciding whether two dimensions can be optimized simultaneously is NP-hard with binary valuations. For approximate simultaneous efficiency in every dimension, we identify a tight threshold of order $1/\ell$, showing that such guarantees always exist for both USW and ESW, while any asymptotically better dependence on $\ell$ is impossible, even for binary valuations. Finally, we introduce three natural multidimensional Pareto notions and characterize both their relationships and their computational complexity.
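
The $1/\ell$ threshold can be seen on a tiny instance. The brute-force Python sketch below enumerates every allocation and reports the best factor that a single allocation achieves in all dimensions at once. It is an illustrative check, not the paper's algorithm. The layout `valuations[dim][agent][item]`, the function names, and the example instance are assumptions.

```python
from itertools import product


def usw(allocation, valuations, dim):
    """Utilitarian welfare in one dimension: each item's value to its owner, summed."""
    return sum(valuations[dim][agent][item] for item, agent in enumerate(allocation))


def esw(allocation, valuations, dim):
    """Egalitarian welfare in one dimension: the worst-off agent's bundle value."""
    totals = [0] * len(valuations[dim])
    for item, agent in enumerate(allocation):
        totals[agent] += valuations[dim][agent][item]
    return min(totals)


def best_simultaneous_ratio(valuations, welfare):
    """Largest alpha such that one allocation reaches alpha times the optimum
    in every dimension (dimensions whose optimum is 0 are skipped)."""
    n_dims = len(valuations)
    n_agents = len(valuations[0])
    n_items = len(valuations[0][0])
    allocations = list(product(range(n_agents), repeat=n_items))
    optimum = [max(welfare(a, valuations, d) for a in allocations)
               for d in range(n_dims)]

    def ratio(allocation):
        return min((welfare(allocation, valuations, d) / optimum[d]
                    for d in range(n_dims) if optimum[d] > 0), default=1.0)

    best = max(allocations, key=ratio)
    return ratio(best), best


# Hypothetical binary instance with 2 dimensions, 2 agents, 2 items:
# only agent 0 values items in dimension 0, only agent 1 in dimension 1.
valuations = [
    [[1, 1], [0, 0]],  # dimension 0
    [[0, 0], [1, 1]],  # dimension 1
]
alpha, allocation = best_simultaneous_ratio(valuations, usw)
```

On this instance no allocation does better than half the USW optimum in both dimensions, which matches $1/\ell$ for $\ell = 2$. Enumeration is exponential in the number of items, so the sketch is only usable on toy inputs.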

[MA-14] Negative Knowledge as Failure-aware Shared Memory for AutoResearch

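[Quick Read]: This paper addresses the problem that the many failed attempts produced by AI-assisted research are rarely accumulated or reused. Current systems lack a structured way to store and share failure experience, which leads to repeated trial and error and wasted resources. The core solution is a negative knowledge memory layer. A dedicated curator agent converts each failed attempt into a standardized, bounded, and typed record stored in a shared bank. Before designing a new experiment, a downstream research agent consults these records and explicitly adopts or rejects them, which avoids repeating known mistakes. The method is evaluated in two settings: same-task retry on ScienceAgentBench, and cross-task scientific research on nonlinear math-physics partial differential equation (PDE) problems. Systems with the layer outperform the vanilla AutoResearch baselines while using fewer tokens, and they solve new tasks that every baseline fails on. The bank also transfers: a previously built bank improves AutoResearch on different PDE problems. The study argues that structured negative knowledge is a scientific knowledge asset on par with positive findings. Rather than serving only as memory compression or a debugging aid, it should be maintained systematically as infrastructure for collective scientific memory in AI-engaged research.

Link: https://arxiv.org/abs/2606.21024
Authors: Hanchun Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 13 pages, 4 figures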

Abstract:AI-assisted research systems generate many failed attempts, but those failures rarely become a durable, shared knowledge asset. We propose a negative knowledge memory layer: a curator agent converts each failed attempt into a bounded, typed record in a shared bank, and a downstream research agent explicitly adopts or rejects those records before proposing its next experiment. We evaluate this layer in two settings: same-task retry on ScienceAgentBench and cross-task scientific research on two nonlinear math-physics PDE problems. The negative knowledge layer outperforms vanilla AutoResearch baselines while using fewer tokens; agents with the negative knowledge bank solve new tasks that all baselines fail to solve in PDE systems research. We also show that the previous negative knowledge bank can transfer and enhance AutoResearch on different PDE problems. These results suggest that structured negative knowledge is a knowledge asset that should be explicitly maintained in broader AI-engaged scientific research beyond a memory-compression or debugging aid, alongside positive findings, as a collective infrastructure for scientific memory. Code is available at this https URL.
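
The curate-then-review loop described in the abstract can be sketched as a small data structure. The Python below is a hypothetical illustration, not the authors' code: the record fields, the `FailureType` values, the 200-character bound, and the predicate-based `review` step are all assumptions (in the paper, both curation and adoption are performed by LLM agents).

```python
from dataclasses import dataclass, field
from enum import Enum

MAX_SUMMARY_CHARS = 200  # assumed bound on a record's free-text summary


class FailureType(Enum):
    DIVERGENCE = "divergence"
    RUNTIME_ERROR = "runtime_error"
    INVALID_RESULT = "invalid_result"


@dataclass(frozen=True)
class NegativeRecord:
    task: str
    failure_type: FailureType
    approach: str
    summary: str

    def __post_init__(self):
        if len(self.summary) > MAX_SUMMARY_CHARS:
            raise ValueError("summary exceeds the record bound")


@dataclass
class NegativeKnowledgeBank:
    records: list = field(default_factory=list)

    def curate(self, task, failure_type, approach, raw_log):
        """Curator step: compress a failed attempt into a bounded, typed record."""
        record = NegativeRecord(task, failure_type, approach,
                                raw_log[:MAX_SUMMARY_CHARS])
        self.records.append(record)
        return record

    def review(self, is_relevant):
        """Research-agent step: explicitly split records into adopted and rejected."""
        adopted, rejected = [], []
        for record in self.records:
            (adopted if is_relevant(record) else rejected).append(record)
        return adopted, rejected


def next_experiment(candidates, adopted):
    """Pick the first candidate approach not ruled out by an adopted record."""
    ruled_out = {record.approach for record in adopted}
    for approach in candidates:
        if approach not in ruled_out:
            return approach
    return None


bank = NegativeKnowledgeBank()
bank.curate("burgers_pde", FailureType.DIVERGENCE, "explicit_euler",
            "solution blew up after a few steps " * 20)
adopted, rejected = bank.review(lambda r: r.failure_type is FailureType.DIVERGENCE)
choice = next_experiment(["explicit_euler", "implicit_euler"], adopted)
```

Because `review` does not filter by task, a record from one PDE problem can be adopted for another, which is one simple way to model the cross-task transfer the abstract reports.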

[MA-15] Heterogeneous Policy Networks for Composite Robot Team Communication and Coordination

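[Quick Read]: This paper addresses the difficulty of learning coordination and communication strategies in heterogeneous multi-agent systems. The core question is how to model heterogeneity among agents with different state, action, and observation spaces so that joint coordination and communication become efficient. Existing multi-agent reinforcement learning (MARL) methods built on homogeneous graph networks do not account for agent heterogeneity, so communication becomes less useful and can even hurt team performance. The solution extends Heterogeneous Policy Networks (HetNet): building on heterogeneous graph-attention networks, it learns diverse heterogeneous collaborative policies and supports end-to-end training of highly efficient binarized messaging. Empirically, the method improves on the next-best baseline by 5.84% to 707.65% across multiple domains while reducing the required communication bandwidth by a factor of 200, which makes coordination of heterogeneous robot teams at scale more feasible.

Link: https://arxiv.org/abs/2606.20962
Authors: Esmaeil Seraj,Rohan Paleja,Luis Pimentel,Kin Man Lee,Zheyuan Wang,Daniel Martin,Matthew Sklar,John Zhang,Zahi Kakish,Matthew Gombolay
Affiliations: Georgia Institute of Technology; Carnegie Mellon University; Sandia National Laboratories
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: IEEE Transactions on Robotics (T-RO)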

Abstract:High-performing human-human teams learn intelligent and efficient communication and coordination strategies to maximize their joint utility. These teams implicitly understand the different roles of heterogeneous team members and adapt their communication protocols accordingly. Multi-Agent Reinforcement Learning (MARL) has attempted to develop computational methods for synthesizing such joint coordination-communication strategies, but emulating heterogeneous communication patterns across agents with different state, action, and observation spaces has remained a challenge. Without properly modeling agent heterogeneity, as in prior MARL work that leverages homogeneous graph networks, communication becomes less helpful and can even deteriorate the team's performance. In the past, we proposed Heterogeneous Policy Networks (HetNet) to learn efficient and diverse communication models for coordinating cooperative heterogeneous teams. In this work, we extend HetNet to support scaling heterogeneous robot teams. Building on heterogeneous graph-attention networks, we show that HetNet not only facilitates learning heterogeneous collaborative policies but also enables end-to-end training for learning highly efficient binarized messaging. Our empirical evaluation shows that HetNet sets a new state of the art in learning coordination and communication strategies for heterogeneous multi-agent teams by achieving a 5.84% to 707.65% performance improvement over the next-best baseline across multiple domains while simultaneously achieving a 200x reduction in the required communication bandwidth.

[MA-16] Artificial collectives of specialists and generalists excel at different tasks

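[Quick Read]: This paper addresses the tension between resource efficiency and collective performance in multi-agent systems: without a descriptive scientific understanding of artificial collectives, it is hard to design efficient and adaptable multi-agent systems. Through systematic experiments, the paper characterizes how agents' interpretive abilities, rationality bounds, and task qualities interact. On tasks that involve generating, choosing, and coordinating, collectives of generalists with broad interpretive abilities perform better; on tasks that involve negotiating, collectives of specialists connected by a few generalist mediators perform better. Rationality bounds moderate these effects. At loose bounds, specialists win through more effective sampling of high-dimensional decision spaces. At tight bounds, generalists win through better gradient estimation. At moderate bounds, a fundamental trade-off between performance and convergence speed emerges. The study suggests that multi-agent design should match interpretive networks to both task demands and agents' computational limits, with implications for overall efficiency and energy cost.

Link: https://arxiv.org/abs/2606.20877
Authors: John Meluso,Laurent Hébert-Dufresne,Christoph Riedl,H. Oliver Gao
Affiliations: Cornell University; University of Vermont; Santa Fe Institute; Northeastern University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Adaptation and Self-Organizing Systems (nlin.AO); Physics and Society (physics.soc-ph)
Comments: 10 pages, 4 figures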

Abstract:Collective artificial intelligence, where multiple agents work on shared tasks, holds potential to solve expansive problems in fields from medicine to collective governance. But while prescriptive engineering solutions abound, we lack descriptive scientific understanding of artificial collectives, and therefore principles for how to design resource efficient multi-agent systems. Through systematic experiments with optimizing agents, we characterize how agent interpretive abilities, rationality bounds, and task qualities interact to shape collective performance. Agents range from specialists, with narrow interpretive abilities, to generalists, with broad ones. Collectives of specialists correspond to sparse, centralized networks, while collectives of generalists correspond to dense, decentralized ones. We show that interpretive network properties have small performance effects on average (0.07 standard deviations of performance). However, for specific task qualities, these effects are 4.5 times larger (0.33 sd) and can reach much higher for certain task qualities (1.84 sd). This leads collectives of generalists to perform better on tasks that involve generating, choosing, and coordinating, while collectives of specialists with a few generalist mediators perform better on tasks that involve negotiating. Rationality bounds then moderate these relationships. At loose bounds, specialists outperform generalists through more effective sampling of high-dimensional decision spaces. At tight bounds, generalists outperform specialists through better gradient estimation. A fundamental trade-off between performance and convergence speed emerges at moderate bounds. These findings suggest that multi-agent design could benefit from matching interpretive networks to both task demands and agents’ computational limits, with implications for the efficiency and energy costs of multi-agent systems.

[MA-17] Process-Reward Tactic Evolution for Long-Horizon Bioinformatics Workflows

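[Quick Read]: This paper addresses the reliability and traceability of LLM agents that execute long-horizon, complex bioinformatics workflows, which require typed data objects, interaction with workflow software, provenance records, and checks of biological validity. Existing LLM agents can write code and call tools, but they fall short on workflows that need multi-step coordination, state management, and biological plausibility. The proposed solution is Process-Reward Tactic Evolution (PRTE), a Galaxy-based training framework that turns verified workflow rollouts into reusable "tactics". Agents train on curriculum-organized tasks in a controlled environment (Agent Gym); process verifiers score workflow construction, software interaction, execution, and biological correctness; and both successful and failed traces are distilled into a tactic library. At inference, the trained executor uses this library to run held-out BioWorkflow Bench and BioAgent Bench tasks in isolated environments. The paper evaluates whether this process-supervised accumulation of tactics improves workflow completion, biological correctness, and execution efficiency over no-memory and reflection-style baselines.

Link: https://arxiv.org/abs/2606.20839
Authors: Lingzhi Yang,Yubo Fan,Song Wu,Gilchan Park
Affiliations: Stony Brook University; Vanderbilt University; Brookhaven National Lab
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: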

Abstract:LLM agents can write code and call tools, but reliable bioinformatics work requires long-horizon interaction with workflow software, typed data objects, provenance, and biological checks. We study this setting through Galaxy workflow execution. The agent must explore task data, construct or adapt an executable workflow DAG, bind inputs and dataset collections, monitor execution, debug failures, and validate biological outputs. We propose Process-Reward Tactic Evolution, a Galaxy-based training framework that turns verified workflow rollouts into reusable tactics. During training, agents practice on curriculum-organized Galaxy tasks in Agent Gym; process verifiers score workflow construction, software interaction, execution, and biological correctness; successful and failed traces are distilled into a tactic library. At inference, the trained executor, Process-Reward Tactic Evolution, uses this library to execute held-out peer-reviewed Galaxy workflow converted BioWorkflow Bench and BioAgent Bench tasks in isolated environments. The paper evaluates whether process-supervised tactic accumulation improves long-horizon bioinformatics workflow completion, biological correctness, and execution efficiency over no-memory and reflection-style baselines.

[MA-18] Integrating Large Language Model Agents with Digital Twins for Industrial Autonomous Systems

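[Quick Read]: This dissertation addresses the lack of adaptive and generalizable reasoning in industrial automation systems that operate in dynamic, complex environments. Traditional rule-based systems rely on fixed logic, so they cannot respond autonomously to change or interpret and execute tasks uniformly across heterogeneous components. The core solution is a three-layer framework that integrates large language models (LLMs), digital twins, and automation systems into an autonomous system. The framework is built around the Task-Process-Service-Resource (TPSR) model, which transforms user tasks into executable processes, and it identifies four LLM roles: process orchestration, service matching, digital resource generation, and agent-as-a-service. Following the design science research methodology, five peer-reviewed studies develop and refine the framework. Case studies and prototypes demonstrate adaptive task planning, event-driven control, simulation-based parameterization, and digital model generation, with high task executability, command correctness, and content-generation accuracy and with reduced manual effort. The main contribution is the systematic integration of LLM-based reasoning into industrial automation to improve adaptability and usability. Limitations include dependence on accurate digital representations, the computational demands of LLMs, and the need for human intervention in safety-critical situations.

Link: https://arxiv.org/abs/2606.20761
Authors: Yuchen Xia
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Doctoral Dissertation, University of Stuttgart. Doctoral Exam Video Recording: this https URL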

Abstract:Industrial automation is being transformed by digitalization and the increasing use of cyber-physical systems. Modern production environments require greater adaptability, faster reconfiguration, and more intuitive human-machine interaction. However, traditional rule-based systems rely on fixed logic and cannot autonomously adapt to changing conditions. Consequently, current automation systems lack a systematic approach for integrating adaptive and generalizable reasoning capabilities for interpreting, planning, and executing user tasks across dynamic environments and heterogeneous components. This dissertation proposes a three-layer framework that integrates large language models (LLMs), digital twins, and automation systems into an autonomous system. Autonomy is defined as a design property assigned to system components and enabled through LLM-based reasoning to achieve adaptive, goal-oriented behavior. The Task-Process-Service-Resource (TPSR) model is introduced to transform user tasks into executable processes. Four LLM roles are identified: process orchestration, service matching, digital resource generation, and agent-as-a-service. Five peer-reviewed studies develop and refine these concepts using the design science research methodology. Case studies and prototypes demonstrate adaptive task planning, event-driven control, simulation-based parameterization, and digital model generation. Results show high task executability, command correctness, and content-generation accuracy while reducing manual effort. The framework enables the integration of LLM-based reasoning into industrial automation systems and improves adaptability and usability. Limitations include dependence on accurate digital representations, the computational demands of LLMs, and the need for human intervention in safety-critical situations.

Related DOI: https://doi.org/10.18419/opus-18222 https://doi.org/10.2370/9783819106552

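[MA-19] Empowering Economic Simulation Through Situation-Aware LLM-Driven Generative System ICASSP2026

[Quick Read]: This paper addresses two limitations: traditional economic modeling (the TOP-DOWN paradigm) neglects individual diversity and the complexity of social interactions, and existing agent-based modeling (ABM) systems generalize poorly to scenarios that were not predefined. The core solution is the SAMAS framework, which models individual agents using the macroeconomic understanding embedded in large language models (LLMs) together with the economic trajectories experienced in earlier simulation steps, so that agents make human-like decisions at the micro level. By jointly modeling macro-level structural patterns and micro-level dynamic behaviors, SAMAS achieves superior performance in volatility realism and turning point prediction.

Link: https://arxiv.org/abs/2606.20720
Authors: Zhimei Chen,Mu Chen
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computers and Society (cs.CY)
Comments: ICASSP 2026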

Abstract:Traditional economic modeling typically follows a TOP-DOWN paradigm, neglecting individual diversity and the complexity of social interactions. To better capture the complexity of societal structure, Agent-Based Modeling (ABM) employs a BOTTOM-UP solution by incorporating micro-level dynamics to generate macroeconomic phenomena. Reinforcement Learning further improves its decision-making ability through tailored reward signals. However, existing ABM systems struggle to generalize beyond predefined scenarios. Recognizing the potential of LLM-driven role-playing in perception and human-like decision-making, we propose SAMAS, which models individual agents with the rich macroeconomic understanding embedded in LLMs and the economic trajectories experienced in past simulation steps. By jointly modeling both macro-level structural patterns and micro-level dynamic behaviors, SAMAS achieves superior performance in volatility realism and turning point prediction.

[MA-20] BARD-MARL: Byzantine-Agent Detection for Learned Communication in Multi-Agent Reinforcement Learning

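[Quick Read]: This paper addresses a trust problem that learned communication introduces into cooperative multi-agent reinforcement learning (MARL): a trained policy may route information through agents that have become faulty or adversarial, which threatens overall reliability. The core solution is BARD-MARL, a post-hoc diagnostic layer on top of BayesG for detecting Byzantine agents in adaptive traffic signal control. BARD-MARL combines two agent-level evidence streams: policy-graph features extracted from state-action trajectories, and Bayesian trust statistics computed from BayesG latent mask probabilities. The experiments show that the two signals are complementary and that neither dominates across attack types. On a 25-agent grid, the combined model reaches 0.843 AUC-ROC under a 10% observation-flip attack, while policy-graph-only detection reaches 0.917 AUC-ROC under a 10% coordinated attack. On a 100-agent grid, the unified BARD-MARL variant reaches 0.982 AUC-ROC under both 10% fixed-action and 10% coordinated attacks. The study concludes that learned communication policies expose useful diagnostic evidence, but credible resilience claims require attack-specific ablations and an explicit separation between coordination, detection, and mitigation.

Link: https://arxiv.org/abs/2606.20701
Authors: Almond Kiruthu Murimi
Affiliations: Carnegie Mellon University
Categories: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures; arXiv preprint; ancillary reproducibility/code bundle included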

Abstract:Learned communication improves coordination in cooperative multi-agent reinforcement learning, but it also creates a trust problem: a trained policy may route information through agents that have become faulty or adversarial. This paper studies Byzantine-agent detection for learned-communication MARL in adaptive traffic signal control. We propose BARD-MARL, a post-hoc diagnostic layer on top of BayesG, which is used as an attributed communication substrate rather than as a contribution of this paper. BARD-MARL combines two agent-level evidence streams: policy-graph features extracted from state-action trajectories and Bayesian trust statistics computed from BayesG latent mask probabilities. Across fixed-action, observation-flip, random-noise, and coordinated attacks in SUMO traffic grids, the results show that these signals are complementary rather than universally dominant. On a 25-agent grid, BARD-MARL reaches 0.843 AUC-ROC under a 10% observation-flip attack, while policy-graph-only detection reaches 0.917 AUC-ROC under a 10% coordinated attack. On a 100-agent grid, the unified BARD-MARL variant reaches 0.982 AUC-ROC for both 10% fixed-action and 10% coordinated attacks. The study shows that learned communication policies expose useful diagnostic evidence, but credible resilience claims require attack-specific ablations and explicit separation between coordination, detection, and mitigation.
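
The claim that two evidence streams are "complementary rather than universally dominant" can be illustrated with a small AUC-ROC computation. The Python sketch below uses made-up per-agent anomaly scores. The weighted-average `fuse` function is an assumption for illustration only, since the abstract does not specify how BARD-MARL combines its signals.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the probability that a Byzantine agent (label 1) receives a
    higher score than an honest agent (label 0); ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one agent of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def fuse(policy_scores, trust_scores, weight=0.5):
    """Hypothetical fusion: weighted average of two per-agent anomaly scores,
    both assumed to lie in [0, 1]."""
    return [weight * p + (1.0 - weight) * t
            for p, t in zip(policy_scores, trust_scores)]


# Made-up scores for six agents; agents 0 and 4 are Byzantine.
labels = [1, 0, 0, 0, 1, 0]
policy_graph = [0.9, 0.2, 0.6, 0.1, 0.3, 0.4]    # misses agent 4
bayesian_trust = [0.4, 0.3, 0.1, 0.2, 0.8, 0.5]  # weak on agent 0
combined = fuse(policy_graph, bayesian_trust)

for name, scores in [("policy-graph", policy_graph),
                     ("bayesian-trust", bayesian_trust),
                     ("combined", combined)]:
    print(name, auc_roc(scores, labels))
```

In this toy data each stream ranks one Byzantine agent poorly, and the average ranks both above every honest agent. Real detectors would not fuse this cleanly, which is why the abstract calls for attack-specific ablations.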

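[MA-21] Machine-Coached Policy Revision in Adaptive Agent-Based Regulatory Simulation: A Controller-Level Contestability Layer

[Quick Read]: This paper addresses the fact that diagnostic workflows for policy-oriented agent-based models (ABMs) of regulatory interventions in complex adaptive socio-technical systems are mostly ex post and lack a feedback loop. Adaptive ABM frameworks distinguish static from adaptive agents, fixed from adaptive policies, and alternative controller designs, but diagnostic results are not systematically fed back into the policy controller. The paper proposes a lightweight machine-coached policy-revision layer. It represents policy decisions as defeasible rules with explicit conflicts and priorities, generates explanations for controller actions, and translates diagnostic failures into rule additions, removals, or priority changes. The method does not aim at a new optimal controller and offers no formal guarantees for unrestricted machine coaching. Instead, it operationalizes controller-level contestability in a simulation-compatible way: policy decisions can be explained, challenged, revised, and re-evaluated in held-out simulation runs. In an experiment with a stylized emissions-regulation ABM, targeting an over-conservatism failure in the VPVA regime, a predefined coaching template adds a relaxation rule that reduces the recurrence of over-conservatism under held-out seeds while preserving the violation, overshoot, and volatility guardrails. The paper argues that machine coaching is best understood as a controller-level extension of explainable adaptive ABM, complementary to causal, information-theoretic, and trajectory-based diagnostics.

Link: https://arxiv.org/abs/2606.20700
Authors: Roberto Garrone
Affiliations: Open University of Cyprus
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 26 pages, 2 figures, 14 tables. Methodological study of machine-coached policy-controller revision in adaptive agent-based regulatory simulation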

Abstract:Policy-oriented agent-based models are increasingly used to study regulatory interventions in complex adaptive socio-technical systems. Recent adaptive ABM frameworks distinguish between static and adaptive agents, fixed and adaptive policies, and alternative controller designs. However, most diagnostic workflows remain ex post: trajectories are analysed after simulation, but the resulting evidence is not systematically fed back into the policy controller. This paper proposes a lightweight machine-coached policy-revision layer for adaptive agent-based regulation. The layer represents policy decisions as defeasible rules with explicit conflicts and priorities, generates explanations for controller actions, and allows diagnostic failures to be translated into rule additions, removals, or priority changes. The contribution is not a new optimal controller and does not claim formal guarantees for unrestricted machine coaching. Instead, it provides a simulation-compatible operationalization of controller-level contestability: policy decisions can be explained, challenged, revised, and re-evaluated in held-out simulation runs. A stylized emissions-regulation ABM is used as the experimental component. A controlled simulation experiment focuses on an over-conservatism failure in the VPVA regime. The predefined coaching template adds a relaxation rule to the symbolic controller, reducing over-conservatism recurrence under held-out seeds while preserving violation, overshoot, and volatility guardrails. The paper argues that machine coaching is best understood as a controller-level extension of explainable adaptive ABM, complementary to causal, information-theoretic, and trajectory-based diagnostics.
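
A controller built from defeasible rules with priorities, plus a coaching step that adds a relaxation rule, might look like the following Python sketch. It is a hypothetical rendering of the idea, not the paper's implementation: the rule names, thresholds, and the `decide` and `coach_add` helpers are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[State], bool]
    action: str
    priority: int


def decide(rules: List[Rule], state: State) -> Tuple[Optional[str], str]:
    """Return the action of the highest-priority applicable rule and an
    explanation naming the conflicting rules it defeated."""
    applicable = [r for r in rules if r.condition(state)]
    if not applicable:
        return None, "no rule applies"
    winner = max(applicable, key=lambda r: r.priority)
    defeated = [r.name for r in applicable
                if r is not winner and r.action != winner.action]
    explanation = f"{winner.name} fired"
    if defeated:
        explanation += " and defeated " + ", ".join(defeated)
    return winner.action, explanation


def coach_add(rules: List[Rule], rule: Rule) -> List[Rule]:
    """Coaching step: return a revised rule base with one rule added."""
    return rules + [rule]


# Assumed base controller: tightens early, which is over-conservative.
base = [
    Rule("hold_by_default", lambda s: True, "hold", 0),
    Rule("tighten_near_cap", lambda s: s["emissions"] > 0.7 * s["cap"], "tighten", 1),
]
# Assumed coaching template: relax when emissions are comfortably below the cap.
coached = coach_add(base, Rule(
    "relax_when_safe", lambda s: s["emissions"] < 0.8 * s["cap"], "relax", 2))

state = {"emissions": 75.0, "cap": 100.0}
before = decide(base, state)
after = decide(coached, state)
```

The explanation string is what makes a decision contestable: it records which rule won and which conflicting rules lost, so a diagnosed failure can be mapped to a specific rule addition, removal, or priority change.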

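[MA-22] Structural Distinguishability of Static and Adaptive Policy Regimes in Agent-Based Regulatory Simulation

[Quick Read]: This paper addresses a limitation of policy-oriented agent-based models (ABMs) that treat regulation as a fixed scenario parameter: they cannot tell whether regulatory conclusions stem from agent adaptation, policy adaptation, or the interaction of the two. The solution is a controlled simulation benchmark. Using a single configurable emissions-regulation ABM under matched simulation conditions, it compares four regimes: constant policy/constant agents, constant policy/adaptive agents, adaptive policy/constant agents, and adaptive policy/adaptive agents. The study evaluates naive fixed policies, tracking-aware calibrated fixed policies, and three adaptive controllers (setpoint, safety-margin, and one-sided control). Scalar indicators, cap-relative symbolic diagnostics, trajectory motifs, and visual inspection together show that regulatory conclusions can differ across regimes even when average outcomes look similar. The paper therefore argues that adaptive policy-oriented ABMs should be evaluated through regime distinguishability, not only through average performance.

Link: https://arxiv.org/abs/2606.20699
Authors: Roberto Garrone
Affiliations: Open University of Cyprus
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 19 pages, 4 figures, 9 tables. Simulation-based methodological study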

Abstract:Agent-based models are widely used to evaluate policy interventions in complex socio-technical systems, yet many policy-oriented ABMs represent regulation as a fixed scenario parameter. This limits their ability to distinguish whether regulatory conclusions depend on agent adaptation, policy adaptation, or the interaction between both. Building on a previously proposed four-regime architecture, this paper contributes a controlled simulation benchmark rather than a new general framework. Using a single configurable emissions-regulation ABM, we compare constant policy/constant agents, constant policy/adaptive agents, adaptive policy/constant agents, and adaptive policy/adaptive agents under matched simulation conditions. We evaluate naive fixed policies, tracking-aware calibrated fixed policies, and three adaptive controllers: setpoint, safety-margin, and one-sided control. The benchmark recovers expected controller archetypes: setpoint control tracks the cap but produces frequent boundary crossings, safety-margin control reduces violations through conservatism, and one-sided control can limit violations but may ratchet toward over-conservatism when combined with adaptive agents. The contribution is methodological: scalar indicators, cap-relative symbolic diagnostics, trajectory motifs, and visual inspection jointly reveal how regulatory conclusions can differ even when average outcomes appear similar. Adaptive policy-oriented ABMs should therefore be evaluated through regime distinguishability, not only through average performance.
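
The three controller archetypes named in the abstract can be sketched as one-line update rules for a scalar policy stringency. The Python below is a toy under stated assumptions, not the paper's ABM: the proportional update, the `gain` and `margin` defaults, and the plant model `emissions = demand * (1 - stringency)` are all hypothetical.

```python
def setpoint(stringency, emissions, cap, gain=0.01):
    """Two-sided tracking: tighten above the cap, relax below it."""
    return max(0.0, stringency + gain * (emissions - cap))


def safety_margin(stringency, emissions, cap, gain=0.01, margin=0.1):
    """Track a target set below the cap, trading slack for fewer violations."""
    target = (1.0 - margin) * cap
    return max(0.0, stringency + gain * (emissions - target))


def one_sided(stringency, emissions, cap, gain=0.01):
    """Tighten on violations only; stringency never decreases (a ratchet)."""
    return stringency + gain * max(0.0, emissions - cap)


def simulate(controller, demand, cap, stringency=0.0):
    """Toy plant: emissions = demand * (1 - stringency), with the applied
    stringency clipped at 1. Returns (emissions, stringency) per step."""
    history = []
    for d in demand:
        emissions = d * (1.0 - min(stringency, 1.0))
        history.append((emissions, stringency))
        stringency = controller(stringency, emissions, cap)
    return history


# Assumed demand path: two high-demand steps, then demand falls below the cap.
demand = [120.0, 120.0, 90.0, 90.0, 90.0]
for controller in (setpoint, safety_margin, one_sided):
    final_stringency = simulate(controller, demand, cap=100.0)[-1][1]
    print(controller.__name__, round(final_stringency, 3))
```

Because `one_sided` never lowers stringency, any early violation leaves the policy permanently tighter, which is the ratcheting toward over-conservatism that the abstract describes.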

[MA-23] How Much Coordination Gain Is Real? A Paired Noise-Floor Protocol for Multi-Agent LLM Benchmarks

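[Quick Read]: This paper addresses consistency and reproducibility in evaluating multi-agent LLM coordination architectures. Papers in this area often cite small benchmark deltas as evidence that one architecture beats another, without checking whether those deltas exceed run-to-run noise. The paper asks a prior question: on the same model and benchmark, how much paired trial-0 disagreement do two protocols produce when their API inputs are configuration-equivalent (matched by code inspection plus a SHA-256 byte audit)? On Claude Haiku 4.5 with the tau^2-bench retail benchmark, the clean configuration-equivalent contrast (no_coord vs. intercept, both inert at trial 0) gives signed paired gaps of +10pp and 0pp across two n=100 seeds. The largest single-seed contrast is +18pp (pull vs. intercept, p_corr=0.012), but it does not reproduce at the second seed (-3pp, p_corr=1.0), and no contrast is significant after Bonferroni correction. This sets a local noise floor: with no coordination mechanism active, observed paired gaps span [-3,+18]pp, with a pooled upper Wilson CI of about 15pp. Seven of ten recent multi-agent coordination architectures report headline effects below this floor and one more falls inside the envelope, so their claimed gains remain untested under same-model paired replication. The paper defines coordination-active pass^k, that is, pass^k computed only over trials where the coordination mechanism is logically active, as the minimum reporting protocol, together with sample-size targets and runtime hooks. Measurements run on ET-MCP, a task-scoped negative-knowledge store conformant with MCP 2026-07-28, which serves as a substrate for isolating reader-side choices rather than as a contribution. On Haiku 4.5, the candidate readers (pull, intercept) do not improve trial-1 recovery, and the authors give a preliminary diagnosis of the failure modes with suggested refinements on existing production hook surfaces.

Link: https://arxiv.org/abs/2606.20695
Authors: Alibek T Kaliyev,Artem Maryanskyy
Affiliations: The University of Texas at Austin; Uber
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures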

Abstract:Multi-agent LLM coordination papers report small benchmark deltas as evidence that one architecture beats another. A prior question: how much paired trial-0 disagreement do two protocols produce on the same model and benchmark when their API inputs are configuration-equivalent (matched by code inspection plus a SHA-256 byte audit), short of full identity-replay? On Claude Haiku 4.5 against tau^2-bench retail, the clean configuration-equivalent contrast (no_coord vs. intercept, both inert at trial 0) gives signed paired gaps of +10pp and 0pp across two n=100 seeds; pooled across both, +5pp with Wilson CI [-2,+12], not significant. The largest single-seed contrast (+18pp pull-vs-intercept, p_corr=0.012) did not reproduce at the second seed (-3pp, p_corr=1.0); no trial-0 contrast is significant after Bonferroni at either seed or pooled. The envelope of observed paired gaps spans [-3,+18]pp across two seeds, with pooled upper Wilson CI ~15pp. Seven of ten recent multi-agent coordination architectures report headline effects below this local floor, and one more sits inside the envelope; whether they survive a same-model paired replication is, by construction, untested in their original settings. We define coordination-active pass^k, pass^k restricted to trials where the coordination mechanism is logically active, as the minimum reporting protocol, with sample-size targets and runtime hooks in the body. Measurements run on ET-MCP, a task-scoped negative-knowledge store conformant with MCP 2026-07-28, used as a substrate to isolate reader-side choices, not as a contribution. On Haiku 4.5 the candidate readers (pull, intercept) do not improve trial-1 recovery; we give a preliminary diagnosis of failure modes with refinements on existing production hook surfaces. 
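
Two of the statistics mentioned in the abstract can be sketched in a few lines of Python. The Wilson score interval below is the standard single-proportion form; the paper's interval on paired gaps may be constructed differently. The pass^k estimator assumes the commonly used per-task form C(c, k) / C(n, k), and the `(success, active)` trial layout used for the coordination-active restriction is a hypothetical data format, not the authors' code.

```python
from math import comb, sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for one binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def pass_hat_k(outcomes_per_task, k):
    """Estimate pass^k: mean over tasks of C(c, k) / C(n, k), where c of the
    n trials succeeded. Tasks with fewer than k trials are skipped."""
    estimates = []
    for outcomes in outcomes_per_task:
        n, c = len(outcomes), sum(outcomes)
        if n >= k:
            estimates.append(comb(c, k) / comb(n, k))
    return sum(estimates) / len(estimates) if estimates else None


def coordination_active_pass_hat_k(trials_per_task, k):
    """pass^k restricted to trials flagged as coordination-active.
    Each trial is an assumed (success, active) pair."""
    active_only = [[ok for ok, active in trials if active]
                   for trials in trials_per_task]
    return pass_hat_k(active_only, k)


low, high = wilson_interval(50, 100)
tasks = [[True, True], [True, False]]
flagged = [[(True, False), (True, True), (False, True)]]
```

Restricting to active trials matters because trials where the mechanism is inert (such as trial 0 in the abstract's setup) contribute only noise to a comparison between coordination protocols.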

[MA-24] Confidence Laundering in Agent Systems: Why Uncertainty Needs a Latent Carrier

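[Quick Read]: This paper addresses overconfidence in modern agent systems that arises from how uncertainty propagates. Fragile upstream decisions are exposed to downstream components as clean-looking intermediate artifacts, the uncertainty behind them is lost at the interface, and local ambiguity can grow into system-level error amplification. The proposed solution is latent uncertainty, an uncertainty-bearing carrier attached to decision handoffs. Rather than replacing text with hidden states, it preserves the fragility of the original decision in a form that downstream components can perceive and act on. This shifts agent uncertainty propagation from step-wise uncertainty estimation toward uncertainty-preserving interface design, with the aim of more recoverable and robust agent systems.

Link: https://arxiv.org/abs/2606.20662
Authors: Kaiwen Shi,Zheyuan Zhang,Han Bao,Colby Nelson,Yanfang Ye
Affiliations: University of Notre Dame
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: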

Abstract:Modern agent systems can turn uncertainty into overconfidence. Fragile upstream decisions are often exposed to downstream components as clean intermediate artifacts, while the uncertainty behind those decisions is lost at the interface. As a result, local ambiguity can become system-level error amplification. We argue that this reveals an interface bottleneck in agent uncertainty propagation: uncertainty does not propagate simply because a trajectory contains uncertain steps; it propagates only when it survives the handoff between components. We define uncertain decision handoff as the transfer of an intermediate decision made under uncertainty, and identify confidence laundering as a failure mode in which fragile upstream states are repackaged as procedurally valid artifacts that downstream agents over-trust. To address this bottleneck, we propose latent uncertainty as an uncertainty-bearing carrier attached to decision handoffs. Rather than replacing text with hidden states, latent uncertainty aims to preserve pre-commitment fragility in a form that downstream components can use. This position shifts agent uncertainty propagation from step-wise uncertainty estimation toward uncertainty-preserving interface design for more recoverable agent systems.
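
The difference between a laundered handoff and an uncertainty-preserving one can be sketched with a small Python data structure. This is a stand-in, not the paper's proposal: the abstract argues for a latent carrier, whereas the sketch uses explicit candidate probabilities because they are easy to show. The `Handoff` fields, the `margin` heuristic, and the `min_margin` threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Handoff:
    """An intermediate decision plus the alternatives it was chosen over."""
    decision: str
    alternatives: Tuple[Tuple[str, float], ...]  # (candidate, probability)

    @property
    def margin(self) -> float:
        """Gap between the two most likely candidates; small means fragile."""
        probs = sorted((p for _, p in self.alternatives), reverse=True)
        runner_up = probs[1] if len(probs) > 1 else 0.0
        return probs[0] - runner_up


def launder(handoff: Handoff) -> str:
    """Confidence laundering: only the clean artifact crosses the interface."""
    return handoff.decision


def downstream(handoff: Handoff, min_margin: float = 0.2) -> str:
    """Uncertainty-preserving consumer: verify fragile decisions before acting."""
    if handoff.margin < min_margin:
        return "verify:" + handoff.decision
    return "act:" + handoff.decision


fragile = Handoff("flight_A", (("flight_A", 0.52), ("flight_B", 0.48)))
solid = Handoff("flight_A", (("flight_A", 0.95), ("flight_B", 0.05)))
```

After `launder`, the fragile and solid handoffs are indistinguishable strings, which is the failure mode the abstract names. The `downstream` consumer can only treat them differently because the carrier survived the interface.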

[MA-25] Platooning Connected Autonomous and Human-Driven Vehicles: A Deep Reinforcement Learning-based Approach

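[Quick Read]: This paper addresses the limited applicability of existing vehicle platooning methods to real mixed traffic. Conventional methods are designed only for connected vehicles and do not incorporate non-connected vehicles, so they do not reflect current real-world traffic conditions. The paper proposes a hybrid platooning pattern that conditionally permits non-connected vehicles to join platoons, which increases platooning diversity and flexibility. However, unregulated integration of non-connected vehicles can trigger rapid platoon expansion and amplify the risk of disturbance propagation, which sharpens the conflict between traffic throughput and stability. To handle this, the paper develops a hybrid platooning control strategy based on deep reinforcement learning (DRL). A multi-level state representation network integrates vehicle dynamics, platoon topology, and traffic flow states to trade off traffic capacity against stability dynamically. Simulations show that the strategy suppresses the propagation of velocity disturbances and dynamically optimizes platoon structures, improving the stability and safety of mixed traffic while reducing fuel consumption and emissions.

Link: https://arxiv.org/abs/2606.20648
Authors: Zhen Qina,Dong-Fan Xie,Heng Ma,Xiaomei Zhao,Zhengbing He
Affiliations: Beijing Jiaotong University; University of Nottingham Ningbo China
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Physics and Society (physics.soc-ph)
Comments: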

Abstract:Conventionally, existing vehicle platooning approaches are designed for connected vehicles, typically including connected autonomous vehicles and connected human-driven vehicles. Non-connected vehicles, such as non-connected autonomous or human-driven vehicles, are not incorporated. As a result, these platooning approaches may not properly reflect real-world mixed traffic conditions at the current stage. To address this limitation, this study proposes a hybrid platooning pattern that conditionally permits non-connected vehicles to join platoons, thereby enhancing platooning diversity and flexibility. However, it was found that the unregulated integration of non-connected vehicles can trigger rapid platoon expansion, significantly amplifying the risk of disturbance propagation in traffic flow. This, in turn, exacerbates the inherent conflict between traffic throughput and stability. To mitigate these challenges, this paper further develops a hybrid platooning control strategy based on deep reinforcement learning (DRL). This strategy integrates vehicle dynamics, platoon topology, and traffic flow states through a multi-level state representation network, enabling a dynamic trade-off between traffic capacity and stability. Numerical simulations demonstrate that the proposed strategy effectively suppresses velocity disturbance propagation by dynamically optimizing platoon structures, thereby significantly enhancing the stability and safety of mixed traffic while reducing fuel consumption and emissions.

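[MA-26] Tessellated Biomes: Distributed Robotic Assemblies for Architectural Resilience

[Quick read]: This paper addresses the resource waste and limited adaptability that follow from the linear lifecycle of conventional construction. It proposes Tessellated Biomes, a circular framework for spatial adaptation that combines digital fabrication, multi-material optimization, and distributed robotic assembly. The key is to integrate local microfactory fabrication, discrete multi-material structural optimization, and collaborative relocation and reconfiguration by custom quadrupedal robots into one closed-loop system. Self-aligning modular components are digitally fabricated in several materials (PLA, timber, and concrete), aggregated and optimized into compression-based discrete structures, and then physically assembled and reconfigured by a team of robots. The authors position the framework as a model for resilient, reconfigurable architecture.

Link: https://arxiv.org/abs/2606.20647
Authors: Sergio Mutis, Eric Hughes, Gert Duvenhage
Affiliations: Unknown
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 10 pages, 11 figures. Accepted at CAADRIA 2026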

Abstract:This paper presents Tessellated Biomes, a cyber-physical framework for the adaptive robotic construction and reconfiguration of modular multi-material assemblies. It challenges the linear lifecycle of standard construction by fusing (1) local microfactory fabrication, (2) discrete multi-material optimization, and (3) distributed robotic assembly into a unified circular process of spatial adaptation. The research details methods for the digital fabrication of self-aligning modular primitives in multiple materials (PLA, timber, and concrete) produced in local microfactories; the aggregation and optimization of these primitives into compression-based discrete structures; and the deployment of custom quadrupedal robots that collaboratively relocate material into realized physical aggregations. The framework is validated through the fabrication, optimization, and robotic assembly of discrete structures. Together, these results position Tessellated Biomes as a model for resilient, reconfigurable architecture.

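[MA-27] Specialize Roles, Mix Deployments: Pushing the Cost-Accuracy Frontier of LLM Agent Teams

[Quick read]: This paper addresses the cost-accuracy trade-off that arises when LLM agent teams are deployed as multiple collaborating roles. Existing agent benchmarks typically test fixed models or fixed configurations, so they give little guidance on which model should fill each role and which deployment mode (API, self-hosted, or hybrid) gives the best balance of cost and accuracy. The paper introduces AgentCARD, a role-aware benchmark suite covering role assignment and deployment mode. It combines a role-decomposed evaluation harness, a unified API/self-hosted cost model, Pareto-frontier analysis, and a Shapley-based diagnostic for role bottlenecks. Experiments show that heterogeneous teams consistently occupy the cost-accuracy frontier: they improve accuracy by up to 44% over cost-equivalent homogeneous teams, or match the strongest homogeneous team at up to 12× lower per-task cost through hybrid deployment. The best role assignment is domain-dependent: some domains are planner-bottlenecked and others are executor-bottlenecked. AgentCARD also extends to workflows with additional roles such as verification, and supports continual evaluation as new domains and team structures emerge.

Link: https://arxiv.org/abs/2606.20629
Authors: Yinsicheng Jiang, Liang Cheng, Yeqi Huang, Yufan Zhao, Zhan Lu, Li Dong, Wenda Li, Edoardo Ponti, Luo Mai
Affiliations: University of Edinburgh; Microsoft Research
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: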

Abstract:LLM agents are increasingly deployed as multi-role teams, where tasks are divided across specialized roles such as planner, executor, and verifier. In these systems, cost and accuracy are no longer properties of a single model: they depend on which model fills each role and where it is hosted, including API, self-hosted, and hybrid deployment. Existing agentic benchmarks typically evaluate fixed models or fixed agent configurations, and therefore offer limited guidance for cost-accuracy-optimal deployment. We introduce AgentCARD, a role-aware benchmark suite for evaluating LLM agent teams across role assignment and deployment mode. AgentCARD combines a role-decomposed evaluation harness, a unified API/self-hosted cost model, Pareto-frontier analysis, and a Shapley-based diagnostic for identifying role bottlenecks. Our evaluation shows that heterogeneous teams consistently occupy the cost-accuracy frontier. They improve accuracy by up to 44% over cost-equivalent homogeneous teams, or match the strongest homogeneous team at up to 12× lower per-task cost through hybrid deployment. We further find that the best role assignment is domain-dependent: some domains are planner-bottlenecked, while others are executor-bottlenecked. Finally, AgentCARD extends beyond planner–executor teams to workflows with additional roles such as verification, and supports continual evaluation as new domains and team structures emerge. Our code is released at: this https URL
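The cost-accuracy frontier that the abstract refers to can be illustrated with a short sketch. This is not AgentCARD code: `pareto_frontier` is a hypothetical helper, and the team names, costs, and accuracies are invented for illustration only. A configuration is kept when no other configuration is at least as cheap and at least as accurate.

```python
from typing import List, Tuple

# (team name, cost per task in arbitrary units, accuracy in [0, 1])
Config = Tuple[str, float, float]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations that are not dominated on (cost, accuracy)."""
    frontier: List[Config] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, try the most accurate first.
    for name, cost, accuracy in sorted(configs, key=lambda c: (c[1], -c[2])):
        # A config survives only if it beats every cheaper (or equally cheap) one.
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Invented example values, not results from the paper.
teams = [
    ("homogeneous-small", 0.10, 0.52),
    ("homogeneous-large", 1.20, 0.71),
    ("hetero-planner-large", 0.35, 0.70),
    ("hetero-executor-large", 0.90, 0.66),
]
frontier = pareto_frontier(teams)
frontier_names = [name for name, _, _ in frontier]
```

In this toy data the executor-heavy team is dropped because a cheaper team is already more accurate, which is the kind of comparison a frontier analysis makes explicit.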

[MA-28] PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate

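[Quick read]: This paper addresses problems that fixed topologies cause in multi-agent debate at inference time: persistent positional bias, amplification of unreliable agents, and high sensitivity to role assignments. It proposes Permutation-Equivariant Adaptive Routing Multi-Agent Debate (PEAR), which dynamically reconfigures communication roles and sparse topologies during inference. By strategically switching agent-to-role assignments based on evolving agent states, PEAR prevents any agent from permanently occupying a privileged network position and spreads influence more evenly across the debate. The authors characterize PEAR theoretically as a permutation-equivariant sparse router: it preserves accuracy under agent relabeling while reducing routing complexity and improving generalization. Experiments on four reasoning benchmarks and six LLM backbones show that PEAR significantly improves average accuracy over the strongest debate baselines.

Link: https://arxiv.org/abs/2606.20621
Authors: Yang Feng, Ziwei Xu, Xia Hu, Fengxiang He
Affiliations: University of Edinburgh; FAR.AI; Shanghai AI Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Comments: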

Abstract:Multi-agent debate improves the reliability of large language models (LLMs) through iterative peer critiques. However, fixed topologies often introduce persistent positional biases, amplify unreliable agents, and cause high sensitivity to role assignments. We introduce Permutation-Equivariant Adaptive Routing Multi-Agent Debate (PEAR), an inference-time protocol that dynamically reconfigures communication roles and sparse topologies across consecutive debate rounds. By strategically switching agent-to-role assignments based on evolving agent states, PEAR prevents any agent from permanently occupying a privileged network position and distributes influence more evenly across the debate. We theoretically characterize PEAR as an equivariant sparse router: it preserves accuracy under agent relabeling while reducing routing complexity and improving generalization. Comprehensive empirical evaluations across four reasoning benchmarks and six diverse LLM backbones demonstrate that PEAR significantly improves average accuracy over the strongest debate baselines. The code is at this https URL.

[MA-29] Specifying AI-SDLC Processes: A Protocol Language for Human-Agent Boundaries

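[Quick read]: This paper addresses the lack of a specification language for the responsibility boundaries, approval gates, and governance constraints that arise when humans and AI agents collaborate across the software development lifecycle (SDLC). Current approaches encode process in agent prompts (which are prone to drift), target adjacent domains (workflow management, business processes), or cover only fragments (access control, approval gates), so none gives a complete, formal specification of the collaboration. The paper proposes a domain-specific language (DSL) that expresses AI-SDLC processes as protocols, with formal syntax, well-formedness conditions, operational semantics, and enforcement invariants. Its central idea is to distinguish policy (declared intent) from mechanism (structural enforcement), so that primitives such as validation tokens and capability boundaries can bound process non-determinism. Three results follow. Structural enforcement bounds the system failure rate at a weighted product of agent and validator failure rates, whereas behavioral compliance permits cumulative or near-saturating growth. The 2+N team pattern (two human-in-control roles plus N specialized agents) formalizes classical Separation of Duties for AI-SDLC. Kleene closure of orchestration loops and reflexive protocol-adherence validation emerge as design properties rather than special-case constructs. Relative to multi-agent frameworks (MetaGPT), workflow specification (FlowAgent, BPMN extensions), and capability-based security (SAGA), the novelty lies in the specific integration rather than in any single primitive. A working implementation demonstrates feasibility; empirical evaluation is future work.

Link: https://arxiv.org/abs/2606.20615
Authors: Ylli Prifti
Affiliations: Birkbeck, University of London
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: Position paper with formal specification, failure rate analysis, and feasibility demonstration. Companion empirical paper and open-source implementation forthcoming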

Abstract:AI agents now participate as first-class team members across the software development lifecycle, yet no specification language exists for expressing the human-agent responsibility boundaries, approval gates, and governance constraints this collaboration requires. Existing approaches encode process in agent prompts (subject to drift), target adjacent domains (workflow management, business processes), or address only fragments (access control, approval gates). We propose a domain-specific language for specifying AI-SDLC processes as protocols, with formal syntax, well-formedness conditions, operational semantics, and enforcement invariants. The language distinguishes policy (declared intent) from mechanism (structural enforcement), enabling implementations to bound process non-determinism through primitives such as validation tokens and capability boundaries. Three results follow. A failure rate analysis shows that structural enforcement bounds system failure rates at a weighted product of agent and validator rates, while behavioral compliance permits cumulative or near-saturating growth. The 2+N team pattern (two human-in-control roles plus N specialized agent members) formalizes classical Separation of Duties for AI-SDLC. Kleene closure of orchestration loops and reflexive protocol-adherence validation emerge as design properties rather than special-case constructs. We position the contribution against multi-agent frameworks (MetaGPT), workflow specification (FlowAgent, BPMN extensions), and capability-based security (SAGA): the novelty lies in the specific integration, not any single primitive. A working implementation demonstrates feasibility; empirical evaluation is future work.
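The contrast the abstract draws between a product-style bound and cumulative growth can be made concrete with a toy calculation. This is a simplified sketch, not the paper's analysis: it assumes independent failures, drops the weighting the abstract mentions, and uses invented rates. The function names are hypothetical.

```python
def gated_step_failure(p_agent: float, p_validator: float) -> float:
    """A gated step fails only if the agent errs AND the validator misses it
    (independence is an assumption made here for illustration)."""
    return p_agent * p_validator


def cumulative_failure(p_step: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps fails."""
    return 1.0 - (1.0 - p_step) ** steps


# Invented rates for illustration only.
p_agent, p_validator, steps = 0.10, 0.05, 20

per_step_gated = gated_step_failure(p_agent, p_validator)   # about 0.005
ungated_total = cumulative_failure(p_agent, steps)          # errors accumulate unchecked
gated_total = cumulative_failure(per_step_gated, steps)     # every step passes a gate
```

Under these assumed numbers the ungated process fails in most 20-step runs, while the gated one stays below ten percent, which matches the qualitative claim of bounded versus near-saturating growth.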

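[MA-30] AONA: A Comprehensive Architecture and Workflow Design for Global Agentic Collaboration

[Quick read]: This paper addresses fundamental limits of today's Internet infrastructure (chiefly TCP/IP and DNS) in supporting interaction among autonomous agents: it lacks semantic awareness, dynamic capability discovery, and decentralized trust. Existing systems were designed for human-centric, host-to-host communication and do not meet the needs of multi-agent systems for semantic understanding, dynamic collaboration, and mutual trust. The paper proposes AONA (Agentic Overlay Network Architecture), an overlay network architecture for the Internet of Agents (IoA). First, it argues for the theoretical necessity of multi-agent collaboration over a single super-intelligence, drawing on organizational economics, scaling principles, and the Price of Anarchy. Second, it defines a four-layer logical architecture (Base, Interconnection, Collaboration, and Application layers) that provides cross-protocol and cross-platform interoperability without changing the underlying physical network. Third, it instantiates the design with a distributed node infrastructure consisting of Management Root Nodes, Registry Service Nodes, Discovery Service Nodes, and Enterprise Intelligent Service Hubs. Finally, it specifies the operational workflows, including zero-trust identity issuance, globally coordinated semantic taxonomy synchronization, intent-driven semantic discovery, and trusted metering for commercial settlement. The authors present the result as a scalable and secure foundation for global agentic collaboration.

Link: https://arxiv.org/abs/2606.20573
Authors: Jinliang Xu, Runkai Zhu, Bingqi Li, Fanjie Nie, Jin Li, Jiagui Xie
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Multiagent Systems (cs.MA)
Comments: 28 pages, 8 figures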

Abstract:The rapid advancement of Large Language Models (LLMs) has established autonomous agents as the core vehicles for artificial intelligence applications. However, existing Internet infrastructures, primarily relying on TCP/IP and DNS, are designed for human-centric, host-to-host data transmission, inherently lacking the semantic awareness, dynamic capability discovery, and decentralized trust mechanisms required for autonomous agent interactions. To address these limitations and break the closed ecosystems of single vendors, this paper proposes AONA (Agentic Overlay Network Architecture), a novel overlay network architecture for the Internet of Agents (IoA). We first provide a multi-disciplinary scientific defense for multi-agent collaboration, demonstrating its theoretical necessity over single super-intelligence through the lenses of organizational economics, scaling principles, and the Price of Anarchy. AONA is then structured as a four-layer logical blueprint comprising the Base, Interconnection, Collaboration, and Application layers, which facilitates cross-protocol and cross-platform interoperability without disrupting the underlying physical network. To physically instantiate this blueprint, we design a distributed node infrastructure anchored by Management Root Nodes, Registry Service Nodes, Discovery Service Nodes, and Enterprise Intelligent Service Hubs for private domain integration. Finally, we detail the dynamic operational workflows that drive the network, including zero-trust identity issuance, globally coordinated semantic taxonomy synchronization, intent-driven semantic discovery, and trusted metering for commercial settlement. This comprehensive architecture provides a robust, scalable, and secure foundation for the future of global agentic collaboration.

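[MA-31] Infrastructure for the Agentic Web: Gap Analysis and Architecture from the Agentverse Platform

[Quick read]: This paper addresses the shortage of infrastructure support for deploying autonomous AI agents at scale, an area that has received far less attention than agent behaviour and reasoning. It makes three contributions. First, it presents an empirical audit of Agentverse, a production agent cloud platform developed under the Artificial Superintelligence (ASI) Alliance. The audit catalogues 204 API endpoints and their status, and derives a Gap Taxonomy of eight categories covering 62 missing capabilities, ranging from agent memory and observability to security, economic primitives, and enterprise scaling. Second, it proposes a seven-layer Agent Cloud Stack as a reference architecture for what a fully realised agent-native cloud should provide by 2030. Third, it outlines five evolution paths: from ephemeral storage to a full Agent Memory Cloud; from keyword discovery to a semantic, trust-weighted Agent DNS; from a single-protocol model to a multi-standard agent lingua franca; from single-instance hosting to Kubernetes-scale orchestration; and from simple token payments to rich agent economic primitives. Together these provide a diagnostic of current agent infrastructure and a technical roadmap toward the "Web4" vision for 2030.

Link: https://arxiv.org/abs/2606.20570
Authors: Robin Dey, Panyanon Viradecha
Affiliations: OpenHub Research; Fetch.ai; SingularityNET; CUDOS
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: 28 pages, 11 tables, 1 figure. Preprint, not peer-reviewed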

Abstract:The emergence of autonomous AI agents as first-class participants in digital infrastructure marks a fundamental inflection point in the evolution of the Web. While significant research has been directed at agent behaviour and reasoning, comparatively little attention has been paid to the infrastructure those agents require to operate reliably at scale. This paper addresses that gap with a systematic analysis of Agentverse, the agent cloud platform developed by Fetch.ai under the Artificial Superintelligence (ASI) Alliance, which represents one of the most mature production deployments of agent-native infrastructure available today. We make three principal contributions. First, we conduct an empirical audit of the Agentverse platform, cataloguing 204 API endpoints (Q1 2026) and characterising what is operational, partially deployed, or absent. From this audit we derive a Gap Taxonomy of eight categories encompassing 62 distinct missing capabilities, ranging from agent memory and observability to security, economic primitives, and enterprise scaling. Second, we propose a seven-layer Agent Cloud Stack – a reference architecture for what a fully realised agent-native cloud should provide by 2030, grounded in the specific gaps we identify. Third, we characterise five critical evolution paths: from ephemeral storage to a full Agent Memory Cloud; from keyword discovery to a semantic, trust-weighted Agent DNS; from a single-protocol model to a multi-standard agent lingua franca; from single-instance hosting to Kubernetes-scale orchestration; and from simple token payments to rich agent economic primitives. Together these contributions provide a diagnostic of current agent infrastructure and a technically grounded vision for what the agent cloud must become to support the agentic web – Web4 – by 2030.

Natural Language Processing

[NLP-0] Randomized YaRN Improves Length Generalization for Long-Context Reasoning

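[Quick read]: This paper addresses the weak length generalization of Large Language Models (LLMs): performance drops sharply on inputs much longer than those seen in training. Models are usually pretrained on short sequences and then extended to longer ones with additional fine-tuning, yet they still struggle on very long contexts. The proposed method, Randomized YaRN, combines YaRN-based positional extrapolation, randomized positional encoding, and a length curriculum. While training on short-context data, tokens are assigned YaRN positional encodings sampled from a larger position range, which exposes the model to out-of-distribution (OOD) positional representations even on short inputs. The method is evaluated on two challenging long-context reasoning benchmarks, BABILong and Multi-Round Coreference Resolution (MRCR). Trained with 8K context, Randomized YaRN consistently improves reasoning performance at context lengths from 16K to 128K and outperforms standard fine-tuning, with the largest gains at lengths far outside the training distribution. The results suggest that progressively exposing a model to OOD positional distributions is an effective recipe for generalizable long-context reasoning.

Link: https://arxiv.org/abs/2606.23687
Authors: Manas Mehta, Fangcong Yin, Greg Durrett
Affiliations: New York University
Categories: Computation and Language (cs.CL)
Comments: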

Abstract:Large language models (LLMs) are typically pretrained on short sequences and then extended to work on longer sequences with additional training. However, such LLMs still struggle to further generalize to very long sequences. We propose Randomized YaRN, a training method that improves length generalization by combining YaRN-based positional extrapolation with randomized positional encoding and a length curriculum. During training on short context data, tokens are assigned YaRN positional encodings sampled from a larger position range, exposing the model to out-of-distribution positional representations even on short-context inputs. We evaluate Randomized YaRN on two challenging long-context reasoning benchmarks, BABILong and Multi-Round Coreference Resolution (MRCR). When training on data with 8K context, Randomized YaRN consistently improves reasoning performance on context lengths from 16K to 128K and outperforms standard fine-tuning, with the largest gains appearing at far out-of-distribution lengths. Our results suggest that progressively exposing models to OOD positional distributions provides an effective recipe for generalizable long-context reasoning.
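The position-sampling idea in the abstract (short inputs receive positions drawn from a larger range) can be sketched as follows. This covers only the randomized position-id step; the YaRN frequency scaling and the length curriculum are omitted, and `sample_position_ids` with its parameters is a hypothetical helper, not the authors' code.

```python
import random
from typing import List


def sample_position_ids(seq_len: int, max_position: int, rng: random.Random) -> List[int]:
    """Draw `seq_len` distinct position ids from [0, max_position) and sort them,
    so token order is preserved while absolute positions vary."""
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    return sorted(rng.sample(range(max_position), seq_len))


# Illustrative sizes only: an 8-token input placed in a 128-position range.
rng = random.Random(0)
position_ids = sample_position_ids(seq_len=8, max_position=128, rng=rng)
```

In a training loop, ids like these would replace the default `0..seq_len-1` positions before the positional encoding is applied, so the model sees large position values without needing long inputs.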

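[NLP-1] Can LLMs Reliably Self-Report Adversarial Prefills and How?

[Quick read]: This paper examines how reliable model self-knowledge is in safety contexts, specifically whether a model can recognize that its own prior response was elicited by an adversarial prefill attack. Although large language models (LLMs) show some introspective ability on benign tasks, none of ten open-weight instruction-tuned models (3B to 70B) tested on four safety benchmarks reliably recognizes its compromised outputs: models claim intent on prefilled responses at an average rate of 27.3%. The introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing model weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, although that direction is not the signal's unique mediator. The signal also depends on the probe: framing the question as internal intention versus external tampering elicits qualitatively different responses from the same model. Three LoRA fine-tuning methods (SFT, GRPO, DPO) are tested on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B. The intervention does not transfer to the tampering probe and, counterintuitively, raises attack success rate under adversarial prefill on most models, so it amounts to a partial mitigation. Overall, the study outlines mechanisms behind introspective signals in safety contexts and highlights risks in relying on LLM self-reports.

Link: https://arxiv.org/abs/2606.23671
Authors: Quang Minh Nguyen, Uzair Ahmed, Taegyoon Kim
Affiliations: KAIST
Categories: Computation and Language (cs.CL)
Comments: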

Abstract:Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of 27.3%. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models’ weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. We test three LoRA finetuning methods (SFT, GRPO, DPO) on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B, with method ranking varying by model. The intervention does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.

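[NLP-2] Tapered Language Models

[Quick read]: This paper questions the uniform allocation of parameters across depth in modern language models (LMs), where every layer has the same capacity even though evidence suggests that layers contribute non-uniformly to the final output and that later layers mainly refine the residual stream rather than transform it. It proposes Tapered Language Models (TLMs), in which parameter capacity decreases monotonically with depth under a fixed total parameter budget, for example by shrinking MLP width with a smooth cosine schedule. MLPs serve as the single adjustable axis, and the method adds no parameters or compute. Across four architectures (Transformer, Gated Attention, Hope-attention, Titans) and several model scales, tapering consistently improves perplexity and downstream task performance, which supports depth-aware capacity allocation as a simple, architecture-agnostic design principle.

Link: https://arxiv.org/abs/2606.23670
Authors: Reza Bayat, Ali Behrouz, Aaron Courville
Affiliations: Mila; Cornell University; Université de Montréal; CIFAR AI Chair
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: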

Abstract:Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.
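A cosine taper under a fixed budget, as described in the abstract, might look like the sketch below. The exact parameterization used in the paper is not given here, so `tapered_widths`, `taper_ratio`, and the example sizes are assumptions. The sketch keeps the sum of MLP widths equal to the uniform baseline, which preserves the MLP parameter count when each layer's MLP parameters scale linearly with its width; rounding to hardware-friendly integers is left out.

```python
import math
from typing import List


def tapered_widths(num_layers: int, uniform_width: float, taper_ratio: float) -> List[float]:
    """Cosine-decay MLP widths from the first to the last layer.

    `taper_ratio` is the last layer's width relative to the first layer's width.
    The widths are rescaled so their sum equals num_layers * uniform_width.
    """
    raw = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        # Half-cosine goes from 1 at t=0 down to taper_ratio at t=1.
        raw.append(taper_ratio + (1.0 - taper_ratio) * 0.5 * (1.0 + math.cos(math.pi * t)))
    scale = num_layers * uniform_width / sum(raw)
    return [scale * r for r in raw]


# Illustrative sizes only: 12 layers, baseline MLP width 4096, last layer half as wide as the first.
widths = tapered_widths(num_layers=12, uniform_width=4096, taper_ratio=0.5)
```

Early layers end up wider than the uniform baseline and late layers narrower, while the total stays the same, which is the allocation the abstract's controlled experiment favours.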

[NLP-3] EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

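[Quick read]: This paper addresses the lack of reliable benchmarks for evaluating enterprise agents in real work settings. Existing evaluations often rely on synthetic or public data and do not reflect how agents handle heterogeneous internal files, invoke tools, and deliver business artifacts. The paper introduces EnterpriseClawBench, a private benchmark built from real enterprise agent sessions. From these sessions it derives 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain sensitive internal content, the data is not released; the reusable contribution is the construction and evaluation protocol. The best configuration (Codex with GPT-5.5) reaches only 0.663, which indicates that enterprise agent evaluation should report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than a single score.

Link: https://arxiv.org/abs/2606.23654
Authors: Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian, Zhenzhao Yuan, Junlin Yang, Dianqiao Lei, Kaiyan Zhang
Affiliations: Horizon Research, Frontis.AI
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: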

Abstract:Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, the EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness–model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: this https URL

[NLP-4] Evaluation Awareness Is Not One Capability: Evidence from Open Language Models

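[Quick read]: This paper addresses what it calls the benchmark illusion in safety evaluation. Benchmarks assume that behavior under test conditions predicts behavior in deployment, and that assumption fails when a model has evaluation awareness. If a model detects evaluation cues and adapts, the compliance measured on a benchmark is only an optimistic upper bound and does not reflect how safely the model behaves once the evaluation harness is removed. Through eight experiments on 37 open-weight models from seven families, the paper shows that evaluation awareness has several dimensions. Detectability, behavioral manifestation, and controllability vary independently and are only weakly coupled; the one robust link is a negative correlation between behavioral detection and framing resistance (ρ = -0.79, p < 0.001). Internal representations also remain detectable after behavior collapses (probe AUROC of 0.98), and multi-layer steering experiments with random controls show a causal effect on downstream tasks. The paper concludes that no single awareness score is a reliable proxy for deployment safety and that evaluation awareness must be treated as multivariate.

Link: https://arxiv.org/abs/2606.23583
Authors: Nilesh Nayan, Aishwarya Sampath Kumar, Rishiraj Girmal, Shivani Anilkumar, Sankaran Vaidyanathan, David A. Nader Palacio, Reshmi Ghosh, Soundararajan Srinivasan
Affiliations: University of Massachusetts, Amherst; Microsoft Corp.
Categories: Computation and Language (cs.CL)
Comments: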

Abstract:Safety benchmarks assume that test-condition behavior predicts deployment behavior, an assumption that fails if models detect evaluation cues and adapt. This opens a gap between benchmark performance and deployment behavior: compliance measured under test conditions becomes an optimistic upper bound that overstates how safely a model behaves once the evaluation harness is removed. We characterize this evaluation awareness through eight experiments across 37 open-weight models and seven families. (i) Detection is moderate and training-driven (24/37 models exceed chance, best AUROC 0.714 vs. 0.819 human, with instruction tuning dominating over scale). (ii) Detection shifts safety behavior (hard refusal drops 5.8 percentage points under hypothetical framing, and 21/140 HarmBench framing effects are significant, with compliance rising up to +30 percentage points). (iii) Representations survive behavioral collapse (probes retain AUROC 0.98 under rewrites that drive behavior below chance, and multi-layer steering causally moves three downstream tasks while random controls do not). (iv) These axes are weakly coupled (only 1/15 correlations are significant, the sole robust link being behavioral detection versus framing resistance, ρ = -0.79, p < 0.001). We call this gap the benchmark illusion: because detectability, behavioral manifestation, and controllability vary independently, it is multivariate rather than a single number, so no single awareness score is a reliable proxy for deployment safety.

[NLP-5] SVD-Surgeon: Optimal Singular-Value Surgery for Large Language Model Compression

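[Quick read]: This paper addresses the memory and compute costs that constrain deployment of Large Language Models (LLMs). Low-rank compression based on singular value decomposition (SVD) is effective, but existing methods focus on how to factorize and which components to keep, without a fine-grained way to compensate for the loss introduced by truncation. The paper proposes SVD-Surgeon, a training-free method that brings the Optimal Brain Surgeon (OBS) framework to the singular-value basis. Treating each singular value as a parameter, it computes a closed-form second-order update of the retained singular values that compensates for the values removed by truncation. The same analysis yields a saliency score for choosing which singular values to prune. Because it operates directly on the singular-value factorization, SVD-Surgeon can be layered on top of existing SVD compressors. Applied to SVD-LLM, a leading SVD-based method, it improves the perplexity-compression trade-off on the OPT family and LLaMA 2-7B without any retraining.

Link: https://arxiv.org/abs/2606.23568
Authors: Mahmoud Safari, Frank Hutter
Affiliations: University of Freiburg; Prior Labs; ELLIS Institute Tübingen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, 5 tables; appendix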

Abstract:Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their deployment is constrained by substantial memory and compute requirements. Low-rank compression via singular value decomposition (SVD) is an effective remedy, but existing methods focus on how to factorize and which components to keep. We introduce SVD-Surgeon, a training-free method that brings the Optimal Brain Surgeon (OBS) framework to the singular-value basis. Treating each singular value as a parameter, it computes a closed-form update of the retained singular values that compensates, to second order in the model loss, for those removed by truncation. The same analysis yields a saliency for choosing which values to prune. As it operates directly on the singular-value factorization, SVD-Surgeon can be layered on top of existing SVD compressors. Applied to SVD-LLM, a leading SVD-based method, it improves the perplexity-compression trade-off on the OPT family and LLaMA 2-7B without any retraining.
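For orientation, the classical Optimal Brain Surgeon step that the abstract builds on can be written out for a vector of singular values. This is the textbook single-parameter OBS result, not the paper's exact derivation; the symbols $s$ (singular values), $H$ (Hessian of the loss with respect to $s$), and $e_q$ (the $q$-th unit vector) are notation assumed here, and the gradient at the current point is assumed to be negligible.

```latex
% Second-order model of the loss change for a perturbation \delta of the singular values
\Delta L(\delta) \approx \tfrac{1}{2}\, \delta^{\top} H\, \delta
\qquad \text{subject to} \qquad e_q^{\top} \delta + s_q = 0 .

% Closed-form compensating update when singular value s_q is removed
\delta^{\star} = -\,\frac{s_q}{[H^{-1}]_{qq}}\; H^{-1} e_q .

% Resulting loss increase, usable as a saliency score for choosing q
\Delta L_q = \frac{s_q^{2}}{2\,[H^{-1}]_{qq}} .
```

The constraint forces the $q$-th entry to zero, the update $\delta^{\star}$ adjusts the remaining entries to absorb the damage, and $\Delta L_q$ ranks candidates so that the cheapest removals come first.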

[NLP-6] LangMAP: A Language-Adaptive Approach to Tokenization

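[Quick read]: This paper addresses how to obtain high-quality, language-specific tokenization in multilingual settings without the usual cost of a dedicated tokenizer, namely training a new model from scratch or adapting the vocabulary of a pretrained model. It proposes **Language-adaptive Maximum a Posteriori (LangMAP)** tokenization, which extends the UnigramLM algorithm to the multilingual setting and produces language-specific tokenization from a single shared vocabulary. Language labels are required at training time, but at inference the method performs language-specific tokenization without knowing the input's language. Across 14 open-source tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and, for every programming language tested, alignment with abstract syntax tree (AST) leaf boundaries. In fine-tuning experiments the results are mixed: LangMAP improves target-language grammatical acceptability (MultiBLiMP), while its benefits on knowledge-related tasks (Global-PIQA, Belebele) are less consistent.

Link: https://arxiv.org/abs/2606.23566
Authors: Clara Meister, Suchir Salhan, Andrzej Szablewski, Pietro Lesci, Paula Buttery, Tiago Pimentel
Affiliations: EPFL; University of Cambridge; ETH Zürich
Categories: Computation and Language (cs.CL)
Comments: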

Abstract:Language-specific tokenizers improve tokenization quality and the downstream performance of models on those languages. However, using such a tokenizer comes at a cost: either a new model must be trained from scratch, or the vocabulary of an existing pretrained model must be adapted. We propose Language-adaptive Maximum a Posteriori (LangMAP) Tokenization, a tokenization scheme that extends the UnigramLM algorithm to the multilingual setting, producing language-specific tokenization from a single shared vocabulary. Notably, LangMAP can be used when training a multilingual language model from scratch or to adapt a pretrained model’s tokenizer to individual languages without changing its vocabulary. While language labels are required at training time, a key feature of the algorithm is that it then performs language-specific tokenization at inference without knowledge of the input’s language. Across 14 open-source tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and, for all coding languages tested, alignment with abstract syntax tree (AST) leaf boundaries. In fine-tuning experiments, results are mixed: LangMAP improves target-language grammatical acceptability (MultiBLiMP) on the languages tested; its benefits are less consistent on knowledge-related tasks (Global-PIQA, Belebele).

[NLP-7] The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

[Quick read]: This paper addresses the difficulty of predicting the energy consumption of training large Transformer models on multiple GPUs, which matters more for sustainable and cost-aware system design as model size and parallelism grow. It proposes an energy modeling framework based on controlled architectural sweeps of BERT models, relating measured energy to lightweight proxies for compute, memory traffic, and hardware efficiency. Inspired by roofline models, the approach adds a speedup-based hardware-efficiency factor that captures the effects of tensor parallelism and fully sharded data parallelism. From this the authors derive a scaling-law model that predicts training energy across heterogeneous configurations.

Link: https://arxiv.org/abs/2606.23546
Authors: Mansour Zoubeirou a Mayaki
Affiliations: Université Lumière Lyon 2; CNRS; Ecole Centrale de Lyon; INSA Lyon; Universite Claude Bernard Lyon 1; LIRIS, UMR5205
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Transformer-based models underpin modern natural language processing but incur rapidly growing computational and energy costs. As training scales in both model size and parallelism, accurately predicting energy consumption has become critical for sustainable and cost-aware system design. We present a framework for modeling the energy consumption of Transformer training on multiple GPUs. Using controlled architectural sweeps of BERT models, we relate measured energy to lightweight proxies for compute, memory traffic, and hardware efficiency. Inspired by roofline models, our approach incorporates a speedup-based hardware-efficiency factor that captures the effects of tensor parallelism and fully sharded data parallelism. We derive a scaling law model that accurately predicts training energy across heterogeneous configurations.
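The roofline idea this abstract builds on can be sketched in a few lines. This is a minimal illustration, not the paper's fitted scaling law: the function names, the constant average power draw, and the way the speedup factor divides single-GPU time are all assumptions made here for illustration.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Classic roofline bound: the lower of the compute roof and the
    memory roof (bandwidth in bytes/s times intensity in FLOP/byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def estimate_energy_joules(total_flops, total_bytes, peak_flops, mem_bandwidth,
                           avg_power_watts, n_gpus=1, speedup=1.0):
    """Hypothetical energy estimate: energy = n_gpus * power * wall time.

    `speedup` is the measured speedup over one GPU (assumed input);
    speedup / n_gpus is the usual parallel efficiency.
    """
    intensity = total_flops / total_bytes              # FLOP per byte moved
    rate = attainable_flops(peak_flops, mem_bandwidth, intensity)
    single_gpu_time = total_flops / rate               # seconds on one GPU
    wall_time = single_gpu_time / speedup              # seconds on n_gpus
    return n_gpus * avg_power_watts * wall_time


one_gpu = estimate_energy_joules(1e15, 1e13, 1e14, 1e12, 300.0)
four_gpu = estimate_energy_joules(1e15, 1e13, 1e14, 1e12, 300.0,
                                  n_gpus=4, speedup=3.2)
print(one_gpu, four_gpu)
```

In this toy setting the 4-GPU run finishes sooner but uses more total energy than the single-GPU run, because its parallel efficiency (3.2 / 4) is below 1.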

[NLP-8] VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

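Quick read: This paper addresses a core problem in scaling reinforcement learning (RL) for visual mathematical reasoning: as data volume grows, reward labels become less reliable. Existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct, so neither verifies the answers themselves. The key idea is to treat scaling as a verifiable data-construction problem and to decouple two axes before any policy update: prompt difficulty, raised by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. The resulting framework, VeriEvol, has two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts, and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The verified data scales in volume, can be extended with new evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises mean accuracy from 35.42 to 54.73; with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. The authors release the prompts, data, models, code, and the full verifier trace of every sample so that downstream work can scale and audit the whole pipeline.

Link: https://arxiv.org/abs/2606.23543
Authors: Haoling Li, Kai Zheng, Jie Wu, Can Xu, Qingfeng Sun, Han Hu, Yujiu Yang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: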

Abstract:Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.

[NLP-9] Self-Compacting Language Model Agents

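Quick read: This paper addresses stale context in long agent traces: accumulated chains of thought and tool calls anchor subsequent generations and eventually outgrow the context window. Existing scaffolds use fixed-interval compaction triggered at a token threshold, which ignores trajectory structure and risks discarding partial results mid-derivation or mid-search. The paper proposes SelfCompact, a scaffold in which the model itself decides when and how to compact. It pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed: with the tool alone, usage is uneven and often badly timed, and the rubric alone cannot act. Together they yield adaptive compaction without fine-tuning or external supervision. Across six benchmarks (competitive math and agentic search) and seven models, SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. The results expose a meta-cognitive gap: unprompted models cannot reliably tell when their own context is rotting, but a lightweight rubric closes the gap, making compaction timing a capability that scaffolds can supply without training.

Link: https://arxiv.org/abs/2606.23525
Authors: Tianjian Li, Jingyu Zhang, William Jurayj, Xi Wang, Chuanyang Jin, Mehrdad Farajtabar, Eric Nalisnick, Daniel Khashabi
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 3 figures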

Abstract:Long agent traces composed of chains of thought and tool calls accumulate stale content that anchors subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate this with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.
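The contrast between a fixed token-threshold trigger and the rubric described in this abstract can be sketched as two predicates. In SelfCompact the model itself judges these conditions; the `TraceState` flags, the function names, and the 8,000-token threshold below are hypothetical stand-ins for that judgment, not part of the paper.

```python
from dataclasses import dataclass


@dataclass
class TraceState:
    """Hypothetical snapshot of an agent trace at one decision point."""
    tokens_in_context: int
    subtask_resolved: bool = False
    converging: bool = False
    mid_derivation: bool = False
    stuck: bool = False


def fixed_interval_trigger(state, token_threshold=8000):
    """Baseline: compact whenever the context passes a token threshold,
    regardless of what the trajectory is doing."""
    return state.tokens_in_context >= token_threshold


def rubric_trigger(state):
    """Rubric from the abstract: suppress mid-derivation or when stuck,
    fire when a sub-task has resolved or the trajectory is converging."""
    if state.mid_derivation or state.stuck:
        return False
    return state.subtask_resolved or state.converging


busy = TraceState(tokens_in_context=9000, mid_derivation=True)
done = TraceState(tokens_in_context=2000, subtask_resolved=True)
print(fixed_interval_trigger(busy), rubric_trigger(busy))
print(fixed_interval_trigger(done), rubric_trigger(done))
```

The two example states show the disagreement the abstract describes: the threshold fires in the middle of a derivation, while the rubric waits and instead fires once a sub-task has resolved, even at a much smaller context size.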

[NLP-10] War in the Abstract: The Rise and Consequences of Militarized Language in Scientific Communication

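Quick read: This paper examines the spread of militaristic language in scientific writing and its effect on how science is received. Between 2010 and 2025, the presence of militaristic terms in scientific abstracts rose 48% in OpenAlex and 32% in PubMed, accelerating sharply after 2019 (cross-database r = 0.96, p < 10^-8), and its prevalence tracks conflict levels (Uppsala Conflict Data Program; r = 0.77-0.84). Abstracts from the Global South show the fastest rise; social sciences currently use such language most, while engineering and computer science are growing fastest. The COVID and post-2022 large-language-model eras saw further increases and a narrower language gap between native-English and non-English authors. In a within-subject war-framing experiment (N = 801; 32,040 trials), war framing reduced credibility (mean shift -0.18 Likert units, 95% CI [-0.21, -0.14]; d_z = -0.28, p < 10^-20), funding willingness (d_z = -0.12), and policy support (d_z = -0.08), with a trend-level increase in sense of urgency (d_z = +0.07). The paper's contribution is to document the systematic drift toward militaristic language and its negative causal effect on credibility and support, and it cautions against such rhetoric in scientific communication.

Link: https://arxiv.org/abs/2606.23462
Authors: Sovesh Mohapatra, David Lydon-Staley, Dani S. Bassett
Affiliations: University of Pennsylvania; Yale University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Comments: 26 pages, 7 figures, 2 SI items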

Abstract:Scientists do not, by profession, wage war. Yet warfare's vocabulary consistently appears in their abstracts. To quantify the extent to which warfare's vocabulary pervades scientific abstracts, we analyze 21.4 million papers (2010-2025; OpenAlex, PubMed). We additionally run a within-subject war-framing experiment (N = 801; 32,040 trials) designed to provide causal insight into the effects of militaristic language on persuasion. Between 2010 and 2025, the presence of militaristic terms in scientific abstracts rose 48% in OpenAlex and 32% in PubMed, with the rise accelerating sharply after 2019 (cross-database r = 0.96, p < 10^-8). The prevalence of militaristic language is conflict-aligned at both country and annual scales (Uppsala Conflict Data Program; r = 0.77-0.84), with the abstracts from the Global South displaying the fastest rise in militaristic language. Among disciplines, social sciences lead in level of such language while engineering and computer science lead in growth. The COVID and post-2022 large-language-model eras also saw a rise and a narrowing of the language gap between native-English and non-English authors. In our follow-up experiment, we found that war framing reduced credibility (mean shift -0.18 Likert units, 95% CI [-0.21, -0.14]; d_z = -0.28, p < 10^-20), funding willingness (d_z = -0.12) and policy support (d_z = -0.08), with a trend-level increase in sense of urgency (d_z = +0.07). Collectively, findings reveal that while scientific abstracts drift toward warfare, the use of militaristic language may erode credibility, funding willingness, and policy support.

[NLP-11] TriggerBench: Investigating Prospective Memory for Large Language Models

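Quick read: This paper addresses the lack of evaluation of prospective memory (PM) in large language models (LLMs) during long interactions. Existing evaluations mostly test retrospective memory (RM) through explicit queries and overlook the ability to spontaneously recall and act on latent constraints without a direct prompt, which is the essence of PM. The authors propose TriggerBench, a comprehensive PM benchmark spanning five dimensions across daily assistants and professional workflows. By pairing scenarios with matched RM controls, contrastive positive/negative variants, and overloaded triggers, it measures proactive recall, false-alarm rate, and attentional robustness under a single protocol. Three findings follow. (1) PM shows a precision-recall trade-off and attentional fragility: enhanced reasoning improves proactive recall, but models may overfit to an "always-remind" heuristic, and PM accuracy degrades substantially under implicit constraints or triggers overloaded by concurrent requests. (2) PM is much harder than RM: on identical contexts, RM near-saturates up to 100K tokens, while PM accuracy decays sharply as context grows. (3) PM may serve as a behavioral probe of spare reasoning capacity: when PM scenarios are paired with AIME-2025 math problems, successful trajectories yield higher PM accuracy than failed ones at the same context length, showing that PM tracks spare reasoning budget that token count obscures.

Link: https://arxiv.org/abs/2606.23459
Authors: Tianhua Zhang, Xinjiang Wang, Qianxi Zhang, Qi Chen, Kun Li, Yaoqi Chen, Dingdong Wang, Helen Meng, Yan Lu
Affiliations: The Chinese University of Hong Kong; Microsoft Research Asia
Categories: Computation and Language (cs.CL)
Comments: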

Abstract:While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated. We introduce TriggerBench, a comprehensive PM benchmark spanning five dimensions across both daily assistants and professional workflows. TriggerBench pairs scenarios with matched RM controls, contrastive positive/negative variants, and overloaded triggers, enabling fine-grained measurement of proactive recall, false-alarm rate, and attentional robustness under a single protocol. Our evaluation yields three key findings. (i) PM shows a precision-recall trade-off and attentional fragility. Though enhanced reasoning significantly improves proactive recall, models may overfit to an “always-remind” heuristic. Furthermore, PM accuracy degrades substantially under implicit constraints or triggers overloaded by concurrent user requests, indicating that robust PM remains an open challenge. (ii) PM is notably harder than RM: on identical contexts, RM near-saturates up to 100K tokens, while PM decays sharply as context length scales. (iii) PM may serve as a behavioral probe of spare reasoning capacity. Pairing PM scenarios with AIME-2025 math problems reveals that successful trajectories yield higher PM accuracy than failed ones at the same context length, showing PM tracks spare reasoning budget that token count obscures. Project page: this https URL.

[NLP-12] UnBias-Plus: Detect, Explain, and Rewrite Bias

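Quick read: This paper addresses several gaps in detecting bias in natural language: most existing methods only flag the presence of bias and offer limited support for granular detection, interpretable explanations, neutral rewriting, and openly available trained models. The solution is UnBias-Plus, an open-source toolkit that unifies four functions: (1) segment-level multi-class bias classification, (2) biased span localization, (3) neutral text rewriting, and (4) reasoning for each decision. It is available through Python, a command-line interface (CLI), a REST API, and a web interface, and the source code, trained models, datasets, and documentation are all public, supporting bias analysis in domains such as journalism, education, and AI research.

Link: https://arxiv.org/abs/2606.23412
Authors: Ahmed Y. Radwan, Ahmed ElKady, Sindhuja Chaduvula, Mohamed Hafez, Amrit Krishnan, Shaina Raza
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: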

Abstract:Bias in natural language remains a persistent challenge in both human-written and AI-generated content, affecting domains such as journalism, education, and AI research. Most existing detection methods identify only the presence of bias, with limited support for granular detection, interpretable explanations, neutral rewriting, and openly available trained models. We present UnBias-Plus, an open-source toolkit unifying (1) segment-level multi-class bias classification, (2) biased span localization, (3) neutral text rewriting, and (4) reasoning for each decision. Available via Python, CLI, REST API, and web interfaces, UnBias-Plus supports accessible bias analysis. The toolkit, source code, models, datasets, and documentation are publicly available.

[NLP-13] Reasoning Lens: Hierarchical Visualization and Diagnostic Auditing for Large Reasoning Models ICIP

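Quick read: This paper addresses the interpretability burden created by the very long Chain-of-Thought (CoT) traces of large reasoning models, where critical logic is buried under large amounts of procedural text and is hard to diagnose or audit. The solution is ReasoningLens, an open-source framework for hierarchical visualization and diagnostic auditing of complex reasoning chains, built on three mechanisms: (1) structuring traces into interactive hierarchies that separate high-level strategy from low-level execution; (2) an agentic auditor for automated error detection and tool-augmented verification; and (3) systemic reasoning profiles that reveal model-specific blind spots. By turning unstructured text into actionable insights, the framework offers a modular foundation for interpreting, debugging, and optimizing reasoning-centric AI systems.

Link: https://arxiv.org/abs/2606.23404
Authors: Jun Zhang, Jiasheng Zheng, Boxi Cao, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Our project is available at this https URL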

Abstract:The emergence of Large Reasoning Models has introduced exceptionally long Chain-of-Thought traces, creating a transparency burden where critical logic is often buried under massive procedural text. To address this, we present ReasoningLens, an open-source framework designed for the hierarchical visualization and diagnostic auditing of complex reasoning chains. ReasoningLens addresses information necropsy by: (1) structuring traces into interactive hierarchies that separate high-level strategy from low-level execution; (2) leveraging an agentic auditor for automated error detection and tool-augmented verification; and (3) synthesizing systemic reasoning profiles to reveal model-specific blind spots. By transforming unstructured walls of text into actionable insights, ReasoningLens provides a modular foundation for interpreting, debugging, and optimizing the next generation of reasoning-centric AI.

[NLP-14] Do LLM Embedding Spaces Recover Expert Structure?

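Quick read: This paper asks whether the representational structure of pretrained text embeddings recovers expert-defined structure in mental-health-related language. High category separability is often taken as a sign of representation quality, but the geometry need not match the symptom relations defined by domain experts, especially in online communities with strong domain, affective, stylistic, and discourse confounds. Using data from 28 Reddit mental-health communities, the study compares pretrained and supervised fine-tuned Qwen3 embedding spaces at 0.6B and 4B parameters. It constructs category prototypes, compares their representational dissimilarity matrices with an expert symptom matrix using representational similarity analysis (RSA), and adds prototype-based typicality and multi-baseline confound controls (VAD, LIWC, lexical style, and topic distribution). Pretrained embeddings already show measurable alignment with expert structure within the mental-health subset; fine-tuning strengthens alignment most at the finest category level; and larger scale improves both zero-shot alignment and the gains from fine-tuning. Residual alignment remains substantial after controlling for these confounds. The authors conclude that LLM embeddings can recover expert-relevant category geometry, but that this recovery is level-dependent and must be tested against explicit confounds rather than inferred from classification performance alone.

Link: https://arxiv.org/abs/2606.23394
Authors: Yixuan Zhu, Zhenke Duan, Fanghen Li
Affiliations: Zhongnan University of Economics and Law; CodeSoul.co
Categories: Computation and Language (cs.CL)
Comments: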

Abstract:Pretrained text embeddings are increasingly used as representational maps, yet high category separability does not imply that their geometry recovers expert-defined structure. We study this problem in mental-health-related language, where symptom relations provide an external reference and online communities introduce strong domain, affective, stylistic, and discourse confounds. Using 28 Reddit communities, we compare pretrained and supervised fine-tuned Qwen3 embedding spaces at two scales (0.6B and 4B). We construct category prototypes, evaluate their representational dissimilarity matrices against an expert symptom matrix with representational similarity analysis, and complement this global test with prototype-based typicality and multi-baseline confound controls. Pretrained embeddings show measurable alignment with expert structure within the mental-health subset; fine-tuning strengthens this alignment most at the finest category level; and larger scale improves both zero-shot alignment and supervision-induced gains. Residual alignment remains substantial after controlling for VAD, LIWC, lexical style, and topic-distribution structure. These results suggest that LLM embeddings can recover expert-relevant category geometry, but this recovery is level-dependent and should be tested against explicit confounds rather than inferred from classification alone.
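The prototype-plus-RSA comparison described in this abstract can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: cosine distance for the dissimilarity matrix, Spearman correlation over the upper triangle, and toy two-dimensional embeddings are choices made here, and all function names are hypothetical rather than taken from the paper.

```python
import numpy as np


def prototypes(embeddings, labels):
    """Mean embedding per category, returned in sorted label order."""
    cats = sorted(set(labels))
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    return cats, np.stack([embeddings[labels == c].mean(axis=0) for c in cats])


def cosine_rdm(protos):
    """Representational dissimilarity matrix: 1 - cosine similarity."""
    unit = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def rsa_spearman(model_rdm, expert_rdm):
    """Spearman correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(model_rdm, k=1)
    a = average_ranks(model_rdm[iu])
    b = average_ranks(expert_rdm[iu])
    return float(np.corrcoef(a, b)[0, 1])


# Toy data: three categories, two posts each (made-up values).
emb = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0],
                [1.0, 1.0], [-1.0, 2.0], [-1.0, 2.0]])
labels = ["a", "a", "b", "b", "c", "c"]
cats, protos = prototypes(emb, labels)
expert = np.array([[0.0, 0.1, 0.9], [0.1, 0.0, 0.5], [0.9, 0.5, 0.0]])
print(cats, rsa_spearman(cosine_rdm(protos), expert))
```

The expert matrix must list categories in the same order as `cats`. Because the score is rank-based, it rewards agreement in the ordering of category distances and is insensitive to their scale, which is why it can be high even when raw classification accuracy says little about geometry.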

[NLP-15] Self-Stigma Is Not a Monolith but Generic Empathy Is: Persona-Conditioned LLM Support for People Who Use Drugs

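Quick read: This paper addresses self-stigma among people who use drugs (PWUD), which predicts treatment avoidance and disengagement; existing conversational systems typically treat self-stigma expression as a uniform signal and ignore its heterogeneity. The study proposes a persona-aware approach to LLM support. Latent Profile Analysis (LPA) on data from 1,174 Reddit users who express self-stigma yields a four-persona typology, validated against held-out behavioral and linguistic features. Sequential Bayesian and recurrent neural classifiers then recover these personas from limited posting histories, substantially outperforming batch and few-shot LLM baselines (macro-F1 = 0.74 at 30 posts). Clinical expert evaluation, however, shows a misalignment: persona-matched responses achieved the targeted behavioral shifts, yet raters holistically preferred the generic empathy of the persona-neutral baseline. The finding exposes a tension between holistic empathy judgments and clinically aligned response design, and suggests that evaluating LLM-based stigma support needs rubrics that can decompose the two.

Link: https://arxiv.org/abs/2606.23387
Authors: Layla Bouzoubaa, Rezvaneh Rezapour
Affiliations: Drexel University
Categories: Computation and Language (cs.CL)
Comments: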

Abstract:Self-stigma predicts treatment avoidance and disengagement among people who use drugs (PWUD), yet conversational systems aiming to provide support typically treat self-stigma expression as a uniform signal. We present a three-phase, proof-of-concept study of a persona-aware approach to LLM support. Latent Profile Analysis (LPA) on indicator-level features from 1,174 self-stigma expressors on Reddit yields a four-persona typology validated against held-out behavioral and linguistic features. Sequential Bayesian and recurrent neural classifiers recover these personas from limited posting histories, substantially outperforming batch and few-shot LLM baselines (macro-F1 = 0.74 at 30 posts). Evaluation by eight clinical experts across three contemporary LLMs revealed a misalignment: persona-matched responses successfully achieved targeted behavioral shifts, yet raters holistically preferred the generic empathy of the persona-neutral baseline. Our findings suggest that holistic empathy judgments and clinically-aligned response design can pull in opposite directions, and that evaluating LLM-based stigma support requires rubrics capable of decomposing the two.

[NLP-16] Energy-Based Transformers as Predictors of Reading Difficulty

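Quick read: This paper starts from the observation that Transformer language-model measures used to predict reading difficulty, such as surprisal and attention entropy, capture complementary aspects of processing load, so several separate measures are needed to explain the data. The proposed solution is to use energy-based transformers, which have a principled formal link to associative memory models and so connect directly to work on Hopfield networks and dense associative memory. Across reading-time corpora (Natural Stories, UCL eye-tracking, and UCL self-paced reading), the energy measure is a robust predictor of reading times and provides significant fit beyond surprisal in all three. In a controlled experiment on relative clause processing, energy at a single layer captures the well-known object/subject asymmetry, and there is evidence that it subsumes effects attributable to both attention entropy and surprisal, suggesting that energy may serve as a single unified predictor of processing load.

Link: https://arxiv.org/abs/2606.23382
Authors: Jakub Dotlacil, Ece Takmaz
Affiliations: Utrecht University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: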

Abstract:Transformer language models have become established tools for modeling human sentence processing, with measures such as surprisal and attention entropy serving as effective predictors of reading difficulty that together capture complementary aspects of processing load. Here, we explore a related class of transformer models: energy-based transformers, which provide a principled formal link to associative memory models, bringing processing research into direct contact with the broader literature on Hopfield networks and dense associative memory. To our knowledge, this is the first exploration of an energy-based transformer measure in computational psycholinguistics. Across reading-time corpora (Natural Stories, UCL eye-tracking, UCL self-paced reading), the energy measure is a robust predictor of reading times, providing significant fit beyond surprisal in all three. In a controlled experiment on relative clause processing, energy at a single layer captures the well-known object/subject asymmetry. We find evidence that it subsumes effects attributable to both attention entropy and surprisal, suggesting that energy may serve as a single unified predictor where multiple complementary measures have previously been required.

[NLP-17] Measuring and Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

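Quick read: This paper addresses refusals caused by over-triggered safety guardrails (over-alignment) when LLMs are applied to criminal law. When processing case text that describes violent or sexual offenses, small on-premises LLMs may refuse to respond or append disclaimers, which disrupts legitimate legal work and harms task faithfulness. The authors introduce TF-RefusalBench, a multilingual benchmark built from real rulings, containing 5,200 prompts across French, German, Italian, and English, to measure refusal behavior in criminal-law translation and summarization. They find that over-alignment depends on the model, the prompt, and the languages involved, and that refusal rate alone does not capture its impact, because disclaimers also degrade task faithfulness. Further experiments show that prompting can be effective, while abliteration (ablating refusal directions in the model) eliminates refusals with minimal impact on task performance, offering a practical path to on-premises LLM use in criminal-law settings.

Link: https://arxiv.org/abs/2606.23375
Authors: Arthur Wuhrmann, Gaetan Stein, Daniel Brunner, Andrei Kucharavy
Affiliations: Surelio.ai; Swiss Federal Supreme Court; HES-SO Valais-Wallis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures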

Abstract:While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably, the Swiss Federal Supreme Court uses small on-premises models for tentative translations and short-passage summarization across the four official languages. However, such usage is challenging in the context of Criminal Law. Because the rulings and cases that employees routinely work on can contain detailed descriptions of violent and sexual offenses, their legitimate work is compromised by refusals and disclaimers due to the activation of model guardrails (over-alignment). To measure this phenomenon, we introduce TF-RefusalBench, a multilingual benchmark for criminal-law translation and summarization derived from public Swiss Supreme Court rulings. TF-RefusalBench contains 5,200 total prompts across French, German, Italian, and English, corresponding to common task prompts and passages likely to trigger refusal. We then use TF-RefusalBench to show that over-alignment is a multifaceted phenomenon, influenced by the model and the prompt and text languages being processed, and that its impact cannot be evaluated solely from an over-refusal perspective, given the disclaimer's impact on task faithfulness. Finally, we evaluate approaches to enable on-premises LLMs for Criminal Law Tasks, demonstrating that while prompting can be effective, abliteration (refusal directions ablation) eliminates refusal with minimal impact on task performance.

[NLP-18] WaveDetect: Robust Framework for Machine-Generated Text Detection via Wavelet Transform

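Quick read: This paper addresses the fragility of current detectors of LLM-generated text under adversarial perturbations, cross-domain shifts, and the rapid evolution of foundation models; existing detectors rely on surface-level semantic artifacts. The key idea is WaveDetect, a framework that recasts text detection as a signal-processing task in the time-frequency domain: it models the generated output as a probability signal and applies a differentiable Continuous Wavelet Transform to turn it into learnable spectral representations, revealing "spectral fingerprints" of machine-generated text that are invisible in the time domain. The method improves detection accuracy and is robust to adversarial attacks, out-of-distribution topics, and unseen evolving LLMs, supporting spectral analysis as a promising paradigm for detecting generated text.

Link: https://arxiv.org/abs/2606.23336
Authors: Zhichen Liu, Kaitong Qin, Linhan He, Yang Xu
Affiliations: Southern University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: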

Abstract:As Large Language Models asymptotically approach human-level fluency in natural language generation, solely relying on surface-level semantic artifacts for detecting LLM-generated texts has become increasingly precarious. Existing detectors often falter when facing three critical challenges: adversarial perturbations, cross-domain shifts, and the rapid temporal evolution of the foundation model. To address these issues, we propose WaveDetect, a novel framework that reformulates text detection as a signal processing task within the time-frequency domain. Unlike previous methods that analyze static token probability distributions, WaveDetect models the generated output as a probability signal, to which a differentiable Continuous Wavelet Transform is applied to convert it into learnable spectral representations. This process reveals the intrinsic "spectral fingerprints" in machine-generated texts, patterns that remain invisible in the time domain. Comprehensive evaluations on three well-curated datasets (RAID, EvoBench, and Domain-Shift) show that our method achieves a new state-of-the-art. It not only achieves superior accuracy but also exhibits remarkable robustness against sophisticated attacks and generalizes across out-of-distribution topics and unseen evolving LLMs. Our results validate the efficacy of spectral analysis as a promising paradigm for LLM-generated text detection.
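To make the "probability signal to spectral representation" step concrete, here is a minimal NumPy sketch of a continuous wavelet transform over a one-dimensional signal. It is not the paper's implementation: the Ricker (Mexican hat) wavelet, the plain non-differentiable NumPy convolution, the use of one value per token as the signal, and the function names are all assumptions made for illustration. Widths are assumed to be at least 1.

```python
import numpy as np


def ricker(points, width):
    """Ricker (Mexican hat) wavelet sampled at `points` positions,
    centred on the middle sample."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * width) * np.pi ** 0.25)
    return norm * (1.0 - (t / width) ** 2) * np.exp(-0.5 * (t / width) ** 2)


def cwt(signal, widths):
    """Return an array of shape (len(widths), len(signal)): one row of
    wavelet responses per scale."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for w in widths:
        n = min(10 * int(w) + 1, len(signal))
        if n % 2 == 0:
            n -= 1  # keep the kernel odd-length so it stays centred
        rows.append(np.convolve(signal, ricker(n, w), mode="same"))
    return np.stack(rows)


# Toy "probability signal": flat except for one abrupt spike at token 32.
sig = np.zeros(64)
sig[32] = 1.0
coefs = cwt(sig, [1, 2, 4])
print(coefs.shape, int(np.argmax(np.abs(coefs[0]))))
```

Each row shows how strongly the signal varies at one scale and where, so a classifier fed this 2-D array sees both local spikes and slower drifts that a flat per-token feature vector would not separate.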

[NLP-19] Tmax: A simple recipe for terminal agents

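Quick read: This paper addresses the lack of effective reinforcement learning (RL) training recipes for terminal agents, which the authors attribute to difficult benchmarks, scarce data, and the absence of simple, reproducible baselines. The solution is Tmax, which the authors describe as the strongest open RL recipe for terminal agents to date. Data is generated with a novel taxonomy combining difficulty control, personas, and verifier diversification, which makes it cheap to produce large numbers of terminal environments for RL and supervised fine-tuning (SFT). With only 9B parameters, the recipe reaches 27% on Terminal-Bench 2.0, outperforming much larger models from prior work. The authors open-source a terminal-agent dataset more than 2.5x larger than previously released ones, along with trained models and code, as a strong baseline for future open academic research.

Link: https://arxiv.org/abs/2606.23321
Authors: Hamish Ivison, Junjie Oscar Yin, Rulin Shao, Teng Xiao, Nathan Lambert, Hannaneh Hajishirzi
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: preprint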

Abstract:Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined RL-based training of these models, likely due to difficult benchmarks, a lack of data, and a lack of simple baseline recipes. We present Tmax, the strongest open RL recipe for terminal agents to date, bringing open data recipes closer to the frontier. While simple, our recipe achieves 27% on Terminal-Bench 2.0 with only 9B parameters, outperforming much larger models from prior work. Concretely, we generate data using a novel taxonomy, combining difficulty control, personas, and verifier diversification, which allows us to cheaply generate large amounts of terminal environments for RL and SFT training. We open-source our terminal dataset, which is over 2.5x larger than previously released terminal-agent datasets. We then train open-weight models using RL with our data, using a simple, outcome-only recipe. We release our data, models, and code as a strong baseline for future open academic work on terminal agents at this https URL.

[NLP-20] Uncertainty-based Debiasing and Unlearning for Decontamination

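Quick read: This paper addresses data contamination in large language model (LLM) evaluation, which inflates reported performance and undermines fair comparison. Existing decontamination methods are usually evaluated only by aggregate accuracy, which hides differences in per-sample model behavior, and many require an uncontaminated model as a reference. The paper proposes a sample-level evaluation framework that uses distributional distance metrics to measure, for each sample, how closely the decontaminated model's output distribution matches that of an uncontaminated model. Building on this, it introduces Uncertainty-Based Decontamination (UBD), which uses deep ensembles of the contaminated model to estimate per-sample memorization without needing an uncontaminated model or knowledge of which samples are contaminated. From the ensemble uncertainty it derives a per-sample correction scalar, used to construct a debiased target distribution that suppresses the inflated probability mass on correct answers. This target serves either as a post-hoc output correction (debiasing) or as a soft training signal for parameter updates (unlearning). Experiments on MMLU-Pro and MATH-MCQA across multiple LLM backbones show that UBD yields per-sample output distributions substantially closer to those of an uncontaminated model than paraphrasing or choice-permutation baselines, while preserving performance on uncontaminated data.

Link: https://arxiv.org/abs/2606.23313
Authors: Guangzhi Sun, Xiao Zhan, Mark Gales
Affiliations: University of Cambridge; Universitat Politècnica de València
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: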

Abstract:Benchmark-based evaluation is the dominant paradigm for assessing large language model (LLM) capabilities, yet data contamination inflates reported performance and undermines fair comparison. Existing decontamination methods are evaluated solely through aggregate accuracy, which can obscure substantial differences in per-sample model behaviour, and many require access to an uncontaminated model. In this paper, we propose a sample-level evaluation framework for decontamination that complements accuracy-based assessment with distributional distance metrics, measuring how closely a decontaminated model recovers the output distribution of an uncontaminated model on each sample. Building on this framework, we introduce Uncertainty-Based Decontamination (UBD), a family of methods that leverage deep ensembles of the contaminated model to estimate per-sample memorization without requiring an uncontaminated model or knowledge of which samples are contaminated. UBD estimates a per-sample correction scalar from ensemble uncertainty, which is used to construct a debiased target distribution that suppresses the inflated probability mass on correct answers induced by contamination. This target is then used either as a post-hoc output correction (debiasing) or as a soft training signal for parameter update (unlearning). Experiments on MMLU-Pro and MATH-MCQA across multiple LLM backbones demonstrate that UBD produces per-sample output distributions substantially closer to those of an uncontaminated model than paraphrasing or choice-permutation baselines, while preserving model performance on uncontaminated data.

[NLP-21] The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery

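Quick read: This paper studies the limits of CTC-internal scoring for N-best hypothesis selection in speech recognition. Enlarging the candidate set (G) widens search coverage, but the correlation between CTC-internal scores and per-utterance word error rate (WER) weakens as G grows, so the scores cannot single out linguistically better hypotheses. The paper attributes this to an information bottleneck: blank-path proliferation decouples acoustic confidence from linguistic plausibility, and the acoustic features no longer carry enough discriminative information. The proposed remedy is external linguistic information, introduced through minimum Bayes risk (MBR) decoding with a RoBERTa pseudo-log-likelihood (PLL) posterior. Applied without retuning across three domains (LibriSpeech, TED-LIUM 3, VoxPopuli) and several noise levels, the recipe gives significant gains in 11 of 13 conditions; with a large candidate set (G=128) it reaches 5.42% WER on the held-out test set (0.535 percentage points below greedy decoding, 9.0% relative). The paper also shows that standard minimum word error rate (MWER) training, although the CTC forward-backward algorithm gives it a lower-variance gradient estimate, fails at near-converged checkpoints because the training oracle gap is too small to provide a usable reward signal, which further illustrates the limits of optimizing on acoustic signals alone.

Link: https://arxiv.org/abs/2606.23306
Authors: Ivan Novosad
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 30 pages, 8 figures. Code and data: this https URL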

Abstract:We study the limits of CTC-internal scoring for N-best hypothesis selection and locate the information bottleneck separating acoustic confidence from linguistic plausibility. Eleven CTC-internal and acoustic-feature scoring strategies produce no statistically significant WER improvement over greedy decoding on LibriSpeech dev-other at G=16 (all p > 0.05). The exhaustion is systematic: CTC’s Spearman $\rho$ between hypothesis score and per-utterance WER degrades from -0.574 at G=4 to -0.270 at G=128, a 53% loss driven by blank-path proliferation. This establishes that the discriminative capacity of CTC-internal representations is saturated: no recombination of acoustic signals can close the oracle gap. Confirming that the bottleneck is linguistic, not acoustic, external linguistic information introduced via MBR decoding breaks through it. MBR-CER decoding with a RoBERTa pseudo-log-likelihood (PLL) posterior ($\tau=10$, G=128) achieves 5.42% WER on held-out LibriSpeech test-other (greedy 5.96%, $\Delta=-0.535$ pp, p < 0.0001, 9.0% relative). RoBERTa PLL $\rho$ degrades only 21% over the same range, retaining discriminating power where CTC loses it. Applied without retuning across two Zipformer architectures, three domains (LibriSpeech, TED-LIUM 3, VoxPopuli), and four MUSAN noise levels, the recipe gives significant gains in 11 of 13 conditions. On the training side, standard MWER training via the CTC forward-backward algorithm implements Rao-Blackwellized REINFORCE at the output projection (variance about 3x below Viterbi). Yet sequence-level fine-tuning fails at near-converged checkpoints: all four MWER configurations on CR-CTC collapse (+6.18 to +8.90 pp WER), as a training oracle gap of 0.007 pp provides no usable reward signal.
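
The MBR-CER selection step described in the abstract can be illustrated with a minimal sketch. It assumes the posterior over the N-best list is a softmax of per-hypothesis PLL scores divided by the temperature, which is one plausible reading of "PLL posterior" and is not confirmed by the abstract. The names `mbr_select` and `posterior`, the example hypotheses, and the scores are hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of hyp measured against ref."""
    return edit_distance(hyp, ref) / max(len(ref), 1)


def posterior(scores, tau):
    """Assumed posterior: softmax of scores / tau (numerically stabilized)."""
    top = max(scores)
    weights = [math.exp((s - top) / tau) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def mbr_select(hyps, scores, tau=10.0):
    """Return the hypothesis with the lowest expected CER under the posterior."""
    post = posterior(scores, tau)
    risks = [sum(p * cer(h, r) for r, p in zip(hyps, post)) for h in hyps]
    best = min(range(len(hyps)), key=risks.__getitem__)
    return hyps[best], risks


# Hypothetical 3-best list with made-up PLL scores (higher is better).
hyps = ["the cat sat", "the cat sad", "the bat sat"]
pll_scores = [-20.0, -25.0, -30.0]
best, risks = mbr_select(hyps, pll_scores, tau=10.0)
print(best, [round(r, 4) for r in risks])
```

Each hypothesis is scored against every other hypothesis as a pseudo-reference, so a candidate that is close to many probable candidates wins even when its own score is not the highest.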

[NLP-22] On the Effect of Segmentation Width and Cluster Size on Speech Resynthesis and Continuation in Generative Spoken Language Models INTERSPEECH2026

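[Quick Read]: This paper asks whether generative spoken language modeling (GSLM) can keep high-quality speech resynthesis and continuation at lower bitrates. Conventional GSLM typically relies on relatively high-bitrate discrete speech representations. The authors segment the discrete speech representations with fixed-width windows and train K-means models with different numbers of clusters, which yields a range of bitrate settings and allows a systematic evaluation of how bitrate affects generation. They find that intelligible and natural speech can still be synthesized below the baseline bitrate, and that continuation quality stays stable across multiple metrics, which suggests that the conventional setting may be redundant. LLM-based metrics correlate better with human subjective scores than conventional metrics, but the correlation remains low, pointing to the need for more stable automatic evaluation methods.

Link: https://arxiv.org/abs/2606.23285
Authors: Shunsuke Kando,Wataru Nakata,Shinnosuke Takamichi,Yusuke Miyao
Affiliations: The University of Tokyo; Keio University
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to Interspeech 2026

Click to view abstract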

Abstract:Generative Spoken Language Modeling (GSLM) enables text-free speech modeling by training language models (LMs) using discrete speech representations instead of textual transcription. In this paper, we investigate the performance of GSLM on speech synthesis and continuation using discrete speech representations with varying bitrates. We segment speech representations with fixed widths and train K-means models in multiple cluster sizes, resulting in various bitrate settings. We demonstrate that intelligible and natural speech can be synthesized at lower bitrate settings than the baseline. Furthermore, speech continuation quality remains stable at lower bitrates across multiple metrics, suggesting that the conventional GSLM setting may be redundant for effective speech generation. Although LLM-based metrics show higher correlation with human subjective score than conventional metrics, it remains low, highlighting the need for more stable automatic evaluation methods.
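
To make the link between segmentation width, cluster size, and bitrate concrete, here is a small sketch. It assumes one unit is emitted per segment of `segment_width` frames and that each unit costs log2(K) bits, which is an upper bound that holds when all K clusters are used equally often. The 50 Hz frame rate and the function name `bitrate_bps` are illustrative assumptions, not values taken from the paper.

```python
import math


def bitrate_bps(frame_rate_hz: float, segment_width: int, num_clusters: int) -> float:
    """Upper-bound bitrate in bits per second for a discrete unit stream.

    units per second = frame_rate_hz / segment_width
    bits per unit    = log2(num_clusters), assuming uniform cluster usage
    """
    return frame_rate_hz / segment_width * math.log2(num_clusters)


# Hypothetical settings: wider segments and fewer clusters both lower the bitrate.
for width, k in [(1, 256), (2, 256), (2, 64), (4, 64)]:
    print(f"width={width} K={k} -> {bitrate_bps(50.0, width, k):.1f} bps")
```

Under these assumptions, doubling the segment width halves the bitrate, while halving the number of clusters removes only one bit per unit.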

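[NLP-23] Towards Root Memories: Benchmarking and Enhancing Implicit Logical Memory Retrieval for Personalized LLMs

[Quick Read]: This paper addresses the weak retrieval of implicit logical memory by the memory systems of personalized large language models (LLMs) in long-dialogue scenarios. Existing retrieval methods rely mainly on semantic similarity and can miss memories that are logically relevant to the current intent but share little semantic overlap with it. To evaluate this, the authors construct IMLogic, the first high-quality benchmark targeting implicit logical memory retrieval. Their core solution is root memory, a structured, decision-preserving representation that distills reusable personalized logic from long-term user histories, together with RootMem, a plug-and-play framework that distills raw histories into root memories and uses an LLM-based router to activate the logically relevant ones, complementing semantic retrieval with personalized decision logic. Experiments show that RootMem significantly outperforms the strongest retrieval baselines and consistently improves the accuracy of existing memory agents.

Link: https://arxiv.org/abs/2606.23283
Authors: Hongxun Ding,Xiang Yu,Chengbing Wang,Jianfei Xiao,Keqin Bao,Wenjie Wang,Xiangnan He
Affiliations: University of Science and Technology of China, China
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract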

Abstract:Memory systems are essential for personalized Large Language Models (LLMs). However, existing retrieval methods in these systems primarily rely on semantic similarity, potentially missing logically critical memories with limited semantic overlap. Current benchmarks remain inadequate for evaluating this problem. To address this gap, we construct IMLogic, the first high-quality benchmark targeting implicit logical memory retrieval in long-dialogue scenarios. Motivated by this challenge, we introduce root memory, a structured, decision-preserving representation that distills reusable personalized logic from long-term user histories. We then propose RootMem, a plug-and-play framework that first distills raw histories into structured root memories and then uses an LLM-based router to activate logically relevant ones, complementing semantic retrieval with personalized decision logic. Extensive experiments demonstrate that RootMem significantly outperforms the strongest retrieval baselines and consistently boosts the accuracy of existing memory agents. Our benchmark and codes will be available at this https URL.

[NLP-24] Scaling LLM Knowledge Boundaries via Distribution-Optimized Synthesis EMNLP2026 ACL

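[Quick Read]: This paper addresses a key weakness of knowledge injection through synthetic data: current synthesis methods simply stop at a preset token count or a fixed data ratio and have no awareness of the knowledge distribution, which leaves some domains sparse and others redundant and limits the knowledge boundary of large language models (LLMs). The key solution is KDoS (Knowledge Distribution-optimized Synthesis), a framework that uses knowledge density as its core signal and drives synthesis through a three-stage feedback mechanism, moving from blind generation to distribution-optimized synthesis. Experiments on models from 0.6B to 16B parameters and data scales from 1B to 5B tokens indicate that an optimal knowledge distribution consistently maximizes boundary expansion, that this distribution is stable across backbones and data scales, and that KDoS outperforms baselines on six knowledge benchmarks. The work offers a new perspective and a practical framework for synthetic-data-driven knowledge injection.

Link: https://arxiv.org/abs/2606.23271
Authors: Songze Li,Yarong Lan,Zhongpu Bo,Zhaoyang Wang,Zhiqiang Liu,Yuan Yuan,Chengtao Gan,Menghao Qian,Enpei Niu,Xiaoke Guo,Yuanxiang Liu,Zhaoyan Gong,Xiangjin Hu,Liangyurui Liu,Jingdian Lu,Lei Liang,Jun Zhou,Huajun Chen,Wen Zhang
Affiliations: Zhejiang University; Ant Group; ZJU-Ant Group Joint Lab of Knowledge Graph
Categories: Computation and Language (cs.CL)
Comments: ACL ARR May (EMNLP 2026) Submission

Click to view abstract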

Abstract:Knowledge injection via synthetic data is crucial for enhancing Large Language Models (LLMs). However, current synthesis methods simply stop at preset token counts or fixed data ratios, lacking awareness of knowledge distribution. This results in some domains being sparse while others are redundant, limiting LLM knowledge boundaries. We revisit knowledge injection from a distribution perspective and hypothesize that an optimal knowledge distribution exists to maximize knowledge boundary expansion. We propose KDoS (Knowledge Distribution-optimized Synthesis), a framework that introduces knowledge density to drive synthesis through a three-stage feedback mechanism, shifting from blind generation to distribution-optimized synthesis. We construct Wikipedia-based synthetic data with varying knowledge distributions and conduct experiments on models from 0.6B to 16B (Qwen, Ling, LLaMA) and data scales from 1B to 5B tokens. Our key findings are: (1) an optimal knowledge distribution consistently maximizes boundary expansion; (2) this distribution is stable across backbones and scales; (3) KDoS outperforms baselines across six knowledge benchmarks. Our work offers a new perspective and practical framework for synthetic data-driven knowledge injection.

[NLP-25] Judgment-Grounded Expansion for Peer Review Generation

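[Quick Read]: This paper addresses the tension in automatic review generation between efficient automation and the accountability that scientific peer review demands. End-to-end methods generate comments quickly, but the human reviewer has little control over, or visibility into, the judgments behind them, which limits their use in settings that require accountability. The authors formalize judgment-grounded expansion, a human-AI collaboration mode in which the reviewer supplies an evaluative claim and the system expands it into candidate review comments. They model it as a structured generate-check-refine process and run a user study to collect human-model interaction data. For the two practical challenges of scalable evaluation and candidate set curation, they develop methods that simulate the process for large-scale evaluation and show that conformal prediction is well suited to balancing candidate set size against target coverage. The work establishes judgment-grounded expansion as a concrete task and provides empirical and methodological foundations for future collaborative review generation systems.

Link: https://arxiv.org/abs/2606.23233
Authors: Sheng Lu,Lizhen Qu,Iryna Gurevych
Affiliations: Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE; Monash University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract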

Abstract:Automatic review generation is a promising direction for accelerating scientific progress. While most work adopts an end-to-end setup, its fully automated nature makes it less suitable for settings that demand accountability. To better balance automation and accountability, we formalize judgment-grounded expansion, a human-AI collaboration mode where a reviewer provides an evaluative claim and the system expands it into review comment candidate(s). We model it as a structured generate-check-refine process and conduct a user study to collect human-model interaction data. We study two practical challenges for judgment-grounded expansion: scalable evaluation and candidate set curation. We develop methods to simulate the process for large-scale evaluation, and show that conformal prediction is well suited to balancing candidate set size and target coverage. Our work establishes judgment-grounded expansion as a concrete task and provides empirical and methodological foundations for the design of future collaborative review generation systems.
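
The trade-off between candidate set size and target coverage can be sketched with standard split conformal prediction. This is a generic illustration, not the authors' pipeline: it assumes each candidate comment has a model score in [0, 1], and that a held-out calibration set provides the nonconformity value `1 - score` of the candidate a reviewer actually accepted. All names and numbers are hypothetical.

```python
import math


def conformal_threshold(cal_nonconformity, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest value."""
    n = len(cal_nonconformity)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level: keep everything.
        return float("inf")
    return sorted(cal_nonconformity)[k - 1]


def candidate_set(candidates, cal_nonconformity, alpha):
    """Keep every candidate whose nonconformity (1 - score) is within the threshold."""
    qhat = conformal_threshold(cal_nonconformity, alpha)
    return [text for text, score in candidates if 1.0 - score <= qhat]


# Hypothetical calibration values and scored candidates.
calibration = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.90]
candidates = [("comment A", 0.9), ("comment B", 0.5), ("comment C", 0.3)]

print(candidate_set(candidates, calibration, alpha=0.25))
print(candidate_set(candidates, calibration, alpha=0.05))
```

A smaller `alpha` asks for higher coverage and therefore returns a larger set; with only seven calibration points, `alpha=0.05` cannot be certified and the sketch falls back to returning all candidates.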

[NLP-26] MuPPET: A Benchmark for Contextual Privacy of LLM Assistants in Multi-Party Conversations

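[Quick Read]: This paper addresses contextual privacy leakage by large language model (LLM) assistants in multi-party conversations. LLM agents are increasingly deployed in multi-user environments such as group chats, where they handle sensitive personal data; when an agent discloses private information there, it reaches every group member at once. The risk is harder to control than in one-to-one settings because every piece of information must be appropriate for every recipient. Existing contextual privacy benchmarks cover only single-interlocutor settings and leave this risk unmeasured. The authors introduce MuPPET (Multi-Party Privacy Exposure Testing), a benchmark for contextual privacy in multi-party conversations. Experiments show that models leak substantially more in multi-party settings than one-to-one evaluations suggest; frontier models are vulnerable, and smaller open-weights models, which are often preferred for local deployment with sensitive data, are even more so. Existing contextual privacy defences give only partial protection, degrade utility, and do not resolve the underlying party-tracking problem.

Link: https://arxiv.org/abs/2606.23217
Authors: Elena Sofia Ruzzetti,Cornelius Emde,Sangdoo Yun,Seong Joon Oh,Martin Gubri
Affiliations: Parameter Lab; University of Rome Tor Vergata; University of Oxford; NAVER AI Lab; KAIST AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract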

Abstract:LLM agents are increasingly deployed in multi-party environments, handling sensitive personal data on behalf of individual users, for instance in group chats. When such an agent discloses private information, it reaches every group member at once. This risk is structurally harder to control than in one-to-one settings, as every piece of private information must be appropriate for every recipient in the group. Yet all existing contextual privacy benchmarks consider only single-interlocutor settings, leaving multi-party privacy risks unmeasured. We introduce MuPPET (Multi-Party Privacy Exposure Testing), a benchmark for contextual privacy in multi-party conversations. Our experiments show that models leak substantially more in multi-party settings than one-to-one evaluations suggest. Frontier models are vulnerable, and smaller open-weights models, often preferred for local deployment with sensitive data, even more so. Existing contextual privacy defences offer only partial protection, degrade utility, and do not resolve the underlying party-tracking problem.

[NLP-27] CFPO: Counterfactual Policy Optimization for Multimodal Reasoning ICML2026

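[Quick Read]: This paper addresses severe grounding failures of large vision-language models (LVLMs) in multimodal reasoning, which the authors attribute to reinforcement learning (RL) paradigms that lack explicit counterfactual enhancement and causal learning mechanisms. The failures show up as a tendency to ignore visual evidence in favor of language priors, or as hallucination drift during long chain-of-thought reasoning. The key solution is CounterFactual Policy Optimization (CFPO), a framework that enforces causal consistency between visual perception and textual reasoning through a cross-modal counterfactual enhancement mechanism: the policy is regularized by maximizing the discrepancy between the model's predictions and those from a counterfactual state in which critical visual cues are suppressed. CFPO integrates with standard algorithms such as GRPO and DAPO without external reward models or additional supervision. Experiments report gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO).

Link: https://arxiv.org/abs/2606.23206
Authors: Zhangyuan Yu,Wanran Sun,Guangjing Yang,Xiaohu Wu,Qicheng Lao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ICML 2026. 17 pages

Click to view abstract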

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning. However, prevailing reinforcement learning (RL) paradigms lack explicit counterfactual enhancement and causal learning mechanisms. This fundamental deficiency results in severe grounding failures, manifesting as a tendency to ignore visual evidence in favor of language priors or exhibiting hallucination drift during long chain-of-thought reasoning. To address this root cause, we propose CounterFactual Policy Optimization (CFPO), a novel framework that enforces causal consistency between visual perception and textual reasoning. CFPO introduces a cross-modal counterfactual enhancement mechanism, which regularizes the policy by maximizing the discrepancy between the model’s predictions and those from a counterfactual state where critical visual cues are suppressed. This approach seamlessly integrates with standard algorithms like GRPO and DAPO without requiring external reward models or additional supervision. Extensive experiments demonstrate that CFPO significantly improves reasoning fidelity, achieving consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO). Code is available at this https URL.
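
The counterfactual regularizer can be pictured with a toy sketch. The abstract does not say which discrepancy measure or weighting CFPO uses, so this example assumes a KL divergence between the next-token distribution given the full image and the distribution given a version with the critical visual cues masked, added to the RL objective with a coefficient `beta`. Every name and number here is hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stabilized)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def regularized_objective(task_reward, logits_full, logits_masked, beta=0.1):
    """Assumed form: reward plus a bonus that grows when masking the visual
    cues changes the prediction, i.e. when the model relies on the image."""
    discrepancy = kl_divergence(softmax(logits_full), softmax(logits_masked))
    return task_reward + beta * discrepancy


# A model that ignores the image predicts the same thing either way (no bonus);
# a grounded model changes its prediction when the cues are removed.
ignores_image = regularized_objective(1.0, [2.0, 0.5, 0.1], [2.0, 0.5, 0.1])
uses_image = regularized_objective(1.0, [2.0, 0.5, 0.1], [0.3, 0.4, 0.2])
print(ignores_image, uses_image)
```

The bonus is zero when the masked and unmasked predictions coincide, which is the language-prior failure mode the summary describes.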

[NLP-28] When Does Intrinsic Self-Correction Help? A Task-Sensitive Analysis

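[Quick Read]: This paper examines when intrinsic self-correction (SC) is effective, given that models prompted to revisit their own answers without external feedback often struggle to judge whether the initial response was correct. The key idea is a task-sensitive view of SC: rather than treating it as a universally effective method, the authors identify settings where it may operate through different mechanisms, namely verifying explicit constraints, revisiting a complex reasoning process, or providing a second opinion over competing strategies in word-game tasks. Across multiple benchmarks and models, SC yields consistent gains when the task structure facilitates these modes of revision. SC is therefore best understood as a task-dependent inference-time strategy whose usefulness depends on the role the revision stage can play in a given task, not as a uniformly reliable way to improve initial outputs.

Link: https://arxiv.org/abs/2606.23196
Authors: Elroy Stav,Dvir Berlowitz,Maayan Orner,Sarit Kraus
Affiliations: Bar-Ilan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract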

Abstract:Intrinsic self-correction (SC) aims to improve large language model outputs by prompting a model to revisit its own initial answer without external feedback. Recent studies have questioned the reliability of this approach, showing that models often struggle to judge whether their initial responses are correct. In this work, we take a task-sensitive view of SC. Rather than asking whether it works in general, we examine settings where SC may operate through different mechanisms: verifying explicit constraints, revisiting a complex reasoning process, or providing a second opinion over competing strategies in word-game tasks. Across multiple benchmarks and models, we find that SC can yield consistent performance gains when the underlying task structure facilitates these modes of revision. These results suggest that SC is best understood as a task-dependent inference-time strategy whose usefulness depends on the role the revision stage can play in a given task, rather than as a uniformly reliable method for improving initial model outputs.

[NLP-29] Memory Contagion: Cross-Temporal Propagation of Evaluator Bias via Agent Memory

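[Quick Read]: This paper studies memory in long-running large language model (LLM) agents and targets an assumption that prior work on memory degradation under continuous consolidation leaves unexamined: that memories derive from unbiased experiences. It identifies and formalizes Memory Contagion, the cross-temporal propagation of evaluator bias through agent memory: when agents are trained or guided by biased evaluators, their trajectories become biased, and once those trajectories are stored and consolidated, the bias reaches future agents that retrieve from the same memory store, even when consolidation is perfect (the oracle condition). The key insight is that bias in memory does not stem only from flaws in the storage mechanism; biased input is a sufficient cause of contagion. Across two bias types (length preference and authority bias), consolidation has opposite effects: it robustly attenuates length bias while preliminarily amplifying authority bias (a single-run estimate). No safe threshold was observed, with propagation detected at contamination rates as low as p=0.2. The findings expose a vulnerability in current agent memory designs and provide formal tools for measuring cross-temporal bias propagation.

Link: https://arxiv.org/abs/2606.23195
Authors: Zewen Liu
Affiliations: Independent Researcher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, 4 tables

Click to view abstract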

Abstract:Large Language Model (LLM) agents increasingly rely on memory systems to maintain long-term coherence. Recent work shows that agent memories degrade during continuous consolidation. However, existing research assumes memories are derived from unbiased experiences. In this work, we identify and formalize a novel phenomenon: Memory Contagion – the cross-temporal propagation of evaluator bias through agent memory. We show that when agents are trained or guided by biased evaluators, their experiences become biased; when these trajectories are stored and consolidated into memory, the bias propagates to future agents retrieving from the same memory store, even when consolidation is perfect (oracle). Across two bias types (length preference, authority bias) and four experimental phases, we demonstrate: (1) Memory Contagion occurs even with perfect consolidation (oracle condition), proving that biased input is a sufficient cause of contagion; (2) Consolidation has opposite effects depending on bias type – robustly attenuating length bias while preliminarily amplifying authority bias (single-run estimate), suggesting a bias-type-dependent interaction; (3) No observed safe threshold: bias propagation is detected at contamination rates as low as p=0.2. Our findings expose a critical vulnerability in current agent memory designs and provide formal tools for measuring cross-temporal bias propagation.

[NLP-30] Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

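[Quick Read]: This paper addresses the privacy risk created when computer-use agents (CUAs) operate across applications: while working in one context, an agent can pull in and disclose information from another context where it is inappropriate. To evaluate this risk systematically, the authors propose AgentCIBench, an evaluation harness with executable, deterministically scored scenarios that target three common failure modes: visual co-location (the agent pulls in prohibited items that sit next to the task target in the UI), task-ambiguity overshare (the agent dumps dense personal state in response to an under-specified prompt), and recipient misalignment (the agent sends content to an addressee for whom it is inappropriate). An evaluation of 15 frontier agents finds that 11 leak on more than 50% of scenarios, with an average leakage of 67.9%, and that the same failures persist when agents act end-to-end in the environment. The authors position contextual disclosure testing with AgentCIBench as a pre-deployment safety check.

Link: https://arxiv.org/abs/2606.23189
Authors: Anmol Goel,Iryna Gurevych
Affiliations: TU Darmstadt; National Research Center for Applied Cybersecurity ATHENE
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract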

Abstract:Computer-use agents (CUAs) now act on a user’s behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation harness that turns this risk into executable, deterministically scored scenarios. We target three common failure modes in CUAs: visual co-location, where the agent pulls in prohibited items that sit next to the task target in the UI; task-ambiguity overshare, where the agent dumps dense personal state in response to an under-specified prompt; and recipient misalignment, where the agent sends content to an addressee for whom it is inappropriate. We evaluate 15 frontier agents and find a surprisingly high failure rate: 11 of 15 leak on more than 50% of scenarios, with an average leakage of 67.9%, and the same failures persist when agents act end-to-end in the environment to complete the task. We release AgentCIBench to encourage the development of safer computer-use agents and position contextual disclosure testing as a pre-deployment safety check.

[NLP-31] DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models

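[Quick Read]: This paper addresses how hybrid reasoning models should allocate reasoning effort across questions of different difficulty, that is, how to route each query between direct answering and extended thinking so that easy problems avoid unnecessary reasoning and hard problems receive enough budget. Existing routers typically require labeled training data or fix thinking budgets up front, ignoring answer-level evidence from the model itself. The authors propose DART (Draft-Agreement Routing), a training-free routing framework that samples two cheap no-think drafts, accepts the direct answer when the drafts agree, and predicts a thinking budget from draft entropy when they disagree. The approach needs no labeled data and no gradient updates, and its Stage 1 signal extends across model scales (0.6B-32B), model families, and API-only hosted settings. DART preserves or improves always-thinking accuracy in most settings while reducing thinking-token use: on Olympiad-level math problems accuracy improves by up to 9.0 points while thinking tokens drop 15-69%, and on code reasoning under execution-based equivalence accuracy improves by up to 22.5 points while thinking tokens drop 51-63%.

Link: https://arxiv.org/abs/2606.23181
Authors: Jungseob Lee,Seongtae Hong,Seungjun Lee,Jaehyung Seo,Junyoung Son,Sugyeong Eo,Chanjun Park,Hyeongju Park,Hyeonseok Moon,Heuiseok Lim
Affiliations: Korea University; 42dot; Konkuk University; Yonsei University; Soongsil University; Kumoh National Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 4 figures, 16 tables. Code: this https URL

Click to view abstract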

Abstract:Hybrid reasoning models can answer directly or spend extra tokens on extended thinking. A practical router should choose between these modes for each query, so easy problems avoid unnecessary reasoning and hard problems receive enough budget to finish the answer. Existing routers move in this direction, but they typically require labeled training data or fix thinking budgets up front, ignoring answer-level evidence from the model itself. We introduce DART, a training-free routing framework that samples two cheap no-think drafts, accepts direct answering when the drafts agree, and predicts a thinking budget from draft entropy when they disagree. Across the main comparisons, DART preserves or improves always-thinking accuracy in most settings while reducing thinking-token use. On math reasoning, accuracy improves by up to +9.0 points on Olympiad-level problems while thinking tokens drop 15-69%. On code reasoning under execution-based equivalence, accuracy improves by up to +22.5 points while thinking tokens drop 51-63%. The Stage 1 signal extends across model scales (0.6B-32B), model families, and API-only hosted settings, with no labeled data and no gradient updates required.
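
The routing rule in the abstract (two drafts, agree or disagree, then an entropy-based budget) can be sketched as follows. How DART normalizes answers, measures draft entropy, and maps entropy to a budget is not specified in the abstract, so the string normalization, the linear mapping, and the constants `min_budget`, `max_budget`, and `max_entropy` are all assumptions made for illustration.

```python
def normalize(answer: str) -> str:
    """Hypothetical answer normalization: trim whitespace, ignore case."""
    return answer.strip().lower()


def route(draft_a: str, draft_b: str, draft_entropy: float,
          min_budget: int = 256, max_budget: int = 4096,
          max_entropy: float = 2.0):
    """Return ("direct", 0) when the drafts agree, else ("think", budget).

    The budget grows linearly with draft entropy and is clipped to
    [min_budget, max_budget]; this mapping is an assumption.
    """
    if normalize(draft_a) == normalize(draft_b):
        return ("direct", 0)
    fraction = min(max(draft_entropy / max_entropy, 0.0), 1.0)
    budget = int(min_budget + fraction * (max_budget - min_budget))
    return ("think", budget)


# Hypothetical drafts: agreement skips thinking, disagreement buys a budget.
print(route("42", " 42 ", draft_entropy=1.5))
print(route("42", "41", draft_entropy=1.0))
```

Because both the agreement check and the entropy come from the model's own cheap drafts, a router of this shape needs no labels or gradient updates, which matches the training-free claim.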

[NLP-32] Synthesizing the Lombard Effect: Multi-Level Control of Speech Clarity and Vocal Effort in TTS INTERSPEECH2026

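[Quick Read]: This paper addresses the lack of a natural clarity-enhancing mechanism in speech synthesis systems for challenging conditions such as background noise or hearing-impaired listeners, and models the Lombard effect, the human tendency to speak louder and more clearly in such conditions. The key solution is a flow-matching based text-to-speech (TTS) model trained with pseudo-labels for vocal effort and articulation. It provides continuous and disentangled control of vocal effort and articulation and supports word-level emphasis for clarifying specific segments of an utterance. Experiments show that these controls improve clarity-related acoustic features, and speech-in-noise experiments show that the model reproduces the intelligibility gains of human clear speech in noisy conditions.

Link: https://arxiv.org/abs/2606.23176
Authors: Seymanur Akti,Alexander Waibel
Affiliations: Karlsruhe Institute of Technology (KIT), Germany; Carnegie Mellon University (CMU), USA; KIT Campus Transfer (KCT), Germany
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026

Click to view abstract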

Abstract:Humans tend to speak louder and clearer in challenging environments, such as noisy conditions or when addressing hearing-impaired listeners, which is called the Lombard effect. To simulate this behavior in speech synthesis systems, we introduce a flow-matching based text-to-speech (TTS) model trained with vocal effort and articulation pseudo-labels. The proposed model achieves continuous and disentangled control of vocal effort and articulation, while also enabling word-level emphasis for clarifying specific segments of an utterance. Experimental results show that these control mechanisms effectively improve clarity-related acoustic features. Furthermore, speech-in-noise experiments demonstrate that our model successfully simulates the intelligibility gains of human clear speech in noisy conditions.

[NLP-33] Same question, different history: language, national identity and credit in large language models

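[Quick Read]: This paper asks how query language shapes the historical narrative that large language models (LLMs) give for disputed inventions and discoveries, and whether the same question asked in different languages systematically favors different countries or figures. The authors analyse eleven widely used LLMs on 21 disputed inventions and discoveries in twelve languages (75,896 responses). Models generally acknowledge that credit is contested, but query language systematically affects which claimant is surfaced: lower-status claimants are more likely to appear when the question is asked in their associated language, whereas dominant Anglophone figures remain stable across languages. The pattern persists after controlling for response length, model differences, historical prominence, and levels of national commemoration. The authors interpret LLMs as distributed systems of cultural memory rather than neutral stores of knowledge, in which language acts as a switch that conditions which histories become visible and contributes to a computational form of banal nationalism.

Link: https://arxiv.org/abs/2606.23164
Authors: William Guey,Pierrick Bougault,Wei Zhang,Vitor D. de Moura,José O. Gomes
Affiliations: Tsinghua University; Federal University of Rio de Janeiro
Categories: Computation and Language (cs.CL)
Comments: 27 pages (main text and Supplementary Information combined), 5 figures, 9 tables

Click to view abstract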

Abstract:Who invented the radio, Russia’s Alexander Popov or Italy’s Guglielmo Marconi? Was the telephone the achievement of Bell in the United States or Meucci in Italy? Does printing belong to China’s Bi Sheng or Germany’s Gutenberg? The answer depends not only on historical record but also on language and perspective. We analyse eleven widely used large language models across 21 disputed inventions and discoveries, evaluated in twelve languages and 75,896 responses. While models generally acknowledge that credit is contested, query language systematically affects which claimant is surfaced. Lower-status claimants are more likely to appear when questions are asked in their associated language, whereas dominant Anglophone figures remain stable across languages. These patterns persist after controlling for response length, model differences, historical prominence, and levels of national commemoration. Language thus acts as a switch that activates different national versions of the same history, producing systematically different national memories from the same question. We interpret this as evidence that large language models function as distributed systems of cultural memory, where language conditions which histories become visible, contributing to a computational form of banal nationalism.

[NLP-34] Koshur Pixel: a large-scale synthetic OCR dataset for Kashmiri

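[Quick Read]: This paper addresses two obstacles to optical character recognition (OCR) for low-resource languages such as Kashmiri: scarce annotated data and the complexity of script-specific rendering. Kashmiri is written primarily in the Perso-Arabic Nastaliq script, whose contextual glyph shaping, dense ligatures, and orthographic variability make recognition harder. The authors introduce Koshur Pixel, the first large-scale synthetic OCR dataset for Kashmiri, comprising 613,078 image-text pairs generated from the KS-PRET-5M corpus with the SynthOCR-Gen framework. The dataset spans multiple fonts and textual granularities, from individual words to full-page documents, and applies more than 25 augmentation strategies that emulate real-world document degradations. By synthesizing diverse training data automatically, it offers a scalable, cost-effective alternative to manual annotation and a foundational resource for training OCR systems, digitizing Kashmiri textual heritage, and advancing language technologies for a severely under-resourced language.

Link: https://arxiv.org/abs/2606.23144
Authors: Haq Nawaz Malik,Faizan Iqbal,Nahfid Nissar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract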

Abstract:Optical Character Recognition (OCR) for low-resource languages is often constrained by the lack of annotated training data and the complexity of script-specific rendering. Kashmiri, written primarily in the Perso-Arabic Nastaliq script, presents additional challenges due to contextual glyph shaping, dense ligatures, and orthographic variability. We introduce Koshur Pixel, the first large-scale synthetic OCR dataset for Kashmiri, comprising 613,078 image-text pairs generated from the KS-PRET-5M corpus using the SynthOCR-Gen framework. The dataset spans multiple fonts and textual granularities, ranging from individual words to full-page documents, and incorporates more than 25 augmentation strategies that emulate real-world document degradations. Koshur Pixel provides a scalable and cost-effective alternative to manual annotation, establishing a foundational resource for training OCR systems, digitizing Kashmiri textual heritage, and advancing language technologies for a severely under-resourced language.

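[NLP-35] Managing Procedural Memory in LLM Agents: Control Adaptation and Evaluation

[Quick Read]: This paper addresses how well the reusable skills produced by procedural memory transfer when LLM agents use it on recurring workplace tasks, which remains poorly understood. The key contribution is AFTER, a benchmark of 382 realistic enterprise tasks spanning six professional roles and 22 procedural skills, with controlled evaluation settings for local improvement, cross-task transfer, cross-role transfer, and cross-model generalization. Experiments show that procedural memory delivers consistent gains in industrial workflows: a single refinement round improves aggregate performance by 3.7-6.7 points, and skills evolved from diverse multi-model execution traces reach 73.1% cross-model test accuracy, outperforming all single-model trace sources. Some skills generalize broadly across tasks and models, whereas others become specialized to role-specific workflows and lose effectiveness under transfer. The results give practical guidance for building, evaluating, and deploying procedural memory systems in production agent platforms.

Link: https://arxiv.org/abs/2606.23127
Authors: Julia Belikova,Rauf Parchiev,Evgeny Egorov,Grigorii Davydenko,Gleb Gusev,Andrey Savchenko,Maksim Makarenko
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Click to view abstract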

Abstract:Procedural memory is increasingly used to improve LLM agents on recurring workplace tasks, yet its ability to produce reusable skills remains poorly understood. We introduce AFTER, a benchmark of 382 realistic enterprise tasks spanning six professional roles and 22 procedural skills, designed to evaluate how skills transfer across tasks, roles, and model backbones. The benchmark includes controlled evaluation settings for local improvement, cross-task transfer, cross-role transfer, and cross-model generalization. Experiments show that procedural memory delivers consistent gains in industrial workflows: a single refinement round improves aggregate performance by 3.7-6.7 points, while skills evolved from diverse multi-model execution traces achieve 73.1% cross-model test accuracy, outperforming all single-model trace sources. We further find that some skills generalize broadly across tasks and models, whereas others become specialized to role-specific workflows and lose effectiveness under transfer. These results provide practical guidance for building, evaluating, and deploying procedural memory systems in production agent platforms.

[NLP-36] PRIDE: Privileged Information-enhanced Distillation for Empathetic Dialogue Generation

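[Quick Read]: This paper addresses the difficulty of deploying large language models for empathetic dialogue generation in resource-constrained environments because of their computational cost. Existing knowledge distillation methods often fail to transfer the nuanced understanding that empathy requires, because they overlook implicit contextual cues. The authors propose PRIDE (Privileged Information-enhanced Distillation for Empathetic Dialogue Generation), which uses privileged information, such as expert psychological annotations or future event summaries, that is available during training but not at inference time. The method has three components: (1) an empathy-reasoning prompt that guides the teacher to decompose the empathetic process step by step into understanding feelings and analyzing situations; (2) a multi-source attention mechanism that directs the student to integrate the privileged information; and (3) a dual-alignment loss that combines reverse Kullback-Leibler divergence and maximum mean discrepancy (MMD) for knowledge transfer at both the logit and feature levels. Experiments on multi-modal and text-only datasets show competitive performance, in some cases matching or surpassing larger teacher models in accuracy and semantic relevance.

Link: https://arxiv.org/abs/2606.23124
Authors: Jiaqiang Wu,Zhouan Zhu,Shangfei Wang
Affiliations: Anhui Robot Technology Standard Innovation Base, School of Computer Science and Technology, University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract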

Abstract:Large language models have demonstrated significant capabilities in generating diverse and context-aware responses for empathetic dialogue. However, their computational demands severely limit their deployment in resource-constrained environments. While knowledge distillation offers a promising compression solution, it often fails to transfer the nuanced understanding essential for empathy, as it overlooks the implicit contextual cues that guide human connection. To bridge this gap, we propose a privileged information-enhanced knowledge distillation method for empathetic dialogue generation (PRIDE). Our method leverages privileged information, such as expert psychological annotations or future event summaries, which is available exclusively during training but unavailable at inference time. This allows us to transfer the teacher model’s empathetic reasoning to smaller models without relying on extra inputs during deployment. Specifically, PRIDE has three key components: (1) An empathy-reasoning prompt that guides the teacher to explicitly decompose the empathetic process into understanding feelings and analyzing situations step-by-step; (2) A multi-source attention mechanism that directs the student to effectively integrate privileged information; (3) A dual-alignment loss that combines reversed Kullback-Leibler divergence and maximum mean discrepancy to ensure robust knowledge transfer at both logit and feature levels. Experiments on multi-modal and text-only datasets demonstrate that our method achieves competitive performance, and in some cases matches or even surpasses larger teacher models in terms of accuracy and semantic relevance.
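
The dual-alignment loss in component (3) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: reverse KL is taken as KL(student || teacher) over output probabilities, MMD is the biased squared estimate with an RBF kernel over feature vectors, and the weight `lam` and kernel width `gamma` are hypothetical.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(s * math.log((s + eps) / (t + eps))
               for s, t in zip(student_probs, teacher_probs))


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(a, b):
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def dual_alignment_loss(student_probs, teacher_probs,
                        student_feats, teacher_feats, lam=1.0):
    """Logit-level term (reverse KL) plus feature-level term (MMD^2)."""
    return (reverse_kl(student_probs, teacher_probs)
            + lam * mmd_squared(student_feats, teacher_feats))


# Hypothetical toy values: one output distribution and two feature vectors each.
teacher_p, student_p = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
teacher_f = [[1.0, 0.0], [0.9, 0.1]]
student_f = [[0.2, 0.8], [0.1, 0.9]]
print(dual_alignment_loss(student_p, teacher_p, student_f, teacher_f))
```

Both terms vanish when the student matches the teacher exactly, so the loss pulls the student toward the teacher at the output level and the feature level together.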

[NLP-37] Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

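[Quick Read]: This paper addresses the challenge multi-turn tool-using agents face in coordinating long-horizon tool sequences while tracking dialogue state and policy constraints. Existing approaches often separate inference-time orchestration from parameter-level learning, which leaves tool selection weakly structured and preference updates vulnerable to train-deployment prompt mismatch. The proposed ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops. On top of it, the author constructs 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filters them through action-correctness annotations, and trains direct preference optimization (DPO) under the same ToolGraph context used at inference. Across 375 tau2-bench tasks, ToolGraph raises the weighted average reward from 0.304 to 0.338 (+11.2% relative) and ToolGraph+DPO reaches 0.355 (+16.8% over the baseline), with the DPO gain concentrated in airline and retail. Fine-grained diagnostics show that roughly half of telecom trajectories exhaust the step budget before action execution, and that chosen reward positivity is the most useful checkpoint signal across the 16 evaluated DPO configurations.

Link: https://arxiv.org/abs/2606.23112
Authors: Jiaqiang Tang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 2 figures, 2 tables

Click to view abstract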

Abstract:Multi-turn tool-using agents must coordinate long-horizon tool sequences while tracking dialogue state and policy constraints. Existing approaches often separate inference-time orchestration from parameter-level learning, leaving tool selection weakly structured and preference updates vulnerable to train–deployment prompt mismatch. For within-benchmark self-improvement, ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops. We then construct 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filtered through action-correctness annotations, and train DPO under the same ToolGraph context used at inference. Across 375 tau2-bench tasks, ToolGraph raises the weighted average reward from 0.304 to 0.338 (+11.2% relative), while ToolGraph+DPO reaches 0.355 (+16.8% over the baseline), with the DPO gain concentrated in airline and retail. Fine-grained diagnostics further show that roughly half of telecom trajectories exhaust the step budget before action execution and that chosen reward positivity is the most useful checkpoint signal across our 16 evaluated DPO configurations.
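
Two ingredients named in the abstract, transition weights from successful rollouts and divergence points between trajectories, can be sketched as below. The sketch assumes a rollout is a plain list of tool names and that the divergence point is the first index where a successful and a failed trajectory differ after sharing a prefix. The paper's state-based matching is not reproduced, and the tool names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_weights(successful_rollouts):
    """Estimate P(next tool | current tool) from successful tool sequences."""
    counts = defaultdict(Counter)
    for rollout in successful_rollouts:
        for current, nxt in zip(rollout, rollout[1:]):
            counts[current][nxt] += 1
    return {tool: {nxt: c / sum(following.values()) for nxt, c in following.items()}
            for tool, following in counts.items()}


def divergence_point(chosen, rejected):
    """Index of the first differing action, or None if there is none
    (identical trajectories, or one is a prefix of the other)."""
    for index, (good, bad) in enumerate(zip(chosen, rejected)):
        if good != bad:
            return index
    return None


def preference_pair(chosen, rejected):
    """Build (shared prefix, preferred action, dispreferred action)."""
    index = divergence_point(chosen, rejected)
    if index is None:
        return None
    return chosen[:index], chosen[index], rejected[index]


# Hypothetical rollouts over made-up tool names.
wins = [["search", "get_user", "update"],
        ["search", "get_user", "cancel"],
        ["search", "update"]]
print(transition_weights(wins)["search"])
print(preference_pair(["search", "get_user", "update"],
                      ["search", "get_user", "search"]))
```

Pairing the two actions at the divergence point under a shared prefix gives a preference example whose context matches what the agent sees at inference time, which is the train-deployment alignment the summary emphasizes.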

[NLP-38] A Dual-Track Framework for Template-Constrained LaTeX Conversion

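【Quick Read】: This paper targets two core problems in mapping structured Markdown drafts to template-compliant formats such as LaTeX: deterministic rule-based converters fail to handle asset insertion and template-specific constraints correctly, while pure end-to-end large language model (LLM) generation tends to introduce semantic drift and hallucinations that are hard to debug. The key to the solution is a Dual-Track Framework that decouples template formatting from document processing. An offline track extracts template constraints into a reusable manifest; an online track runs a hybrid execution pipeline that uses the LLM only for reasoning-intensive components (semantic metadata, bibliographic references, complex figure and table layouts) and hands deterministic processing to rule-based engines. This keeps the LLM's strength in semantic understanding while the rule engine enforces format compliance. Evaluated on 7 LaTeX templates and 56 published research papers, the method preserves structural fidelity better, satisfies diverse layout constraints, and achieves a higher compilation success rate than previous baselines.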

Link: https://arxiv.org/abs/2606.23107
Authors: Chung Cheuk Hei, Liu Li
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computation and Language (cs.CL)
Comments: 6 pages (excluding references), 10 figures

Abstract:With the increasing demands for advanced document conversion, mapping structured Markdown drafts into template-compliant formats like LaTeX remains a challenge. Existing approaches largely depend on either deterministic rule-based converters or pure end-to-end Large Language Model (LLM) generation. The former fails to correctly handle asset insertions and template-specific constraints, while the latter tends to induce semantic drift, leading to hallucinations that are difficult to debug. To address these limitations, we introduce a robust Dual-Track Framework that systematically decouples template formatting from document processing: an offline track extracts template constraints into a reusable manifest, while an online track implements a hybrid execution pipeline. This pipeline confines LLM usage exclusively to reasoning-intensive components (e.g., semantic metadata, bibliographic references, and complex visual/tabular layouts) while delegating rule-based engines for deterministic processing. Empirical evaluation across 7 LaTeX templates and 56 published research papers demonstrates that our method preserves better structural fidelity, satisfies diverse layout constraints, and achieves a higher compilation success rate compared to the previous baselines.

[NLP-39] Cognitive Digital Twins: Ethical Risks and Governance for AI Systems That Model the Mind

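【Quick Read】: This paper addresses the new governance challenges raised by cognitive digital twins (CDTs): dynamic computational representations of a specific person's cognition, updated from behavioral, contextual, or physiological data to model, predict, or simulate that cognition, or to act as the person's communicative or decision-making proxy. The core problem is that existing governance strategies for personal assistants, autonomous agents, recommender systems, and automated decision systems only partially cover the risks that arise at the level of cognitive representation. The key to the solution is a 5A governance framework organized around authority, autonomy, access and control, accountability, and availability, which governs the cognitive representation itself rather than only data processing or final decisions. The paper identifies CDT-specific risks (misrepresentation, epistemic authority shifts, shadow twins, simulated participation, proxy action, and proxy-power asymmetries) and proposes requirements for high-risk CDTs that strengthen consent, purpose limitation, validity, traceability, contestation, independent review, and model retirement.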

Link: https://arxiv.org/abs/2606.23094
Authors: Vamshi Krishna Bonagiri, Juan Nicolas Sepulveda-Arias, Abdoul Jalil Djiberou Mahamadou, Monojit Choudhury
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work under review

Abstract:As AI systems become increasingly persistent and personalized, they make possible a class of technologies that we call cognitive digital twins (CDTs): dynamic computational representations of a specific person’s cognition, updated from behavioral, contextual, or physiological data in order to model, predict, or simulate that person’s cognition, or to act as that person’s communicative or decision-making proxy. CDTs combine cognitive inference with longitudinal representation, simulation, and proxy action in ways that existing governance strategies for personal assistants, autonomous agents, recommender systems, and automated decision systems only partially address. This paper makes four contributions. First, we define CDTs and distinguish them from adjacent systems. Second, we introduce a 5A governance framework organized around authority, autonomy, access and control, accountability, and availability. Third, we identify CDT-specific risks, from misrepresentation and epistemic authority shifts to shadow twins, simulated participation, proxy action, and proxy-power asymmetries. Fourth, we analyze governance gaps and propose requirements for high-risk CDTs that strengthen consent, purpose limitation, validity, traceability, contestation, independent review, and model retirement. Existing frameworks primarily regulate data processing, automated decisions, or autonomous actions; CDTs also require governance at the level of cognitive representation itself, before any final decision or external action occurs. We argue that CDTs require governance not only because they can act for people, but because they can become infrastructures through which cognition is represented, simulated, classified, and operationalized.

[NLP-40] PIVOTSBench: Evaluating Fine-Grained Interpersonal Relationship Reasoning in Multimodal Large Language Models

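【Quick Read】: This paper addresses the limited ability of existing multimodal large language models (MLLMs) to reason about fine-grained interpersonal relationships. Humans understand such relationships intuitively, and the reasoning is inherently multimodal, yet it remains largely unexplored for MLLMs. The authors introduce PIVOTS, the first benchmark built from Social-IQ 2.0 and YouTube data to evaluate how well MLLMs predict bidirectional interpersonal relationship dimensions grounded in established psychology research. The key elements are: auxiliary tasks that test whether a model can identify and use the critical visual cues behind its predictions; ablations on the effect of visual modalities and of explicit social role information in conversational utterances; and a comparison of joint versus pairwise prediction settings for scoring the bidirectional dimensions. The evaluation covers both proprietary and open-source MLLMs.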

Link: https://arxiv.org/abs/2606.23092
Authors: Shuxiang Zhang, Yiting Yin, Wenxuan Song, Yuhang Wu, Miao Liu
Affiliations: Sun Yat-sen University; University of Michigan; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Humans possess an innate ability to understand fine-grained interpersonal relationships, which is central to everyday social interactions. Although such reasoning is inherently multimodal, it remains largely unexplored by existing multimodal large language models (MLLMs). To address this gap, we introduce PIVOTS, the first benchmark built from Social-IQ 2.0 and YouTube data to evaluate MLLMs’ ability to predict bidirectional interpersonal relationship dimensions grounded in established psychology research. In addition, PIVOTS includes auxiliary tasks that assess models’ ability to identify and leverage the critical visual cues underlying such predictions. We evaluate both proprietary and open-source MLLMs and conduct detailed ablation studies to analyze the effects of visual modalities and explicit social role information in conversational utterances. We further examine how joint and pairwise prediction settings benefit MLLMs in scoring bidirectional PIVOTS dimensions. Project page and resources: this https URL .

[NLP-41] Unlimited OCR Works

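【Quick Read】: This paper addresses the growing memory footprint and slowing inference that end-to-end optical character recognition (OCR) models suffer during long-text generation. Methods such as DeepSeek OCR use a large language model (LLM) decoder to exploit language priors, but they rely on an accumulating key-value (KV) cache whose memory cost rises and whose generation speed falls as the output sequence grows, unlike humans, whose efficiency stays stable in long-horizon copying tasks. The proposed Unlimited OCR replaces all attention layers in the decoder with Reference Sliding Window Attention (R-SWA), which lowers attention computation cost and keeps the KV cache at a constant size throughout decoding. Combined with the high compression rate of the DeepSeek OCR encoder, the model can transcribe dozens of pages in a single forward pass within a standard 32K maximum length. The authors present R-SWA as a general-purpose parsing attention mechanism that applies beyond OCR to tasks such as automatic speech recognition (ASR) and translation.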

Link: https://arxiv.org/abs/2606.23050
Authors: Youyang Yin, Huanhuan Liu, YY, Qunyi Xie, Chaorun Liu, Shiqi Yang, Shaohua Wang, Zhanlong Liu, Hao Zou, Jinyue Chen, Shu Wei, Jingjing Wu, Mingxin Huang, Zhen Wu, Guibin Wang, Tengyu Du, Lei Jia
Affiliations: Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have once again thrust OCR into the spotlight. A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation. This stands in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate human parsing working memory. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces attention computation costs while maintaining a constant KV cache throughout the entire decoding process. By combining the high compression rate of DeepSeek OCR’s encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism - beyond OCR, it is equally applicable to tasks such as ASR, translation, etc. Codes and model weights are publicly available at this http URL.
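The abstract does not spell out how R-SWA works, so the following is only a generic illustration of why a sliding window yields a constant-size KV cache. It assumes a fixed set of pinned "reference" entries (for example, encoder tokens) plus a bounded window of recent decoded entries; the class name `ReferenceSlidingCache` and this split are assumptions, not the paper's design.

```python
from collections import deque


class ReferenceSlidingCache:
    """Toy cache: pinned reference entries plus a bounded recent window."""

    def __init__(self, reference, window):
        self.reference = list(reference)    # never evicted
        self.window = deque(maxlen=window)  # oldest entry dropped when full

    def append(self, entry):
        self.window.append(entry)

    def visible(self):
        """Entries a decoding step could attend to."""
        return self.reference + list(self.window)


cache = ReferenceSlidingCache(reference=["img0", "img1"], window=3)
sizes = []
for t in range(10):
    cache.append(f"tok{t}")
    sizes.append(len(cache.visible()))
print(sizes)            # grows until the window is full, then stays flat
print(cache.visible())
```

Once the window fills, the number of visible entries stops growing no matter how many tokens are decoded, which is the property the abstract attributes to the constant KV cache.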

[NLP-42] Training Open Models for Agentic Phone Use

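【Quick Read】: This paper addresses the difficulty of training open models for reliable agentic phone use. The environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. The proposed PhoneBuddy recipe combines a real-app environment with PhoneWorld, a mock-app environment that reconstructs runnable mock apps from real GUI usage structure. Trajectories collected in both environments first feed a shared supervised fine-tuning stage; the authors then compare reinforcement learning (RL) on real apps only against mixed RL across both environments. In a 150-task human evaluation on real phones, task success rate rises from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL; on AndroidWorld the same progression goes from 60.3% to 77.2% to 83.2%. Mock-app training is therefore not a replacement for real-app RL but a complementary source of scalable, resettable, and automatically checked interaction. Gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an open challenge.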

Link: https://arxiv.org/abs/2606.23049
Authors: Zhengyang Tang, Xin Lai, Pengyuan Lyu, Xinyuan Wang, Tianyi Bai, Chenxin Li, Yiduo Guo, Huawen Shen, Yuxuan Liu, Junyi Li, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Ji-Rong Wen, Rui Yan, Chengquan Zhang, Han Hu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult because the environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app RL against mixed RL across both environments. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, the same progression rises from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.

[NLP-43] The Model as One Rater Among Several: Measuring Political Positions in Data-Sparse Regions with a Language-Model Panel

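【Quick Read】: This paper addresses the fact that most tools for measuring political positions (manifesto coding, expert surveys, text-scaling models) were built and validated on Western party systems and work poorly, or not at all, elsewhere. The core idea is to treat a large language model (LLM) not as a measurement device but as a single, fallible rater in a panel, much as an expert survey treats one expert, so that the value comes from pooling many judges. The method adds an applicability rule that keeps a score of zero distinct from a blank, and a lens system that separates what an actor says from what it does. Three results are reported. With a definition-free round held fixed, adding written axis definitions moves scores by a mean of 1.8 points on a 21-point scale and tightens agreement between raters (mean absolute gap 2.81 to 2.50; r 0.81 to 0.89). Across nine models from eight laboratories in two countries, Krippendorff's alpha is 0.86 on both an interval and an ordinal metric, and it stayed stable as the panel grew from five raters to nine; this measures reliability, not validity. Where the panel disagrees, the disagreement is informative, and a blind triple-coding attributes about two-thirds of the sharpest split to interpretation rather than error. The method still lacks human validation. The worked example is the Middle East and North Africa, and the author expects the method to carry to other regions that the standard tools leave out.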

Link: https://arxiv.org/abs/2606.23042
Authors: Tarek Gara
Affiliations: Independent researcher
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
Comments: 21 pages, 1 figure, 7 tables. Dataset, rubric, and interactive tools: this https URL

Abstract:Most tools for measuring political positions, manifesto coding, expert surveys, text-scaling models, were built and validated on Western party systems, and outside that setting they work poorly, and often not at all. This paper is an attempt at a method for those settings. It treats a large language model not as a measurement device but as a single, fallible rater in a panel, roughly the way an expert survey treats one expert: the value comes from pooling many judges rather than trusting any one of them. I describe the panel, an applicability rule that keeps a score of zero distinct from a blank, and a lens system that separates what an actor says from what it does. I report three results. First, holding a definition-free round fixed, adding written axis definitions moves scores by a mean of 1.8 points on a 21-point scale and tightens agreement between raters (mean absolute gap 2.81 to 2.50; r 0.81 to 0.89); they make two independent raters agree more closely, which an arbitrary steer would not. Second, across nine models from eight laboratories in two countries, Krippendorff’s alpha is 0.86 on both an interval and an ordinal metric, and it stayed put as the panel grew from five raters to nine. That is reliability, the reproducibility of a reading, and not validity, its correctness. Third, where the panel does disagree, the disagreement is informative: the sharpest split, a full-scale divergence on an actor’s stance toward its state’s foundational order, points to a referent problem, and a blind triple-coding puts about two-thirds of it down to interpretation rather than error. I try to be plain about what the method can’t do, including the human validation it still lacks, and I release the instrument and data in full. The worked example is the Middle East and North Africa, but I’d expect the method to carry to any region these standard tools leave out.
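The agreement statistics named in the abstract can be illustrated with a small sketch. It computes Krippendorff's alpha with the interval metric for the simplest case, where every rater scores every unit, plus the mean absolute gap between two raters. The panel scores are made up for illustration, and the paper's own computation (for example, its handling of blanks under the applicability rule) may differ.

```python
from itertools import permutations


def krippendorff_alpha_interval(ratings):
    """Interval-metric alpha for complete data.

    ratings: one list per rated unit, each holding the scores that the same
    m raters gave to that unit (no missing values).
    """
    m = len(ratings[0])
    if m < 2 or any(len(unit) != m for unit in ratings):
        raise ValueError("every unit needs scores from the same m >= 2 raters")
    n_values = len(ratings) * m

    # Observed disagreement: squared differences between raters within a unit.
    d_o = sum((a - b) ** 2 for unit in ratings for a, b in permutations(unit, 2))
    d_o /= n_values * (m - 1)

    # Expected disagreement: squared differences between all pooled scores.
    pooled = [score for unit in ratings for score in unit]
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2))
    d_e /= n_values * (n_values - 1)
    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - d_o / d_e


def mean_absolute_gap(rater_a, rater_b):
    """Average |a - b| over items scored by two raters."""
    return sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / len(rater_a)


# Hypothetical panel: 3 actors scored by 3 raters on a -10..10 axis.
panel = [[4, 5, 4], [-6, -7, -6], [0, 1, 0]]
alpha = krippendorff_alpha_interval(panel)
print(alpha)
print(mean_absolute_gap([4, -6, 0], [5, -7, 1]))
```

As the abstract stresses, a high alpha only says the raters reproduce each other's readings; it says nothing about whether those readings are correct.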

[NLP-44] Have You Ever Seen Them? Entity-level Membership Inference through Interrogating Large Language Models

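【Quick Read】: This paper addresses the privacy-leakage and copyright-compliance risks of large language models (LLMs). Existing membership inference work asks whether specific samples or sample-based data units were used in training, and overlooks a model's implicit knowledge about real-world entities. The authors propose entity-level membership inference, which determines whether information related to a target entity was used in training. The key observation is that LLMs behave somewhat like human memory: a model may not memorize a sample verbatim, yet it can accumulate and reveal knowledge about an entity from scattered mentions. The task is studied in the label-only black-box setting, where only generated text is observable. The authors formalize it under clue, input, and model constraints, establish necessary and sufficient conditions for feasibility, and instantiate five interrogation strategies that build prompts from limited entity clues, elicit entity-related responses, and infer membership from semantic features among the generated texts. On person entities, the methods reach an AUC of up to 0.97 and improve Balanced Accuracy by 6.0% to 17.5% over the best adapted baseline.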

Link: https://arxiv.org/abs/2606.23030
Authors: Yiran Zhu, Ziqi Yang
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) raise growing concerns about privacy leakage and copyright compliance. Membership inference is a key tool for assessing such risks, but existing studies mainly focus on whether specific samples or sample-based data units are used for training. We argue that LLMs exhibit a human-memory-like behavior: an LLM may not memorize a specific sample verbatim, yet it can accumulate and reveal knowledge about a real-world entity from scattered mentions. This analogy motivates us to examine whether an LLM can be interrogated like a human interviewee to reveal its exposure to entity-related information. Motivated by this question, we propose entity-level membership inference, which determines whether information related to a target entity is used in LLM training. We study this task in the practical label-only black-box setting, where only generated texts are observable. We formalize the task under clue, input, and model constraints, establish the necessary and sufficient conditions for its feasibility, and instantiate five interrogation strategies based on this formalization. The strategies use limited entity clues to construct prompts, elicit entity-related responses, and infer membership from semantic features among the generated texts. We construct entity-level datasets and adapt state-of-the-art sample-level label-only methods to the entity-level setting as baselines. Experiments on person entities show that our methods achieve AUC up to 0.97 and bring gains of 6.0%–17.5% in Balanced Accuracy over the best adapted baseline.

[NLP-45] Machine Translation and Post-Editing: Comparative Evaluation of Different MT Systems and Post-Editor Groups in Specialised Translation

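【Quick Read】: This paper evaluates the quality of machine translation (MT) and post-editing (PE) for specialised translation from English into French, comparing different MT systems and two groups of post-editors (linguists/translators and NLP experts). The key to the approach is error annotation based on an error typology adapted to MT and PE evaluation, which is used to compare three MT systems: DeepL, eTranslation, and Systran. The results show significant differences between the systems and between the post-editor groups, particularly in terminological accuracy and fluency. The study highlights the importance of domain knowledge in specialised translation, as well as the limitations and variable performance of current MT systems for language for specific purposes (LSP).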

Link: https://arxiv.org/abs/2606.23002
Authors: Joachim Minder, Alexandra Mestivier, Natalie Kübler
Affiliations: ALTAE (URP 3967); CLILLAC-ARP (EA_3967)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This article aims to evaluate the quality of machine translation (MT) and post-editing (PE) in the context of specialised translation from English into French. Three MT systems (DeepL, eTranslation and Systran) were compared, and two groups of post-editors (linguists/translators and NLP experts) were asked to perform post-editing. Translation assessment is based on error annotation using an error typology adapted to MT and PE evaluation. The results reveal significant differences between the three MT systems and the two groups of post-editors, particularly in terms of terminological accuracy and fluency. This study highlights the importance of domain knowledge in specialised translation, as well as the limitations and variable performance of MT systems in language for specific purposes (LSP).

[NLP-46] Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

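【Quick Read】: This paper addresses reward sparsity and delay in long-horizon agentic reinforcement learning (RL), together with the coarse credit assignment of existing step-level frameworks, which still treat agent exploration as isolated, linear trajectories. That view ignores the inherent graph structure of state transitions and leads to high-variance state-value estimates and myopic, localized credit assignment. The proposed Group-Graph Policy Optimization (G2PO) explicitly transforms linear interaction trajectories into a global state-transition graph. It aggregates identical observations across trajectories to obtain a group-aggregation state-value estimate that reduces sampling variance and trajectory-dependent bias. It also redefines agent actions as edges between state nodes and standardizes Temporal Difference (TD) errors globally across the graph, yielding an edge-centric advantage estimate that identifies and prioritizes the transitions that drive task progress. On the long-horizon benchmarks WebShop, ALFWorld, and AppWorld, G2PO outperforms state-of-the-art prompt-based and RL baselines, with success-rate improvements of up to 22.2% over GRPO.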

Link: https://arxiv.org/abs/2606.22995
Authors: Yunan Wang, Minghui Song, Zihan Zhang, Shaohan Huang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Affiliations: Peking University; Microsoft Corporation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Group-based Reinforcement Learning (RL) has significantly enhanced Large Language Models (LLMs) in agentic scenarios. To achieve finer-grained policy updates, recent agentic RL frameworks have shifted from trajectory-level to step-level training. However, long-horizon agentic RL suffers from severe reward sparsity and delay, as feedback is often deferred for dozens of interaction steps. While existing step-level frameworks refine training granularity, their credit assignment remains coarse-grained and still treats agent exploration as isolated, linear trajectories. This oversimplified perspective ignores the inherent graph structure of state transitions, leading to high-variance state-value estimation and myopic, localized credit assignment. To overcome these critical bottlenecks, we propose Group-Graph Policy Optimization (G2PO), a novel group-based RL algorithm tailored for multi-turn agentic tasks. G2PO explicitly transforms linear interaction trajectories into a global state-transition graph. By aggregating identical observations across different trajectories, we introduce group-aggregation state-value estimation that reduces sampling variance and trajectory-dependent bias. Furthermore, we redefine agent actions as transitions between state nodes and propose an edge-centric advantage estimation strategy. By globally standardizing Temporal Difference (TD) errors across the entire graph, G2PO explicitly identifies and prioritizes critical transitions that drive absolute task progress. Extensive experiments on representative long-horizon benchmarks (WebShop, ALFWorld, and AppWorld) demonstrate that G2PO substantially outperforms state-of-the-art prompt-based and RL baselines, achieving remarkable success rate improvements of up to 22.2% over GRPO.
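To make the two ideas in the abstract concrete, here is a rough sketch of pooling returns over identical observations and standardizing TD errors over graph edges. It assumes each trajectory is a list of hashable observations with a single terminal reward, no intermediate rewards, and a discount `gamma`; these simplifications and the function names are assumptions, and the paper's exact estimators are not given in the abstract.

```python
from collections import defaultdict
from statistics import mean, pstdev


def state_values(trajectories, gamma=1.0):
    """Average discounted return per observation, pooled over all trajectories."""
    returns = defaultdict(list)
    for observations, final_reward in trajectories:
        last = len(observations) - 1
        for t, obs in enumerate(observations):
            returns[obs].append(final_reward * gamma ** (last - t))
    return {obs: mean(vals) for obs, vals in returns.items()}


def edge_advantages(trajectories, gamma=1.0):
    """TD error per graph edge, standardized over all edges in the graph."""
    values = state_values(trajectories, gamma)
    td_errors = {}
    for observations, _ in trajectories:
        for cur, nxt in zip(observations, observations[1:]):
            # No intermediate reward in this toy setting.
            td_errors[(cur, nxt)] = gamma * values[nxt] - values[cur]
    errors = list(td_errors.values())
    mu, sigma = mean(errors), pstdev(errors)
    if sigma == 0:
        return {edge: 0.0 for edge in td_errors}
    return {edge: (err - mu) / sigma for edge, err in td_errors.items()}


# Two hypothetical rollouts that share a prefix and then diverge.
trajectories = [
    (["start", "search", "item", "buy"], 1.0),
    (["start", "search", "wrong", "end"], 0.0),
]
advantages = edge_advantages(trajectories)
for edge, adv in advantages.items():
    print(edge, round(adv, 3))
```

In this toy graph the shared prefix edge gets zero advantage, while the two edges leaving `search` receive the only non-zero scores, which is the kind of critical-transition signal the abstract describes.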

[NLP-47] Predicate Importance Estimation and Decoupled Rationale-Score Distillation for Entity Alignment

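【Quick Read】: This paper addresses Entity Alignment (EA) in industrial knowledge-graph retrieval-augmented generation (KG-RAG) systems, which must integrate public and domain-specific knowledge graphs built from heterogeneous databases. Lexical matching alone is insufficient when predicate names vary and local neighborhoods are incomplete. The solution has two complementary modules. Predicate Importance Estimation (PIE) is an embedding-based approach that removes the subject from each 1-hop triple, encodes the resulting subjectless triples, and aggregates them with learnable predicate-importance weights into predicate-aware entity embeddings. Decoupled Rationale-Score Distillation (DRSD) trains a distilled small language model (SLM) on pseudo-answers produced by a teacher LLM through distinct prompts; it converts binary EA labels into text-based supervision and decouples confidence-score estimation from label-consistent rationales, so the SLM learns task-specific reasoning while retaining a less label-biased confidence signal. Experiments show that PIE and DRSD improve EA classification. Because DRSD decouples the confidence score from the decision, a disagreement between the two flags an uncertain prediction for human review, which supports a practical split between automatic acceptance and human-in-the-loop verification.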

Link: https://arxiv.org/abs/2606.22992
Authors: Keunha Kim, Yoonjin Jang, Hyeon-gu Lee, Sihyung Kim, Youngjoong Ko
Affiliations: Sungkyunkwan University; NAVER
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 10 figures

Abstract:Knowledge graphs (KGs) are increasingly used as structured context for Large Language Models (LLMs), but industrial KG-RAG systems often need to integrate public and domain-specific KGs constructed from heterogeneous databases. This integration relies on Entity Alignment (EA), where lexical matching alone is insufficient under predicate-name variation and incomplete local neighborhoods. We address EA for KG integration by constructing a pairwise EA dataset and proposing two complementary modules: Predicate Importance Estimation (PIE) and Decoupled Rationale-Score Distillation (DRSD). PIE is a compact embedding-based approach that removes the subject information from each 1-hop triple, encodes the resulting subjectless triples, and aggregates them with learnable predicate-importance weights to build predicate-aware entity embeddings. DRSD trains a distilled small language model (SLM) with pseudo-answers produced by a teacher LLM through distinct prompts. By converting binary EA labels into text-based supervision and decoupling confidence-score estimation from label-consistent rationales, DRSD enables the SLM to learn task-specific reasoning while retaining a less label-biased confidence signal. Experiments show that PIE and DRSD improve EA classification. Moreover, because DRSD decouples confidence-score estimation from the decision, a discrepancy between the two flags an uncertain prediction for human review, thereby enabling a practical discrepancy between automatic acceptance and human-in-the-loop verification.
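The PIE aggregation step can be pictured as a weighted sum of triple embeddings. The sketch below assumes the predicate-importance weights are per-predicate logits normalized with a softmax; the abstract only says the weights are learnable, so that normalization, the fixed logit values, and the function name are all assumptions.

```python
import math


def predicate_aware_embedding(triples, predicate_logits):
    """Weighted sum of subjectless-triple embeddings for one entity.

    triples: list of (predicate, vector) pairs from the entity's 1-hop
    neighborhood, with the subject already removed before encoding.
    predicate_logits: importance logit per predicate (learned in practice).
    """
    logits = [predicate_logits[pred] for pred, _ in triples]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(triples[0][1])
    return [
        sum(w * vec[i] for w, (_, vec) in zip(weights, triples))
        for i in range(dim)
    ]


# Hypothetical 2-d embeddings for two subjectless triples.
triples = [("birthPlace", [1.0, 0.0]), ("label", [0.0, 1.0])]
uniform = predicate_aware_embedding(triples, {"birthPlace": 0.0, "label": 0.0})
skewed = predicate_aware_embedding(triples, {"birthPlace": 2.0, "label": 0.0})
print(uniform)
print(skewed)
```

With equal logits the result is a plain average; raising one predicate's logit pulls the entity embedding toward the triples that use it.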

[NLP-48] StatABench: Dataset and Framework for Evaluating Statistical Analysis Capabilities of LLMs

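【Quick Read】: This paper addresses the limited scope and format of existing benchmarks for the statistical analysis capabilities of large language models (LLMs). It introduces StatABench, which has two complementary components: Stat-Closed, with 404 questions across 18 statistical topics in multiple formats (multiple-choice, fill-in-the-blank, decision-making, and practical application), and Stat-Open, with 30 complex open-ended modeling tasks adapted from professional competitions. Models are evaluated using the LangChain MCP framework and multiple data science agents, and Stat-Open solutions are scored with a validated LLM-as-Judge protocol. Even GPT-5.1 reaches only 68.6% on Stat-Closed, the best open-source model reaches 60.6%, and the top agent framework averages 61.86 on Stat-Open. The results expose a gap between current LLMs and reliable statistical analysis in tool-grounded reasoning, methodological decision-making, and end-to-end statistical modeling.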

Link: https://arxiv.org/abs/2606.22977
Authors: Youxin Zhu, Yixuan Ding, Peng Lai, Longyue Wang, Bingyi Jing, Guanhua Chen
Affiliations: Southern University of Science and Technology; Alibaba Group; The Chinese University of Hong Kong, Shenzhen; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Statistical analysis is a broad, complex field requiring both domain knowledge and tool proficiency. While prior work has evaluated large language models (LLMs) in this domain, existing benchmarks remain limited in scope and format. To bridge this gap, we introduce StatABench (Statistical Analysis Benchmark), a benchmark designed to systematically assess LLMs’ statistical analysis capabilities. StatABench comprises two complementary components: Stat-Closed, containing 404 questions across 18 statistical topics in multiple formats (multiple-choice, fill-in-the-blank, decision-making, and practical application), and Stat-Open, featuring 30 complex open-ended modeling tasks adapted from professional competitions. We evaluate diverse LLMs using the LangChain MCP framework and multiple data science agents, and assess Stat-Open solutions via a validated LLM-as-Judge protocol. Experiments show that even GPT-5.1 achieves only 68.6% on Stat-Closed, while the best open-source model reaches 60.6%. On Stat-Open, the top agent framework scores 61.86 on average. These results reveal the gap between current LLMs and reliable statistical analysis, highlighting persistent challenges in tool-grounded reasoning, methodological decision-making, and end-to-end statistical modeling.

[NLP-49] Understanding Parallel Samplers in Masked Diffusion via Random Walks on Graphs

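【Quick Read】: This paper studies how to evaluate and design parallel sampling strategies for masked diffusion models (MDMs) in a setting where the output can be verified and measured. The authors use random walks on graphs as a controllable sandbox: an MDM is trained on random-walk samples from a fixed graph, and the graph and its transition kernel, never shown to the model, act as latent structure in the sequences. This allows a Sudoku-like validity check (is the output a valid walk?) and a measure of distribution fidelity (estimate the Markov kernel from the sampled walks). Using simple graphs, the authors prove that parallel unmasking with widely used scores such as lowest entropy is not uniformly better than a random parallel sampler; performance depends critically on the structure of the underlying graph. They also develop a bisection sampler for random walks that takes a number of steps logarithmic in the sequence length and is provably exact under perfect training. Experiments show that different samplers win on different graphs, and initial experiments on a pretrained OpenWebText MDM show that bisection-style samplers improve the speed-quality tradeoff for language generation. Together, the results position graph random walks as a mechanistic benchmark for diagnosing and designing parallel samplers.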

Link: https://arxiv.org/abs/2606.22976
Authors: Vansh Bansal, Cho Cholyeon, Syamantak Kumar, Sujay Sanghavi, Purnamrita Sarkar
Affiliations: UT Austin SDS; UT Austin CS; UT Austin ECE
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we propose using random walks on graphs as a verifiable sandbox to study different parallel sampling strategies in masked diffusion models (MDMs). We train an MDM on random walk samples from a fixed graph. The graph or the transition kernel is never shown to the model explicitly and plays the role of latent structure in the sequences, albeit one that is controllable and can be used for quantitative evaluation. Thus, this framework enjoys a Sudoku-like validity check: verifying that an output is a valid walk and estimating the Markov kernel from the walks to measure distribution fidelity. Using simple graphs, we theoretically prove that parallel unmasking via widely used scores like lowest entropy is not uniformly better than a random parallel sampler; the performance critically depends on the structure of the underlying graph. We develop a new bisection sampler for random walks, which takes logarithmic steps in the sequence length and is provably exact under perfect training. Experiments on various graph walk tasks show that different parallel samplers are better for different graphs even in practice. Our initial experiments on a pretrained OpenWebText MDM show that the bisection-style samplers improve speed-quality tradeoffs even for language generation. Together, these results position graph random walks as a mechanistic benchmark for diagnosing and designing parallel samplers for masked diffusion models.
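The two checks named in the abstract, walk validity and kernel estimation, are easy to sketch. The example assumes the graph is given as an adjacency dictionary and that sampled sequences are lists of node ids; the tiny path graph and the function names are illustrative only.

```python
from collections import Counter, defaultdict


def is_valid_walk(walk, adjacency):
    """True if every consecutive pair of nodes is an edge of the graph."""
    return all(nxt in adjacency.get(cur, ()) for cur, nxt in zip(walk, walk[1:]))


def estimate_kernel(walks):
    """Empirical transition probabilities P(next | current) from sampled walks."""
    counts = defaultdict(Counter)
    for walk in walks:
        for cur, nxt in zip(walk, walk[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nxt_counts.values()) for nxt, c in nxt_counts.items()}
        for cur, nxt_counts in counts.items()
    }


# Path graph 0 - 1 - 2 and two hypothetical sampler outputs.
adjacency = {0: {1}, 1: {0, 2}, 2: {1}}
walks = [[0, 1, 2, 1], [1, 0, 1, 2]]
print([is_valid_walk(w, adjacency) for w in walks])
print(is_valid_walk([0, 2], adjacency))  # 0 and 2 are not adjacent
kernel = estimate_kernel(walks)
print(kernel)
```

Comparing the estimated kernel with the graph's true transition probabilities gives a distribution-fidelity measure, which is how the abstract describes evaluating samplers.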

[NLP-50] Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

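【Quick Read】: This paper addresses a core problem in context management for long-horizon agents: task-critical information such as the plan is written early, used for many steps, and evicted first when the context window is finite. The finding is that standard LLM agents depend on the plan remaining in context rather than carrying it forward as persistent state. To diagnose this, the authors introduce replay pairing, which runs the same trajectory with and without the plan in history and measures the hidden-state cosine distance. On Llama-3.1-70B, the plan signal spikes to 0.453 one step after the plan and then falls 4.1x in a single action-observation step; on HotpotQA it falls 12.4x. The paper also identifies the reasoning-trace confound: the think traces of reasoning models re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. Strict stripping, which removes prior think blocks from the stripped run only, recovers +163% of the step+1 signal in-sample and +153% held out, without meaningfully changing non-reasoning Llama (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748, while R1-specific probes reach 1.000, which suggests R1 encodes the plan signal in a different hidden-state direction. Finally, a compression stress test shows that naive plan eviction cuts ALFWorld success by 34.7 percentage points, and probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent; context management is load bearing, but plan protection alone is not enough.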

Link: https://arxiv.org/abs/2606.22953
Authors: Aman Mehta, Anupam Datta
Affiliations: Snowflake AI Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 8 figures

Abstract:Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so tasks can continue beyond finite windows. That is safe only when dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures hidden-state cosine distance. On Llama-3.1-70B, plan signal spikes to 0.453 one step after the plan, then falls 4.1x in a single action-observation step; HotpotQA falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state, and instead depend on the plan remaining in context. A layer-L32 probe detects this decay as a diagnostic, not as proof that it reads plan content itself. Reasoning models add a measurement confound: their think traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior think blocks from the stripped run only. It recovers +163% of the step+1 signal in-sample and +153% held out, while not meaningfully changing non-reasoning Llama (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach 1.000, suggesting R1 encodes plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, while probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load bearing, but plan protection alone is not enough.
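The replay-pairing measurement reduces to a per-step cosine distance between two runs. The sketch below assumes the hidden states of the with-plan and without-plan runs are already available as plain vectors, one per step; the toy vectors and the `plan_signal` name are hypothetical, and extracting real hidden states from a model is out of scope here.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def plan_signal(with_plan, without_plan):
    """Per-step distance between paired hidden states of the two replays."""
    return [cosine_distance(h_w, h_wo) for h_w, h_wo in zip(with_plan, without_plan)]


# Toy hidden states for three steps of the same trajectory.
with_plan = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
without_plan = [[0.0, 1.0], [1.0, 0.9], [0.0, 1.0]]
signal = plan_signal(with_plan, without_plan)
print([round(s, 3) for s in signal])
```

A signal that is large right after the plan and shrinks toward zero a step later would match the decay pattern the abstract reports.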

[NLP-51] Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails

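【Quick Read】: This paper addresses the high computational cost that limits deployment of large language models (LLMs) in resource-constrained environments, and asks when Knowledge Distillation (KD) is effective in the post-training stage that builds general instruction-following models. Using the large-scale Tulu 3 dataset, the authors find that KD outperforms supervised fine-tuning (SFT) in low-data regimes, but that its advantage diminishes as more training data is added. Distilling from a stronger instruction-tuned teacher restores substantial gains even with abundant data, which indicates that KD remains effective when the teacher provides knowledge the student cannot easily acquire from the training data alone. For domain-specific, low-resource scenarios, the key proposal is a two-stage KD strategy: distill on synthetic teacher-labeled data first, then refine on human annotations. This consistently improves student performance and offers practical guidance for building compact models when data is scarce.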

Link: https://arxiv.org/abs/2606.22942
Authors: Xin Liu, Simin Ma, Shujian Liu, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Lu Wang, Kaiqiang Song
Affiliations: University of Michigan; Zoom Video Communications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) achieve strong performance across many tasks, but their high computational cost limits deployment in resource-constrained environments. Knowledge Distillation (KD) offers a practical solution by transferring knowledge from a teacher model of a larger size to a smaller student model. While prior work has mainly examined task-specific or small-scale settings, the post-training stage for building general instruction-following models has received limited attention. In this paper, we conduct a systematic study of KD in post-training using the large-scale Tulu 3 dataset. We find that KD outperforms supervised fine-tuning (SFT) in low-data regimes, but its advantage diminishes as more training data is added. Distilling from a stronger instruction-tuned teacher restores substantial gains even with abundant data, indicating that KD remains effective when the teacher provides knowledge that the student cannot easily acquire from the training data alone. We further study domain-specific, low-resource scenarios and propose a two-stage KD strategy that leverages synthetic teacher-labeled data followed by refinement on human annotations. This method consistently improves student performance, providing practical guidance for building compact models in data-scarce environments.
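The abstract does not state which distillation objective is used, so the sketch below shows only the classic logit-matching loss as a reference point: the KL divergence from the temperature-softened teacher distribution to the student's, scaled by the squared temperature. Treat it as a textbook illustration under that assumption, not as the paper's training setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]
matching = kd_loss([2.0, 0.5, -1.0], teacher)
mismatched = kd_loss([-1.0, 0.5, 2.0], teacher)
print(matching, mismatched)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.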

[NLP-52] Cross-lingual Retrieval-Augmented Classification for Dysarthria Severity Assessment INTERSPEECH2026

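[Quick read]: This paper addresses the performance bottleneck in automatic dysarthria severity assessment caused by the scarcity of labeled pathological speech data. It proposes Cross-lingual Retrieval-Augmented Classification (CRAC), an align-retrieve-fuse pipeline that uses speech from a different language to improve classification in the target language. Supervised contrastive learning first shapes a severity-focused embedding space; a vector database is then built from the opposite-language corpus; during both training and inference, the classifier retrieves the top-k references from the aligned space and fuses them with the input via cross-attention. On Korean post-stroke and Italian amyotrophic lateral sclerosis (ALS) dysarthria datasets, CRAC reaches balanced accuracies of 87.3% and 86.7%, improving over monolingual baselines by 8.4 and 20.0 percentage points, respectively.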

Link: https://arxiv.org/abs/2606.22910
Authors: Taeyoung Jeong, Insung Lee, Du-Seong Chang, Myoung-Wan Koo
Institutions: Sogang University
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026

Abstract:Automatic dysarthria severity assessment is limited by the scarcity of labeled pathological speech data. To address this, we propose Cross-lingual Retrieval-Augmented Classification (CRAC), which leverages speech from a different language via an align-retrieve-fuse pipeline. Supervised contrastive learning first shapes a severity-focused embedding space, then a vector database is built from the opposite-language corpus. During both training and inference, the classifier retrieves top-k references from the aligned space and fuses them with the input via cross-attention. Evaluated on Korean post-stroke and Italian ALS dysarthria datasets under a speaker-independent three-class protocol, CRAC achieves balanced accuracies of 87.3% on Korean and 86.7% on Italian, improving over monolingual baselines by 8.4 and 20.0 percentage points, respectively.
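
The retrieve and fuse steps of the pipeline described above can be sketched in a few lines. This is a toy illustration under stated assumptions: embeddings are plain lists of floats that are already aligned, retrieval uses cosine similarity, and fusion is single-head attention without learned projections. The function names and the residual-style sum are hypothetical and are not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def retrieve_top_k(query, database, k):
    """Return the k (embedding, label) entries most similar to the query."""
    ranked = sorted(database, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]


def cross_attention_fuse(query, references):
    """Let the query attend over the retrieved embeddings (single head, no projections)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * r for q, r in zip(query, ref)) / scale for ref, _ in references]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    context = [sum(w * ref[i] for w, (ref, _) in zip(weights, references))
               for i in range(len(query))]
    # Assumed fusion: input embedding plus attended context (residual-style sum).
    return [q + c for q, c in zip(query, context)]
```

In the paper's setting the database would hold embeddings from the opposite-language corpus, and the fused vector would feed a three-class severity classifier.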

[NLP-53] Explanation-Guided Medical Named Entity Recognition with Stability and Boundary Awareness for Atopic Dermatitis

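[Quick read]: This paper addresses the limited reliability and robustness of medical named entity recognition (NER) in Chinese atopic dermatitis (AD) clinical texts, in particular unstable model explanations and weak entity-boundary perception. It proposes an explanation-guided learning framework that is both stability-aware and boundary-aware: perturbation-based analysis evaluates explanation stability and entity-boundary sensitivity; an adaptive fusion strategy dynamically combines local and global explanation signals to produce more reliable token-level explanations; and the fused signals are incorporated into training through stability, boundary-aware, and consistency constraints. The framework yields consistent performance gains across multiple NER models while improving explanation reliability, and it offers a practical, generalizable approach to explainable medical NER that can support downstream clinical decision-making and medical knowledge applications.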

Link: https://arxiv.org/abs/2606.22886
Authors: Xueguang Li (1), Di Lin (1), Xue Jiang (2), Yanxi Li (2), Yugang Chi (3)
Institutions: (1) School of Information and Software Engineering, University of Electronic Science and Technology of China, Sichuan, China; (2) Department of Dermatology, Chongqing Traditional Chinese Medicine Hospital, Chongqing, China; (3) Chongqing Health Center for Women and Children, Chongqing, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Corresponding author: Xue Jiang, E-mail: xuejiang1025@126.com

Abstract:Objective: This study aims to improve the reliability and robustness of medical named entity recognition (NER) in Chinese atopic dermatitis (AD) clinical texts through explanation-guided learning. Methods: We propose a stability and boundary-aware explanation-guided NER framework. Perturbation-based analysis is used to evaluate explanation stability and entity boundary sensitivity. An adaptive fusion strategy dynamically combines local and global explanation to generate more reliable token-level explanations. The fused explanation signals are further incorporated into model training through stability, boundary-aware, and consistency constraints. Results: Experiments on Chinese AD NER datasets show that the proposed framework improves explanation robustness and achieves consistent performance gains across multiple NER models. The adaptive fusion strategy also provides more stable explanations and stronger boundary perception than individual explanation methods. Conclusion: The proposed method effectively integrates reliable explanation signals into medical NER training, improving both recognition performance and explanation reliability. The framework provides a practical and generalizable solution for explainable medical NER and offers reliable support for downstream clinical decision-making and medical knowledge applications.

[NLP-54] DynamicMem: A Long-Horizon Memory Benchmark in Real-World Settings

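[Quick read]: This paper addresses the challenge of maintaining a user profile over long-term use of LLM agents: tracking a user's attributes, habits, and preferences accurately and updating them as they change. Existing benchmarks rely on short, simplified interactions and miss three core properties of real behavior: the profile is heterogeneous, with attributes, habits, and preferences evolving on different timelines; changes are driven by external context such as seasons and life events; and evidence is rarely stated explicitly but is scattered across fine-grained behavioral signals in many applications, so a memory system must infer it. The paper introduces DynamicMem, a synthetic benchmark that constructs 15 months of multi-app activity per user across 16 applications such as e-commerce, fitness, and social platforms, averaging 2.2M tokens and 1,772 grounded events per user, so that profiles evolve without exposing real users' private data. Evaluation at five quarterly checkpoints tracks how systems behave as history grows and exposes problems that a single accuracy score hides: (i) profile reconstruction degrades with history length while service-task accuracy stays flat, even though both draw on the same memory; (ii) no system both keeps facts that stay true and replaces facts that change, with errors clustering on preferences and on naming the exact referent; and (iii) over 93% of failures trace to memory retrieval rather than to model generation, so the memory system itself is the main bottleneck. The implication is that progress depends on long-horizon memory that handles temporal change, fuses signals from multiple sources, and infers from implicit, scattered behavioral evidence.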

Link: https://arxiv.org/abs/2606.22877
Authors: Wenya Xie, Shengming Zhou, Zelin Li, Pouya Parsa, Shuang Zhou, Xinheng Ding, Chinmay Arvind, Guanchu Wang, Vladimir Braverman, Ali Payani, Yantao Zheng, Zirui Liu
Institutions: University of Minnesota; University of North Carolina at Charlotte; Johns Hopkins University; Cisco; Adobe
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:LLM agents increasingly act as personal assistants that must remember a user’s profile over months: who they are (attributes), what they routinely do (habits), and what they prefer (preferences), and keep it updated as jobs, routines, and tastes drift. Existing benchmarks evaluate this “memory” ability through short, simplified interactions, missing three core properties of real behavior: the profile is heterogeneous, with attributes, habits, and preferences evolving on different timelines; changes are driven by external context such as seasons and life events; and evidence is rarely stated explicitly, instead scattered across many small actions in different apps that a memory system must infer from. We introduce DynamicMem, a synthetic benchmark that constructs 15 months of activity per user, providing long-term multi-app data that real users’ privacy keeps out of reach. It provides user-consistent trajectories averaging 2.2M tokens and 1,772 grounded events per user across 16 applications such as e-commerce, fitness, and social platforms. The profile evolves over this period and is never given explicitly: each attribute, habit, or preference must be inferred from small signals scattered across apps. We evaluate at five quarterly checkpoints to track how systems scale as history grows. Benchmarking five representative systems exposes problems a single accuracy score hides: (i) profile reconstruction degrades with history length while service-task accuracy stays flat, despite both drawing on the same memory; (ii) no system both keeps facts that stay true and replaces facts that change, with errors clustering on preferences and on naming the exact referent; and (iii) over 93% of failures trace to what the memory retrieves, not to the model writing the answer, so the largest room for improvement lies in memory itself. Code: this https URL

[NLP-55] SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

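[Quick read]: This paper addresses dynamic adaptability in safety assessment for multimodal conversational systems: safety policies change at deployment time and differ across products, regions, and deployment stages, while existing guardrails rely on fixed taxonomies or target only a narrow set of interaction settings. It proposes SingGuard, a policy-adaptive multimodal guardrail model family that treats the active safety policy as a runtime input, checks the target content against it rule by rule, and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes, ranging from direct safety judgments to policy-grounded deliberation, and optimizes this behavior with fast-slow decoupled reinforcement learning. The paper also introduces SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning more than 80 fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases in which each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves the best average F1 in every family, and under runtime policy shifts its policy-following accuracy improves from 0.6465 to 0.7415.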

Link: https://arxiv.org/abs/2606.22873
Authors: SingGuard Team
Institutions: AI Security Lab, Ant Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while moderation policies may vary across products, regions, and deployment stages. Most existing guardrails either rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time. We present SingGuard, a policy-adaptive multimodal guardrail model family for safety assessment in multimodal conversations. SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation. We further optimize this behavior with fast–slow decoupled reinforcement learning. We also introduce SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning 80+ fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases where each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows improved policy-following accuracy from 0.6465 to 0.7415 under runtime policy shifts. Our code is available at this https URL.

[NLP-56] IndicGuard: A Multilingual Safety Guard Model and Dataset for Indic Languages

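[Quick read]: This paper addresses the insufficient safety alignment and cultural sensitivity of large language models (LLMs) for Indic languages. Current safety mechanisms are built mostly on English-centric frameworks and often fail to capture the socio-cultural sensitivities and localized categories of harm specific to the Indic region. The paper introduces IndicGuard, a multilingual safety guard model and accompanying dataset for Indic languages. The authors construct a high-volume, culturally nuanced safety dataset covering ten major Indic languages, curated to capture regional harms, sensitive socio-political contexts, and adversarial jailbreaks, and fine-tune a 4B-parameter instruction-tuned model based on Gemma-3-4B-IT to serve as a multilingual guardrail for real-time content moderation and policy compliance checking. According to the reported evaluations, IndicGuard improves robustness against localized vulnerabilities, moderates consistently across conversational turns, and outperforms the existing baseline CultureGuard across the evaluated languages. It also generalizes to low-resource Indic languages excluded from training, which supports the framework's cross-lingual transfer.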

Link: https://arxiv.org/abs/2606.22841
Authors: Parth Bramhecha, Smit Deshmukh, Sairaj Bodhale, Adwait Borate, Raviraj Joshi
Institutions: L3Cube-Labs, Pune
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:As Large Language Models (LLMs) achieve widespread integration across diverse linguistic landscapes, ensuring their safety and alignment with regional normative values remains a critical challenge. Current safety mechanisms are predominantly optimized for English-centric frameworks, often failing to capture the unique socio-cultural sensitivities and localized categories of harm inherent to the Indic region. To address this gap, we introduce IndicGuard, a multilingual safety guard model and dataset for Indic languages. We construct a high-volume, culturally nuanced safety dataset encompassing ten major Indic languages, systematically curated to capture regional harms, sensitive socio-political contexts, and adversarial jailbreaks. Leveraging this corpus, we fine-tune a 4B-parameter instruction-tuned model based on Gemma-3-4B-IT to serve as a multilingual safety guardrail for real-time content moderation and policy compliance checking. Our empirical evaluations demonstrate that IndicGuard significantly enhances LLM robustness against localized vulnerabilities, achieving high moderation consistency across different conversational turns. Crucially, IndicGuard consistently outperforms the existing baseline model, CultureGuard, across evaluated languages. Finally, we demonstrate that our model effectively generalizes to low-resource Indic languages excluded from training, substantiating the structural robustness and cross-lingual transfer capabilities of the framework.

[NLP-57] Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

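[Quick read]: This paper addresses the reliance of classical text-to-speech (TTS) systems on rigid input formats and predefined metadata slots, which limits their ability to serve flexible, natural-language user requests. It proposes Bagpiper-TTS, a universal speech synthesis system that takes a natural language prompt, reasons over the user's intent to derive a rich caption (a comprehensive textual blueprint containing both the transcription and nuanced metadata), and then uses that caption to guide synthesis of the target speech. Beyond classical TTS, the model supports multi-talker synthesis, intent-to-speech, role-play synthesis, singing voice synthesis, and other tasks. Bagpiper-TTS achieves a 1.7% word error rate (WER) on the Seed-TTS-Eval benchmark and matches dedicated models in both LLM-as-a-judge and human subjective evaluations across multiple applications.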

Link: https://arxiv.org/abs/2606.22811
Authors: Jinchuan Tian, Haoran Wang, Siddhant Arora, Takashi Maekaku, Keita Goto, Jin Sakuma, Yusuke Shinohara, Chao-Han Huck Yang, Shinji Watanabe
Institutions: Carnegie Mellon University; LY Corporation; NVIDIA Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Classical TTS systems typically rely on rigid input formats and predefined metadata slots, limiting their ability to fulfill flexible user requirements. This paper introduces Bagpiper-TTS, a universal speech synthesis system that deals with diverse natural language user requests. Given a natural language prompt, Bagpiper-TTS first reasons over the users’ intent to derive a rich caption, i.e., a comprehensive textual blueprint encompassing both transcription and nuanced metadata. Subsequently, this caption guides the synthesis of the target speech. Our model inherently supports a broad spectrum of tasks besides classical TTS applications, including multi-talker, intent-to-speech, role-play synthesis, singing voice synthesis, and more. Experimental results demonstrate that Bagpiper-TTS achieves a 1.7% Word Error Rate (WER) on the Seed-TTS-Eval benchmark and matches the performance of dedicated models in both LLM-as-a-judge and human subjective evaluations across multiple applications.

[NLP-58] KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking

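[Quick read]: This paper addresses the limited efficiency and flexibility of high-quality reranking in large-scale retrieval systems. Most existing rerankers, whether encoder-based or decoder-based, jointly encode the query and the passage, which tightly couples their computation and restricts deployment. The paper proposes KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling. Built on an encoder-decoder architecture, it uses the encoder to pre-encode passages with Matryoshka embedding pooling, has the decoder model the system instruction, user instruction, and query intent, and applies cross-attention to capture relevance between the query context and the passage representations. Passage encoding is therefore decoupled and efficient, while cross-attention preserves richer relevance modeling than late interaction. The model comes in three sizes, Nano, Small, and Large, with 0.27B, 1B, and 4B activated parameters. Experiments on BEIR, MIRACL, and LMEB show strong reranking performance: state-of-the-art results on BEIR, on par with strong industrial models such as the Qwen3-Reranker series; strong results on MIRACL despite limited multilingual training; and, on LMEB, a 0.27B Nano model that remains competitive with 7-12B embedding models.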

Link: https://arxiv.org/abs/2606.22807
Authors: Xinping Zhao, Jiaxin Xu, Ziqi Dai, Xin Zhang, Shouzheng Huang, Danyu Tang, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang
Institutions: Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute (SLAI)
Subjects: Computation and Language (cs.CL)
Comments: Technical Report; Work in Progress

Abstract:As retrieval systems scale, high-quality reranking becomes increasingly important. However, most existing rerankers, whether encoder-based or decoder-based, jointly encode the query and passage, tightly coupling their computation and limiting deployment efficiency as well as flexibility. We present KaLM-Reranker-V1, a fast but not late-interaction (FBNL) reranker that decouples query and passage computation while retaining expressive relevance modeling. Built on an encoder-decoder architecture, KaLM-Reranker-V1 uses the encoder to pre-encode passages with Matryoshka embedding pooling, while the decoder models the system instruction, user instruction, and query intent; cross-attention then captures relevance between the query context and passage representations. This design makes KaLM-Reranker-V1 efficient through decoupled passage encoding, yet not late interaction, by preserving rich relevance modeling through cross-attention. We instantiate KaLM-Reranker-V1 in three sizes, Nano, Small, and Large, with 0.27B, 1B, and 4B activated parameters, respectively. Extensive experiments on BEIR, MIRACL, and LMEB demonstrate that KaLM-Reranker-V1 achieves strong reranking performance with superior efficiency. On BEIR, KaLM-Reranker-V1 achieves state-of-the-art performance, on par with strong industrial models such as the Qwen3-Reranker series; on MIRACL, despite not being extensively trained on multilingual data, KaLM-Reranker-V1 still shows excellent reranking performance. Moreover, on LMEB, reranking models demonstrate a clear advantage, with even the 0.27B Nano model remaining competitive with 7-12B embedding models.
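
The decoupling idea above (passages encoded once, offline; the query side attending over cached passage vectors at rerank time) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: prefix truncation stands in for the paper's Matryoshka embedding pooling, attention has no learned weights, and `PassageCache` and `attention_score` are hypothetical names rather than the released model's API.

```python
import math


def truncate_and_normalize(vec, dim):
    """Matryoshka-style use of an embedding: keep the first `dim` coordinates, then L2-normalize."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


class PassageCache:
    """Holds passage vectors computed once, independently of any query."""

    def __init__(self, dim):
        self.dim = dim
        self.store = {}

    def add(self, passage_id, token_vectors):
        # Passage-side work happens here, before any query arrives.
        self.store[passage_id] = [truncate_and_normalize(v, self.dim) for v in token_vectors]


def attention_score(query_vec, passage_vectors):
    """Query attends over cached passage vectors; the attended similarity is the score."""
    sims = [sum(q * p for q, p in zip(query_vec, pv)) for pv in passage_vectors]
    m = max(sims)
    weights = [math.exp(s - m) for s in sims]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, sims))
```

At query time only the query vector is computed; each candidate is scored against its cached vectors, which is the efficiency argument the abstract makes for decoupled passage encoding.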

[NLP-59] Does the Same Token Mean the Same State? MoE Routing as Signal for Reasoning Control

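[Quick read]: This paper asks whether, in sparse Mixture-of-Experts (MoE) language models, the same output token id implies the same router state and the same activated experts. Holding the emitted token fixed at repeated anchors, the authors find that it does not: the experts that produce the token still separate task context, trajectory history, and reasoning-effort mode, so routing carries residual structure. Building on this, the paper proposes Routing Agreement Decoding (RAD), a multi-rollout selector that does not parse, normalize, execute, or vote over answer strings. RAD locates a fixed anchor, represents each rollout by its anchor-window MoE routing states, and returns the densest Weighted-Jaccard K-NN route-basin center. Across benchmarks spanning math, GPQA, and code, RAD is on par with majority voting where string voting is well-posed, and it remains applicable where no answer string exists to vote on, such as code generation and patch selection on SWE-bench Verified. Its value lies in providing one answer-string-free interface for selecting among multiple rollouts.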

Link: https://arxiv.org/abs/2606.22798
Authors: Kang Chen, Minshen Yu, Junjie Nian, Yaoning Wang, Yixin Cao, Yugang Jiang
Institutions: Fudan University; Shanghai AI Lab
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In sparse Mixture-of-Experts language models, does the same token id imply the same router state and the same experts producing it? Holding the emitted token id fixed at repeated anchors, we find it does not: the experts that produce it still separate task context, trajectory history, and reasoning-effort mode. This residual structure supports test-time control: near boundary anchors (the final-response transition) and delimiter anchors (which open the answer, e.g. \boxed{ or code fences), routing neighborhoods already align with final-answer basins at a marker-only readout and most strongly when the routing is read at the answer opening. We operationalize this as RAD (Routing Agreement Decoding), an answer-string-free multi-rollout selector: it locates a fixed anchor, represents each rollout by its anchor-window MoE routing states, and returns the densest Weighted-Jaccard K-NN route-basin center, without parsing, normalizing, executing, or voting over answer strings. Across 10 sparse-MoE configurations (gpt-oss, Qwen3-MoE) and 6 datasets spanning math, GPQA, and code, RAD is on par with Majority where string voting is well-posed, with small positive paired deltas (RAD 73.9 / RAD+DC 74.2 vs. Majority 73.6). Like majority voting, RAD is not a verifier: a dense wrong basin can still win. Its value is the interface: the same selector gives direct pass@1 on code, where exact-string voting is ill-defined, and the same routing-density principle, re-anchored to the agentic boundary, improves best-of-16 patch selection on SWE-bench Verified over random, where patches have no answer string to vote on.
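
The selection step described above can be sketched as follows. This is a minimal illustration, assuming each rollout's anchor-window routing state has already been summarized as a sparse mapping from expert id to non-negative routing weight, and assuming density is the mean similarity to the K nearest neighbours; the abstract does not give either detail, so both are assumptions, as is the name `select_densest`.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two sparse non-negative vectors (dicts)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def select_densest(rollouts, k=2):
    """Return the index of the rollout whose k nearest neighbours are most similar to it."""
    best_index, best_density = 0, -1.0
    for i, r in enumerate(rollouts):
        # Similarities to every other rollout, highest first.
        sims = sorted(
            (weighted_jaccard(r, other) for j, other in enumerate(rollouts) if j != i),
            reverse=True,
        )
        neighbours = sims[:k]
        density = sum(neighbours) / len(neighbours) if neighbours else 0.0
        if density > best_density:
            best_index, best_density = i, density
    return best_index
```

No answer string is parsed or compared at any point, which is why the same selector could in principle be applied to code or patches. As the abstract itself cautions, a dense but wrong cluster would still be selected.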

[NLP-60] Cross-National Information Attacks: A Two-Decade Analysis of Troll Behavior in Korea USENIX-SECURITY’26

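[Quick read]: This paper addresses the growing threat that coordinated foreign influence operations pose to online platforms, in particular the difficulty of detecting state-linked accounts that manipulate Korean news comment sections and of tracking how their activity evolves. It proposes an explainable, theory-guided machine learning framework that classifies comments along three dimensions: foreign origin, moral-emotional framing, and target country. The framework also extracts span-level textual evidence, giving human reviewers interpretable rationales for each decision. Applied to 112 million South Korean news comments from 4 million users over nearly 20 years, it identifies 23,998 accounts whose behavior is consistent with coordinated manipulation. These accounts rely mainly on morally condemning rhetoric rather than direct promotion of foreign-aligned narratives, and that rhetoric receives significantly higher user engagement. Among the highest-engagement comments, moral condemnation most often targets domestic political figures on both the left and the right, potentially amplifying polarization. The framework supports transparent platform governance through explainable, evidence-based moderation, and the observed patterns can help platforms and observatories prioritize defenses and intervene before harmful narrative-target combinations spread widely.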

Link: https://arxiv.org/abs/2606.22785
Authors: Jaehong Kim, Hyeonseung Kim, Jiseon Kim, Alice Oh, Thorsten Holz, Wonjae Lee, Meeyoung Cha
Institutions: Korea Advanced Institute of Science and Technology (KAIST); Max Planck Institute for Security and Privacy (MPI-SP)
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted at the 35th USENIX Security Symposium (USENIX Security '26)

Abstract:Coordinated foreign influence operations pose a growing threat to online platforms, but detecting state-linked troll activity and tracking its evolution remain challenging. This paper presents an explainable machine learning framework for theory-guided detection and longitudinal analysis of suspected trolling within Korean online news comment sections. Our hierarchical model classifies comments along three dimensions central to influence campaigns: foreign origin, moral-emotional framing, and target country. To support explainability, it also extracts brief span-level textual evidence that provides human-interpretable rationales. We apply the approach to 112M South Korean news comments authored by 4M users over nearly 20 years, identifying 23,998 accounts exhibiting behavior consistent with coordinated manipulation. Analyzing these accounts, we find that they predominantly rely on morally condemning rhetoric rather than direct promotion of foreign-aligned narratives; this rhetoric receives significantly higher user engagement. Among the highest-engagement comments, the moral condemnation most frequently targets domestic political figures (e.g., presidents or party leaders) on both the left and the right, potentially amplifying polarization. Our framework supports transparent platform governance through explainable, evidence-based moderation. These observed rhetorical and engagement patterns can inform how platforms and observatories prioritize defenses and intervene before harmful narrative-target combinations achieve widespread reach.

[NLP-61] Learning Moral Diversity: Modelling Individual Perspectives in Moral Classification of Texts

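[Quick read]: This paper addresses the subjectivity of moral judgments about social media text. Supervised NLP models typically aggregate multiple annotators' labels into a single "ground truth", which ignores the subjectivity and disagreement inherent in moral judgment. To account for annotator disagreement, which is especially common on short texts such as tweets, the paper extends a pretrained language model with a layer that learns annotator-specific features, so the model can capture individual annotators' moral perspectives. The approach improves predictions of individual annotations and yields representations that reveal annotators' moral perspectives. The authors also show that models trained on aggregated labels can hide variation and give a misleading impression of performance, whereas modelling individual perspectives reflects the subjective nature of the task and benefits moral classification of texts.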

Link: https://arxiv.org/abs/2606.22771
Authors: Yi Ren, Lewis Mitchell, Matthew Roughan
Institutions: Adelaide University
Subjects: Computation and Language (cs.CL)
Comments: Accepted at the Seventh Workshop on NLP and Computational Social Science. 12 pages, 7 figures

Abstract:Understanding moral values in social media text offers insight into moral judgement formation, and supervised NLP models trained on crowdsourced data have achieved strong classification performance. However, most approaches simplify the problem by aggregating multiple annotators’ labels into a single “ground truth”, overlooking the inherent subjectivity of the task. In practice, there are disagreements between annotators caused by personal viewpoint or inherent ambiguities, particularly for short tweets. Here, we extend a pretrained language model with a layer that learns annotator-specific features. Our model improves predictions of individual annotations and yields representations that reveal meaningful insights into annotators’ moral perspectives. We show that models trained on aggregated labels may hide variation and give a misleading impression of performance. Overall, we demonstrate that disagreement reflects the inherent subjectivity of the task and that modelling individual perspectives creates benefits for moral classification of texts.

[NLP-62] AI Fiction in the Wild

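[Quick read]: This paper examines how large language models (LLMs) are reshaping the production and consumption of fiction, in particular how readers use generative AI tools to create fiction themselves. Drawing on more than 500,000 anonymized English-language ChatGPT user conversations (arXiv:2405.01470), it finds that more than one third involve some form of fiction generation, including original stories, roleplay, fanfiction, and erotica. A central finding is a behavior pattern among power users that the authors call "infinite story demanders": users who repeatedly request and revise variations of the same or similar narratives over extended periods. Users gravitate especially toward fanfiction and erotica and are broadly drawn to generic forms, repetition, immediacy, and niche combinations of story elements. The findings motivate two theoretical provocations. First, generative AI may shift the conventional relationship between author and reader, producing a "solipsistic reader-writer" who both generates and consumes fiction within a closed conversational loop, interacting with a machine rather than another human. Second, the interactivity, play, and permutation that LLMs enable appear pleasurable for users, which raises questions about where generative AI fits in contemporary storytelling and entertainment ecosystems. The paper situates these developments within broader cultural transformations, including self-publishing, fanfiction, and pornography, and suggests that AI-generated fiction shares structural affinities with on-demand, personalized, and repetitive cultural forms.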

Link: https://arxiv.org/abs/2606.22748
Authors: Neel Gupta, Maria Antoniak, Melanie Walsh
Institutions: University of Washington; University of Colorado Boulder
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Presented at the MFS Cultural AI Conference, Purdue University, September 19, 2025. This essay is provisionally forthcoming in MFS: Modern Fiction Studies

Abstract:Some professional authors are beginning to use AI tools to help produce their fiction writing. Are readers using AI to generate fiction, too? This paper examines how large language models are reshaping the production and consumption of fiction by enabling new forms of participation in narrative generation. Drawing on over 500,000 anonymized, English-language ChatGPT-user conversations (arXiv:2405.01470), we find that more than one third of the conversations involve some form of fiction generation – including original stories, roleplay, fanfiction, and erotica. This AI-generated fiction is notably dominated by power users. We identify common fiction generation patterns and profiles among these users, including what we call “infinite story demanders,” who repeatedly request and revise variations of the same or similar narratives over extended periods of time. We show that users especially gravitate toward fanfiction and erotica, and that they are broadly drawn to generic forms, repetition, immediacy, and niche combinations of story elements. Our findings motivate two theoretical provocations. First, we argue that AI technologies may lead to a shift in the conventional relationship between the author and reader, potentially producing what we call a “solipsistic reader-writer,” who both generates and consumes fiction within a closed conversational loop, interacting with a machine rather than a human other. Second, we note that LLMs enable interactivity, play, and permutation in ways that are seemingly pleasurable for users, raising questions about where AI will fit into contemporary storytelling and entertainment ecosystems. We situate these developments within broader transformations in literature and media, including self-publishing, fanfiction, and pornography, and suggest that AI-generated fiction shares structural affinities with on-demand, personalized, and repetitive cultural forms.

[NLP-63] Language-Specific Sentiment Polarity Biases in Encoder and Large Language Model Classification of Product Reviews

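[Quick read]: This paper addresses systematic sentiment polarity biases in multilingual settings, that is, differences in how accurately AI models classify positive versus negative reviews across languages and model architectures. Large language models show a negative bias in French and are more accurate on negative reviews, while encoder models show a positive bias in Japanese and miss negative reviews that express criticism indirectly. The contribution is to expose these language-specific polarity biases, which matter for social and business applications that deploy multilingual sentiment analysis systems.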

Link: https://arxiv.org/abs/2606.22745
Authors: Advita Rajiv, Kavitha Kothur, Gautham Reddy
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 1 figure, 3 tables

Abstract:This study investigates sentiment polarity biases, specifically, differences in how accurately AI models classify positive versus negative reviews across languages and model architectures. Large language models show a negative bias in French and are more accurate on negative reviews, while encoder models exhibit positive bias in Japanese, missing negative reviews that use indirect criticism. These language-specific polarity biases have implications in both social and business domains deploying multilingual sentiment analysis systems.

[NLP-64] GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

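[Quick read]: This paper addresses how to verify that the evidence behind an LLM agent's decisions in real environments is genuine, traceable, and consistent with time and access constraints. Evaluation that uses an LLM as judge struggles to detect key failures: answers that rely on evidence the agent never retrieved, reasoning from a plausible but incorrect causal mechanism, and claims that information is absent made without checking. The paper proposes GroundEval, a judge-free evaluation framework that generates questions from a domain configuration, lets the agent choose how to answer, and scores both the final answer and the recorded trajectory that produced it. The benchmark has three tracks: Silence (did the agent check before claiming absence), Perspective (did it reason only from evidence available to the actor at the relevant time), and Counterfactual (did it use the correct causal mechanism rather than a merely plausible one). By pairing tool activity with the agent's turn-level narration in structured per-question diagnostics, GroundEval makes each score inspectable and exposes plausible answers that rest on invalid evidence paths, a blind spot of final-answer and judge-based scoring.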

Link: https://arxiv.org/abs/2606.22737
Authors: Jeffrey Flynt
Institutions: Jeffrey Flynt; University of Texas at Austin
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:Before letting an agent operate over real context, can you prove it used the right evidence? GroundEval turns that question into a deterministic test of what the agent searched, fetched, cited, and was permitted to access. In one case study, two frontier LLM judges scored a plausible agent response above 0.85. But the trace told a different story: the agent had never retrieved the artifact its answer depended on, yielding a GroundEval score of 0.000. We introduce GroundEval, a judge-free framework for evaluating agents against grounded, time-bounded, and access-controlled evidence. GroundEval uses a domain configuration to generate questions, lets the agent choose how to answer, and then scores both the final answer and the recorded trajectory that produced it. The benchmark targets three failures that LLM-as-judge evaluation struggles to detect: whether an agent checked before claiming absence, reasoned only from evidence available to the actor at the relevant time, and used the correct causal mechanism rather than a plausible one. These correspond to three tracks: Silence, Perspective, and Counterfactual. GroundEval exposes when plausible answers rest on invalid evidence paths, and produces structured per-question diagnostics that pair tool activity with the agent’s turn-level narration, making each score inspectable rather than merely reported. What our case studies turned up is that this gap isn’t some rare corner case. It’s exactly the blind spot that final-answer and judge-based scoring were never built to catch.
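
The kind of deterministic trace check described above can be sketched briefly. The code is a hypothetical illustration, not GroundEval's actual scoring: the trace schema (a list of steps with `tool` and `artifact_id` keys), the function names, and the fraction-based score are all assumptions made for the example.

```python
def score_answer(cited_ids, trace, permitted_ids):
    """Fraction of cited artifacts that were actually fetched and were permitted.

    Hypothetical rule: a citation only counts if the trace shows a fetch of
    that artifact and the actor was allowed to access it.
    """
    fetched = {step["artifact_id"] for step in trace if step["tool"] == "fetch"}
    if not cited_ids:
        return 0.0
    valid = [c for c in cited_ids if c in fetched and c in permitted_ids]
    return len(valid) / len(cited_ids)


def checked_before_claiming_absence(trace):
    """Silence-style check: an absence claim needs at least one search in the trace."""
    return any(step["tool"] == "search" for step in trace)
```

Under such a rule, an answer that cites an artifact the agent never fetched scores zero regardless of how plausible the text reads, which mirrors the case study in the abstract.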

[NLP-65] When Confidence Takes the Wrong Path: Diagnosing Retrieval-State Lock-In in RAG

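[Quick read]: This paper addresses misjudged trustworthiness in retrieval-augmented generation (RAG) systems caused by retrieval-state lock-in. Many black-box uncertainty methods read agreement among sampled answers as confidence, but when the retrieval state itself is defective (empty, or populated by a coherent but wrong context), the answers agree because the error is stable. The problem is recognized in deployed RAG but has lacked a name, a measurable signature, and a prevalence bound. The paper names the failure retrieval-state lock-in and diagnoses it by separating three objects that a single confidence score conflates: the answer surface, the retrieved evidence, and the retrieval state. In an ontology-guided knowledge-graph RAG (KG-RAG) system across six question-answering snapshots, at five samples per question, 42% of KG-RAG errors and 59% of dense-retrieval errors show zero answer dispersion, so agreement cannot rank them, while evidence and retrieval-state checks still flag most of them. The resulting auditable decision rule, which accepts an answer only when the answer, evidence, and retrieval checks all indicate low risk, reaches 91.9% pooled precision against a 69.7% accept-all rate, at the cost of coverage: only 7.7% of answers are certified as low-risk. On the clinical calibration domain the rule reaches 100% precision under an automated judge; this is an in-domain automated-label upper bound, not a clinical safety claim, and it still needs human validation. The paper argues that confidence in RAG is object-specific: when answers agree, the useful question is which part of the pipeline to distrust.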

Link: https://arxiv.org/abs/2606.22728
Authors: Sahib Julka
Institutions: LMU University Hospital, LMU Munich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The trustworthiness of a retrieval-augmented generation (RAG) system depends on more than the answer it returns, yet many black-box uncertainty methods still read agreement among sampled answers as confidence. That inference fails when repeated samples condition on the same defective retrieval state. The state may be empty, with the model falling back on parametric memory, or populated by a coherent but wrong neighbourhood. In either case, the answers agree because the error is stable. The problem is recognised in deployed RAG, but it has lacked a name, a measurable signature, and a prevalence bound. We supply all three. We name the failure retrieval-state lock-in and diagnose it by separating the three objects a single confidence score conflates: the answer surface, the retrieved evidence, and the retrieval state itself. In an inspectable, ontology-guided knowledge-graph RAG (KG-RAG) system across six question-answering snapshots, we measure the agreement blind spot directly: at five samples per question, 42% of KG-RAG errors and 59% of dense-retrieval errors carry zero answer dispersion, so agreement has nothing to rank, while evidence- and retrieval-state checks still flag most of them. The decomposition supports an auditable decision rule: accepting an answer only when answer, evidence, and retrieval checks all agree that it is low-risk reaches 91.9% pooled precision against a 69.7% accept-all rate. The cost is coverage: it certifies only 7.7% of answers as low-risk. On the clinical calibration domain it reaches 100% precision under an automated judge; this is an in-domain automated-label upper bound, not a clinical safety claim, and still needs human validation. Confidence in RAG is object-specific: when answers agree, the useful question is which part of the pipeline to distrust.
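
The agreement blind spot and the conjunctive decision rule described above can be sketched as follows. This is a simplified illustration: dispersion is taken to be one minus the share of the most common normalized answer, and the evidence and retrieval-state checks are passed in as booleans, because the abstract does not say how they are computed. The names `answer_dispersion`, `accept`, and `max_dispersion` are assumptions.

```python
from collections import Counter


def answer_dispersion(samples):
    """One minus the share of the most common answer; 0.0 means all samples agree."""
    counts = Counter(s.strip().lower() for s in samples)
    return 1.0 - counts.most_common(1)[0][1] / len(samples)


def accept(samples, evidence_ok, retrieval_ok, max_dispersion=0.0):
    """Accept only when the answer, evidence, and retrieval-state checks all pass."""
    return answer_dispersion(samples) <= max_dispersion and evidence_ok and retrieval_ok
```

Five identical samples give zero dispersion, so agreement alone would accept them; the rule still rejects the answer if the retrieval-state check fails, which is the lock-in case the paper describes.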

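[NLP-66] BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

[Quick Read]: This paper addresses the scarcity of evaluation data for LLMs in Portuguese, particularly for open-ended, discursive tasks that demand deeper reasoning and free-text generation. Existing benchmarks such as the original BLUEX cover only the multiple-choice questions of Brazilian university entrance exams and omit the harder second-phase exams, which require free-form written answers. To fill this gap, the authors introduce BLUEX v2, a benchmark built from the 2022-2025 second-phase entrance exams of two leading Brazilian universities, UNICAMP and USP. It contains 395 questions that unfold into 919 graded subquestions, and 55.7% of the questions include images. Each question is annotated with its subject area, the official reference answer, LLM-generated rubric criteria, and six cognitive capability tags. Evaluating 21 state-of-the-art LLMs under an LLM-as-a-judge protocol shows a 4.92-point spread across models (4.18-9.10 on a 0-10 scale), with Mathematical Reasoning and Image Understanding the hardest capability dimensions. The key contribution is a multimodal, structurally annotated Portuguese benchmark for open-ended generation, released together with the evaluation code and model outputs.

Link: https://arxiv.org/abs/2606.22723
Authors: João Guilherme Alves Santos,Giovana Kerche Bonás,Thiago Laitz,Thales Sales Almeida,Helio Pedrini
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 4 figures, 7 tables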

Abstract:Although Large Language Models (LLMs) excel in many tasks, their assessment in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. While the original BLUEX benchmark addressed the scarcity of Portuguese evaluation datasets through multiple-choice questions from Brazilian university entrance exams, it did not cover the more challenging second-phase examinations, which require free-form written responses. In this work, we introduce BLUEX v2, a benchmark derived from the second-phase entrance exams of Brazil’s two leading universities: UNICAMP (Comvest) and USP (Fuvest), spanning exam years 2022-2025. Our dataset comprises 395 questions unfolding into 919 graded subquestions, with 55.7% of questions containing associated images. Each question is annotated with subject area, official reference answers, LLM-generated rubric criteria, and six cognitive capability tags. We evaluate 21 state-of-the-art LLMs using an LLM-as-a-judge protocol. Results reveal a 4.92-point performance spread across models (4.18-9.10 on a 0-10 scale), with Mathematical Reasoning and Image Understanding emerging as the hardest capability dimensions. The dataset, evaluation code, and model outputs are publicly available at this https URL.

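[NLP-67] moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT

[Quick Read]: This paper addresses the lack of a strong pretrained encoder for Portuguese NLP that is optimized for long contexts. Encoder-only Transformer models remain essential in production NLP pipelines, especially for discriminative tasks such as information retrieval, document classification, and named entity recognition. The authors introduce moBERTo, a Portuguese model obtained by continued pretraining of the ModernBERT-base checkpoint on 60 billion tokens (5 epochs over a 12-billion-token corpus curated from FineWeb2 and filtered with educational and STEM classifiers). The key points are: (1) the original architecture is preserved, including rotary positional embeddings, alternating local-global attention, flash attention, and unpadding, which supports efficient long-sequence modeling; (2) the best variant combines a Portuguese tokenizer with subword-matching embedding transfer and a dedicated long-context post-training phase, although tokenizer adaptation on its own helps token-level tasks while degrading long-context retrieval; (3) ablations show that continued pretraining is strongly preferable to training from scratch and that long-context post-training at 8,192 tokens further improves reranking and NER. moBERTo achieves the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks and the best results on PLUE-PT, and the study finds that encoder-only architectures remain competitive with larger decoder-only models on discriminative tasks. The model weights and training data are released on Hugging Face.

Link: https://arxiv.org/abs/2606.22722
Authors: Thiago Laitz,Thales Sales Almeida,João Guilherme Alves Santos,Giovana Kerche Bonás
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: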

Abstract:Encoder-only transformer models remain essential for production NLP pipelines. We introduce moBERTo, a Portuguese adaptation of ModernBERT obtained through continued pretraining of the ModernBERT-base checkpoint on 60 billion tokens (5 epochs over a 12-billion-token corpus curated from FineWeb2 and filtered with educational and STEM classifiers). We preserve the original architecture, including rotary positional embeddings, alternating local-global attention, flash attention, and unpadding. We evaluate moBERTo across information retrieval (including long-context retrieval at up to 8,192 tokens), document classification, named entity recognition, and natural language understanding. Our best variant, which combines a Portuguese tokenizer with subword-matching embedding transfer and long-context post-training, achieves the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks and the best results on PLUE-PT. Through ablation studies, we show that (i) continued pretraining is strongly preferable to training from scratch, particularly for preserving long-context capabilities; (ii) tokenizer adaptation improves token-level tasks but degrades long-context retrieval; (iii) a dedicated long-context post-training phase at 8,192 tokens further improves reranking and NER; and (iv) encoder-only architectures remain competitive with larger decoder-only alternatives for discriminative tasks. We publicly release the model weights at this https URL and training data at this https URL on Hugging Face.

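[NLP-68] Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards

[Quick Read]: This paper addresses the difficulty of training large language models to reason efficiently, and specifically the reward collapse that appears when length-penalizing rewards are added to Group Relative Policy Optimization (GRPO) to reduce verbosity. The authors identify the root mechanism: GRPO's group normalization creates divergent advantages when incorrect answers receive continuous length penalties, which destabilizes optimization and severely degrades reasoning. Restricting penalties to correct answers avoids this structural failure but still leaves the model open to a stochastic collapse driven by over-compression. The proposed solution, ACOER (Adaptive Correct-Only Efficiency Reward), confines brevity bonuses to correct completions, which removes the structural penalty loop, and adds dynamic budget normalization and control-loop penalty adjustments to prevent stochastic over-compression. Across several mathematical reasoning benchmarks, ACOER reduces generated tokens by over 60% while improving overall accuracy relative to the base model.

Link: https://arxiv.org/abs/2606.22716
Authors: Jungseob Lee,Seungyoon Lee,Seongtae Hong,Minhyuk Kim,Chanjun Park,Heuiseok Lim
Affiliations: Korea University; Soongsil University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 3 figures, 7 tables. Code: this https URL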

Abstract:Training large language models to reason efficiently is a critical challenge. While integrating length-penalizing rewards into Group Relative Policy Optimization (GRPO) aims to reduce verbosity, it frequently triggers reward collapse, severely degrading reasoning capabilities. Through a systematic evaluation of various reward configurations, we identify the root mechanism: GRPO’s group normalization creates divergent advantages when incorrect answers receive continuous length penalties. Consequently, methods penalizing the length of incorrect answers are structurally prone to collapse under sustained optimization. Furthermore, restricting penalties exclusively to correct answers avoids this primary failure, but leaves the model susceptible to a stochastic collapse driven by response over-compression. To robustly prevent both failure modes, we propose ACOER (Adaptive Correct-Only Efficiency Reward). ACOER eliminates the structural penalty loop by isolating brevity bonuses to correct completions and prevents stochastic compression via dynamic budget normalization and control-loop penalty adjustments. Evaluated across diverse mathematical reasoning benchmarks, ACOER improves overall accuracy compared to the base model while reducing token generation by over 60%, establishing a fundamentally stable approach for efficiency-aware optimization.
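
To make the correct-only idea concrete, here is a minimal sketch under stated assumptions. The reward shape, the `bonus` weight, and the fixed `budget` are illustrative choices that do not come from the paper, and the dynamic budget normalization and control-loop adjustments are not modeled. The advantage step is the usual group normalization (reward minus group mean, divided by group standard deviation).

```python
import statistics


def correct_only_reward(correct: bool, length: int, budget: int,
                        bonus: float = 0.5) -> float:
    """Hypothetical reward: incorrect answers get 0 regardless of length;
    correct answers get 1 plus a bonus for tokens saved under the budget."""
    if not correct:
        return 0.0
    saved = max(0.0, 1.0 - length / budget)
    return 1.0 + bonus * saved


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantages, as used in GRPO-style training."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One sampled group: (is_correct, response_length_in_tokens)
group = [(True, 100), (True, 400), (False, 50), (False, 900)]
rewards = [correct_only_reward(c, n, budget=1000) for c, n in group]
adv = group_advantages(rewards)
```

In this sketch the two incorrect responses receive the same advantage even though their lengths differ widely, so length never shapes the gradient on wrong answers; only the correct responses are ranked by brevity.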

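[NLP-69] Black-Box Forensics for Conversational LLM Agents

[Quick Read]: This paper addresses the lack of traceability and accountability for LLM-powered scams. The core question is how to perform forensics on anonymously deployed conversational LLM agents without access to model parameters or the hidden system prompt: identifying the base model behind an endpoint (attribution) and detecting when two endpoints run the same system prompt (fingerprinting). The authors study both capabilities empirically. Attribution classifiers identify the base model with 98% accuracy from a few turns of non-adversarial conversation. Attributing system prompts is possible but requires retraining on a large amount of data for each prompt, which is costly because prompts in the wild are unbounded and ever-changing. For this open-ended setting, a cross-encoder fingerprinting method reaches an AUC of 0.768 and an F1 of 0.703 on entirely unseen system prompts, and aggregating 50 conversations from each target agent raises the AUC to 0.943. The approach offers a practical route to tracing AI-enabled scams back to model providers and linking individual scams into networks.

Link: https://arxiv.org/abs/2606.22698
Authors: Isadora White,Yasaman Jafari,Taylor Berg-Kirkpatrick
Affiliations: University of California, San Diego
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: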

Abstract:As LLM-powered scams proliferate, black-box forensics for conversational LLM agents offers a path to accountability for systems hidden behind anonymous endpoints. Identifying the base model behind a chatbot endpoint (attribution), without model parameter access or knowledge of the hidden system prompt, would let investigators trace AI-enabled scams back to the providers whose models power them. Detecting when two endpoints run the exact same system prompt (fingerprinting), even one novel and unseen, would link individual scams into criminal networks and expose silent API changes. We conduct an empirical investigation of both capabilities. Our attribution classifiers identify the base model behind an agent with 98% accuracy from a few turns of non-adversarial conversation. Attribution of system prompts, while possible, requires retraining on a large amount of data for each prompt; system prompts in the wild are unbounded and ever-changing, making this approach costly. To tackle this more open-ended setting, our cross-encoder fingerprinting method achieves an AUC of 0.768 and an F1 of 0.703 on entirely unseen system prompts, and aggregating 50 interaction conversations from each target agent boosts AUC to 0.943. Conversational agents with unseen system prompts can thus be fingerprinted with robust accuracy from a few turns of ordinary conversation.

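[NLP-70] Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG

[Quick Read]: This paper addresses two core problems of Retrieval-Augmented Generation (RAG) on multi-hop question answering: errors that propagate across iterative retrieval rounds, and over-generated reasoning steps that raise cost without improving accuracy. The proposed solution is Grounded Delta Planning RAG (GDP-RAG), a plan-based framework built on three design choices: (1) preliminary retrieval that grounds the plan before execution; (2) a gap-conditioned planning prompt that asks only for missing information; (3) a skeletal trajectory that pairs each subquery with a Thought capturing evidence from preliminary retrieval and carries it through to the final answer. By focusing computation on unresolved information gaps, the method yields concise and reliable reasoning trajectories. On HotpotQA, 2WikiMultiHopQA, and MuSiQue, GDP-RAG achieves the highest accuracy among the compared systems (60.63%) at a cost-of-pass of 0.51, which is 22% lower than PAR-RAG (0.65) and 68% lower than KnowTrace (1.57); no compared method achieves both higher accuracy and lower cost.

Link: https://arxiv.org/abs/2606.22681
Authors: Wei-Chieh Chou,Xuanjun Chen,Jian-Ren Lin,Claire Lin,Hung-yi Lee,Jyh-Shing Roger Jang
Affiliations: National Taiwan University; NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to COLM 2026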

Abstract:Multi-hop question answering remains challenging for Retrieval-Augmented Generation (RAG) because existing approaches either propagate errors across iterative retrieval rounds or over-generate reasoning steps, increasing cost without improving accuracy. We propose Grounded Delta Planning RAG (GDP-RAG), a plan-based framework that targets only the information delta based on three simple design choices: (1) preliminary retrieval to ground planning before execution, (2) a gap-conditioned planning prompt that asks only for missing information, and (3) a skeletal trajectory that pairs each subquery with a Thought capturing evidence from preliminary retrieval and carrying it through to the final answer. GDP-RAG focuses computation on unresolved gaps, yielding concise, reliable reasoning trajectories. Extensive experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue show that GDP-RAG achieves the highest accuracy (60.63%) among all compared systems while maintaining a cost-of-pass of 0.51, 22% lower than PAR-RAG (0.65) and 68% lower than KnowTrace (1.57), with no method achieving both higher accuracy and lower cost.
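
The two relative reductions quoted in the abstract follow directly from the reported cost-of-pass values. The short check below uses only the numbers stated above; the helper name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


gdp_rag = 0.51  # cost-of-pass reported for GDP-RAG
vs_par_rag = relative_reduction(0.65, gdp_rag)    # about 0.215
vs_knowtrace = relative_reduction(1.57, gdp_rag)  # about 0.675

print(f"vs PAR-RAG:   {vs_par_rag:.1%}")
print(f"vs KnowTrace: {vs_knowtrace:.1%}")
```

Rounded to whole percentages, these give the 22% and 68% figures in the abstract.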

[NLP-71] Orthogonal Representation Editing: Decoupling Semantic Entanglement in Batch Knowledge Editing of LLMs ACL2026

[Quick Read]: This paper addresses the performance degradation of Large Language Models (LLMs) under batch knowledge editing. The authors attribute it to semantic representation entanglement, such as overlapping concepts and shared syntactic patterns, which accumulates interference in the representation space and reduces editing precision. They propose Orthogonal Representation Editing (ORE), which edits in the hidden representation space by constructing a general semantic subspace and enforcing orthogonal constraints on edit vectors, thereby decoupling the entanglement. A gated non-linear representation head additionally enables adaptive learning of editing locations and precise control over knowledge injection. Experiments show that ORE outperforms existing methods and achieves superior performance in cross-lingual knowledge editing scenarios.

Link: https://arxiv.org/abs/2606.22627
Authors: Wenhao Yu,Zhicong Lu,Bo Lv,Fangyin Ma,Kaiwen Wei,Shihao Yang,Nayu Liu
Affiliations: Tianjin University; Kexin Technology; University of Chinese Academy of Sciences; Tencent Hunyuan; Chongqing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of ACL 2026

Abstract:Knowledge editing aims to efficiently update factual information in Large Language Models (LLMs) without full retraining. However, existing methods still suffer from performance degradation in batch knowledge editing. We identify that semantic representation entanglement, such as overlapping concepts and shared syntactic patterns, accumulates interference in the representation space and reduces editing precision. To bridge this gap, in this paper, we propose Orthogonal Representation Editing (ORE), which performs edits in the hidden representation space of LLMs by constructing a general semantic subspace and enforcing orthogonal constraints on edit vectors, effectively decoupling semantic entanglement. Furthermore, we introduce a gated non-linear representation head to enable adaptive learning of editing locations and precise control over knowledge injection. Extensive experiments show that ORE outperforms existing methods and achieves superior performance in cross-lingual knowledge editing scenarios. We release our code at this https URL.
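
One common way to impose an orthogonality constraint is to project each edit vector onto the orthogonal complement of a given subspace. The sketch below shows that generic operation with Gram-Schmidt in plain Python. It is an assumption about what such a constraint could look like, not ORE's actual procedure, and the function names are hypothetical.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors: list[list[float]],
                   tol: float = 1e-12) -> list[list[float]]:
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis: list[list[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(edit: list[float],
                subspace: list[list[float]]) -> list[float]:
    """Remove from `edit` every component that lies in `subspace`."""
    out = list(edit)
    for b in orthonormalize(subspace):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


shared = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy "shared semantics" subspace
cleaned = project_out([2.0, 3.0, 4.0], shared)
```

After projection the toy edit vector keeps only its third coordinate, so it can no longer interfere with directions inside the shared subspace.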

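[NLP-72] Automated sign detection across the Electronic Babylonian Library: A large-scale dataset and end-to-end cuneiform OCR pipeline

[Quick Read]: This paper addresses the scarcity of annotated data and the limited scalability of automated cuneiform tablet reading, both of which stem from the very high cost of manual analysis. Of the roughly half million excavated tablets, only a small fraction has been analysed by Assyriologists. The authors use the largest annotated cuneiform sign dataset to date and evaluate a Deformable Detection Transformer (DETR)-based object detection model under two class granularities, 173 and 106 classes. The key to the solution is an end-to-end pipeline that combines automatic tablet-side extraction, heuristic line grouping, and n-gram-based textual similarity evaluation to connect visual sign detection with textual structure, achieving improvements of up to 28-37% over prior work on COCO-style detection metrics. At inference, the method is applied to 87,668 tablet fragments from the Electronic Babylonian Library (eBL) corpus and produces nearly 2.9 million sign detections. The approach uses no linguistic priors and remains sensitive to tablet damage and layout variability, but it offers a scalable and interpretable foundation for corpus-wide cuneiform analysis and for future integration with multimodal and linguistic models.

Link: https://arxiv.org/abs/2606.22608
Authors: Wentao Che,Esteban Garcés Arias,Asim Niaz,Andreas Bender,Enrique Jiménez
Affiliations: LMU Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Under review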

Abstract:Learning to read cuneiform tablets is an extremely demanding task; consequently, of the roughly half million excavated tablets, only a small fraction has been analysed by Assyriologists. Computer vision offers a promising avenue for decipherment but requires large, densely annotated datasets. To address this limitation, the largest annotated cuneiform sign dataset to date is used, and a Deformable Detection Transformer (DETR)-based object detection model is evaluated under two class granularities of 173 and 106 classes. The proposed system integrates automatic tablet-side extraction, heuristic line grouping, and n-gram-based textual similarity evaluation to bridge visual sign detection and textual structure, and achieves consistent improvements of up to 28-37% over prior work on COCO-style detection metrics. At inference, the method is applied to 87,668 tablet fragments from the Electronic Babylonian Library (eBL) corpus, producing nearly 2.9 million sign detections. Although the approach operates without linguistic priors and remains sensitive to tablet damage and layout variability, it provides a scalable and interpretable foundation for corpus-wide cuneiform analysis and supports future integration with multimodal and linguistic modelling frameworks.

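[NLP-73] Sub-Billion Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction

[Quick Read]: This paper addresses the difficulty of deploying Large Language Models (LLMs) for relation extraction (RE) in resource-constrained or privacy-sensitive settings, given their computational demands and reliance on proprietary APIs. It investigates how far small language models (SLMs) can close the gap on general-domain and literary text through targeted task adaptation. The key finding is that, when task-specific data are available, a fine-tuned sub-billion model (Qwen2.5-0.5B) reaches a general-domain positive-class micro-F1 of 0.83, compared with 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 evaluated zero-shot, and such 4-bit models can be deployed on a single consumer GPU. On literary RE, tuned SLMs also exceed the zero-shot frontier models. An in-domain RoBERTa baseline likewise beats both frontier models, which indicates that the gain comes from task adaptation rather than from generative decoding. A domain-adaptive pretraining case study yields no practically meaningful gain over supervised fine-tuning, and the cleanest within-family scale comparison shows only marginal improvement. Compact task-adapted models can therefore provide accurate, private, and hardware-efficient RE.

Link: https://arxiv.org/abs/2606.22606
Authors: Despina Christou,Grigorios Tsoumakas
Affiliations: Aristotle University of Thessaloniki; Archimedes, Athena Research Center
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 41 pages, 3 figures, 25 tables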

Abstract:Large language models (LLMs) achieve strong relation extraction (RE), but their computational demands and reliance on proprietary APIs limit deployment in resource-constrained or privacy-sensitive settings. We investigate how far small language models (SLMs) can close this gap across general-domain and literary text. We evaluate five models from 360M to 3B parameters under three domain-composition regimes and two prompt-conditioned tuning styles (30 configurations), comparing them with zero-shot frontier LLMs and a discriminative RoBERTa baseline. Across nine benchmarks, the best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, achieves a general-domain positive-class micro-F1 of 0.83, versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 evaluated zero-shot. This does not imply that SLMs are intrinsically stronger; rather, targeted task adaptation enables 4-bit models deployable on a single consumer GPU to outperform general-purpose frontier systems under this protocol. An in-domain RoBERTa baseline also exceeds both frontier models, indicating that the gain stems from task adaptation rather than generative decoding. On literary RE, tuned SLMs reach 0.92 on the human-annotated Biographical benchmark versus 0.83 for GPT-5.4, and 0.833 versus 0.578 on the two-benchmark literary average. A targeted domain-adaptive pretraining case study yields no practically meaningful gain over supervised fine-tuning, while the cleanest within-family scale comparison shows only marginal improvement. These results show that, when task-specific data are available, compact task-adapted models can provide accurate, private, and hardware-efficient RE.

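[NLP-74] Context-Aware Distillation and Ablation for Text2DSL

[Quick Read]: This paper addresses unreliable generation quality and runtime failures in the automatic generation of domain-specific language (DSL) code from natural language, focusing on the syntactic correctness and actual executability of generated Polkit rules. Prompt-only synthetic data generation struggles to guarantee validity at the abstract syntax tree (AST) level and compatibility at runtime. The key to the solution is context-aware distillation: a teacher LLM (DeepSeek-V4-Flash) generates code under an explicitly structured context (a BNF grammar, an API specification, and a closed identifier vocabulary), and the output is filtered by a two-tier verification pipeline, with AST validation through esprima followed by runtime acceptance through the production polkitd daemon and the pkcheck client. This scales the verified PolkitBench corpus from 4,204 to 10,073 natural-language-to-Polkit-rule pairs at 100.0% AST validity and a 99.7% runtime pass rate. A factorial ablation over eight conditions (C0-C7) on GigaChat-10B-A1.8B shows that structured context is a load-bearing mechanism rather than a cosmetic improvement: the vocabulary contributes most to semantic quality, while the API specification and the BNF grammar contribute most to structural validity.

Link: https://arxiv.org/abs/2606.22578
Authors: Alexander V. Kozachok,Alexander M. Nazimov,Shamil G. Magomedov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 3 figures, 6 tables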

Abstract:We extend our prior work on Text2DSL, the automatic generation of domain-specific language (DSL) code from natural language descriptions, along two complementary axes. First, we replace prompt-only synthetic generation with context-aware distillation, in which a teacher large language model (DeepSeek-V4-Flash) operates under an explicitly defined structured context comprising a BNF grammar, an API specification, and a closed identifier vocabulary; the resulting corpus is verified by a two-tier pipeline combining AST validation through esprima and runtime acceptance through the production polkitd daemon and the pkcheck client. This scales the verified PolkitBench corpus from 4,204 to 10,073 natural-language-to-Polkit-rule pairs at 100.0% AST validity and 99.7% runtime pass rate. Second, we conduct the per-component factorial ablation of structured context that was identified as future work in the precursor study: eight conditions C0-C7 are evaluated on GigaChat-10B-A1.8B with the new corpus. Three findings emerge. (i) The new, harder corpus collapses the baseline mode (Syntax Valid 97.6% → 58.5%, Combined Score 0.482 → 0.252), whereas the context-enhanced mode degrades only marginally (Syntax 98.6% → 97.4%, Combined 0.801 → 0.750), confirming that structured context is not a cosmetic improvement but a load-bearing mechanism. (ii) The best absolute condition is the full context C7 across all metrics, while the strongest partial conditions (C5 = BNF + Vocabulary, C6 = API + Vocabulary) both contain the vocabulary. (iii) A Shapley-style decomposition assigns the largest semantic-quality effect to the vocabulary (Combined +0.198), the largest structural-validity effects to API (+24.7 pp) and BNF (+22.3 pp).
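
Three on/off context components give 2^3 = 8 conditions, which matches the C0-C7 design, and a Shapley-style decomposition averages each component's marginal contribution over all orders in which components can be added. The sketch below shows that computation in general form. The `toy_value` scores are invented for illustration and are not the paper's measurements.

```python
from itertools import permutations


def shapley(players: list[str], value) -> dict[str, float]:
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = frozenset()
        for p in order:
            totals[p] += value(seen | {p}) - value(seen)
            seen = seen | {p}
    return {p: t / len(orders) for p, t in totals.items()}


def toy_value(subset: frozenset) -> float:
    """Invented score for a set of enabled components (illustration only)."""
    base = {"BNF": 1.0, "API": 2.0, "VOCAB": 3.0}
    score = sum(base[p] for p in subset)
    if "BNF" in subset and "API" in subset:
        score += 1.0  # toy interaction between the two components
    return score


components = ["BNF", "API", "VOCAB"]
phi = shapley(components, toy_value)
```

The per-component values always sum to the gap between the full-context score and the no-context score, which is what makes the decomposition convenient for attributing a combined gain.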

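[NLP-75] What are Key Factors for Updates in RL for LLM Reasoning?

[Quick Read]: This paper examines why methods for Reinforcement Learning from Verifiable Rewards (RLVR), largely guided by heuristic intuition, make divergent and even contradictory algorithmic choices yet still report empirical gains. The theoretical analysis shows that differences in off-policy degree, determined by the number of gradient steps per rollout, substantially change the distribution of importance sampling ratios and their clipping behavior, and therefore change which tokens dominate the update. The authors characterize gradient expectation as the central quantity governing update dynamics and analyze the roles of token probability, advantage, and importance sampling ratio. Building on this, they propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries across token groups according to the empirical variance of their importance sampling ratios. Experiments on 3B and 7B models across mathematical problem solving, tabular QA, and logic puzzles show that ACPO outperforms strong baselines such as DAPO and CISPO.

Link: https://arxiv.org/abs/2606.22570
Authors: Peidong Wang,Demi Wang,Xufang Luo,Jiahang Xu,Xiaocui Yang,Shi Feng,Yuqing Yang,Dongsheng Li
Affiliations: Microsoft Research; Northeastern University; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: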

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning ability of large language models. However, much of the existing work is guided by heuristic intuition, leading to divergent algorithmic choices, even contradictory ones that nevertheless report empirical gains. To better understand this phenomenon, we conduct a theoretical analysis of RLVR updates. Our study reveals that differences in off-policy degree, determined by the number of gradient steps per rollout, substantially affect the distribution of importance sampling ratios and their clipping behavior, thereby altering which tokens dominate the update. Building on this insight, we characterize gradient expectation as the central quantity governing update dynamics and analyze the roles of token probability, advantage, and importance sampling ratio. Motivated by these findings, we propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries across token groups according to the empirical variance of their importance sampling ratios. Experiments on 3B and 7B models across diverse reasoning benchmarks, spanning mathematical problem solving, tabular QA, and logic puzzles, demonstrate that ACPO outperforms strong baselines such as DAPO and CISPO. These results demonstrate that principled, analysis-driven approaches yield more robust and effective RLVR methods. Code is available in: this https URL

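[NLP-76] Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

[Quick Read]: This paper addresses a weakness of few-shot prompt learning when adapting CLIP to downstream tasks: class-only prompt optimization can overfit base-class supervision and weaken transfer to unseen classes. The proposed solution is Concept-Constrained Prompt Learning (CCPL), a lightweight regularization framework that anchors learnable class prompts to frozen concept-level text prototypes without updating the CLIP encoders. CCPL learns shared context tokens, instantiates class prompts by appending class names, and builds frozen concept prototypes from a class-level concept bank. During training, a text-space cosine consistency objective aligns the learnable class-prompt embeddings with the frozen prototypes, and concept dropout guards against over-reliance on fixed concept lists. At inference, class-prompt logits can optionally be fused with concept-prototype logits using a controllable ensemble weight alpha. Under identical automatically generated fallback splits, CCPL improves the base-to-new harmonic mean over CoOp by 0.6 on DTD and 2.9 on EuroSAT, while remaining near-neutral on OxfordPets (-0.1). Ablations indicate that text-space concept regularization is consistently beneficial, whereas the best concept-guided inference strength is dataset- and protocol-sensitive. The method works best when concept prototypes align naturally with dataset semantics, and fine-grained categories remain a current boundary condition.

Link: https://arxiv.org/abs/2606.22567
Authors: Na Sang,Ding Ma,Rui Sang,Yuxuan Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
Comments: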

Abstract:Few-shot prompt learning is an effective strategy for adapting CLIP to downstream tasks, but class-only prompt optimization can overfit base-class supervision and weaken transfer to unseen classes. We propose Concept-Constrained Prompt Learning (CCPL), a lightweight regularization framework that anchors learnable class prompts to frozen concept-level text prototypes without updating CLIP encoders. CCPL learns a set of shared context tokens, instantiates class prompts by appending class names, and constructs frozen concept prototypes from a class-level concept bank. During training, a text-space cosine consistency objective aligns learnable class-prompt embeddings with frozen concept prototypes; concept dropout provides additional regularization against over-reliance on fixed concept lists. At inference, CCPL optionally fuses class-prompt logits with concept-prototype logits using a controllable ensemble weight alpha. Our default configuration uses text-space concept regularization lambda = 0.5, concept dropout p = 0.3 and weak concept-guided fusion (alpha = 0.1), with no KL-based prediction consistency term. Experiments under identical automatically-generated fallback splits show that CCPL improves the base-to-new harmonic mean on DTD (+0.6) and EuroSAT (+2.9) compared with CoOp, while remaining near-neutral on OxfordPets (-0.1). Ablations indicate that text-space concept regularization is consistently beneficial, while the best concept-guided inference strength is dataset- and protocol-sensitive. These results suggest concept constraints are most effective when concept prototypes align naturally with dataset semantics, and identify fine-grained categories as a current boundary condition. The code is released at: this https URL.
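
The two ingredients named in the abstract, a text-space cosine consistency term and alpha-weighted logit fusion, can be sketched in plain Python. The exact formulas are assumptions: the loss is taken as the mean of one minus cosine similarity, and the fusion as a convex combination. The abstract does not spell out either form, and the function names are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def consistency_loss(prompt_embs: list[list[float]],
                     proto_embs: list[list[float]]) -> float:
    """Assumed form: mean (1 - cosine) between each class-prompt embedding
    and its frozen concept prototype."""
    losses = [1.0 - cosine(p, c) for p, c in zip(prompt_embs, proto_embs)]
    return sum(losses) / len(losses)


def fuse_logits(class_logits: list[float], concept_logits: list[float],
                alpha: float = 0.1) -> list[float]:
    """Assumed form: convex combination weighted by the ensemble weight."""
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(class_logits, concept_logits)]


fused = fuse_logits([1.0, 0.0], [0.0, 1.0], alpha=0.1)
aligned = consistency_loss([[1.0, 0.0]], [[2.0, 0.0]])
```

With alpha = 0.1, the default weak fusion mentioned in the abstract, the class-prompt logits still dominate the prediction and the concept prototypes act only as a small correction.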

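[NLP-77] Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do ACL2026

[Quick Read]: This paper systematically investigates what multimodal Chain-of-Thought (CoT) reasoning can do, and where and why it falls short. The authors evaluate 12 multimodal tasks across perception and reasoning categories using 14 non-reasoning models and 8 reasoning models. The main findings are: (1) CoT is not a free lunch and should be used selectively; on perception tasks it can reduce performance in visual grounding and object counting, while it is effective on mathematical, scientific, and multi-image reasoning; (2) compared with their original models, existing open-source multimodal reasoning models often bring only marginal overall improvements, possibly because they overemphasize mathematical reasoning at the expense of broader capabilities; (3) visual reasoning remains a key bottleneck, as models show a "Look Light, Think Heavy" pattern in which verbal reflection rises and falls during reasoning whereas visual reflection consistently diminishes. The results suggest that current multimodal CoT handles verbal reflection relatively well but cannot sustain deep visual introspection throughout the reasoning process.

Link: https://arxiv.org/abs/2606.22565
Authors: Zhuoran Jin,Kejian Zhu,Hongbang Yuan,Yupu Hao,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL 2026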

Abstract:Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks remains unclear. In this paper, we aim to systematically investigate the key question: What can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 12 multimodal tasks across perception and reasoning categories using both 14 non-reasoning models and 8 reasoning models. Our analysis reveals several important findings: (1) CoT is not a free lunch and should be used selectively depending on the specific requirements of each task. For perception tasks, CoT can lead to undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective for reasoning tasks involving mathematical, scientific, and multi-image reasoning; (2) Compared to original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly due to an overemphasis on mathematical reasoning at the expense of broader capabilities; (3) Visual reasoning remains a key bottleneck for current multimodal CoT, as models exhibit a Look Light, Think Heavy pattern where verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.

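[NLP-78] Training-Free Semantic Correction for Autoregressive Visual Models

[Quick Read]: This paper addresses the difficulty of identifying and correcting semantic errors in autoregressive visual models (AVMs) for image and video generation, which arises because generation is decomposed into discrete scales of varying granularity. Such errors accumulate along the generation process and degrade the final output. Existing ways to improve AVMs are either training-based, which is effective but computationally expensive, or training-free, which avoids extra training but ignores intermediate generation states and so leaves semantic errors undiagnosed. The authors propose Gazer, a training-free framework that brings multimodal large language model feedback into the AVM sampling loop for in-generation semantic correction. It works through two cooperating stages: Reflective Diagnosis identifies semantic errors from intermediate states, and Semantic Correction rewinds and rectifies the generation trajectory so that it realigns with the target prompt. Experiments on compositional image and video benchmarks show that Gazer improves semantic alignment and compositional accuracy across multiple AVMs without additional training.

Link: https://arxiv.org/abs/2606.22550
Authors: Junhao Chen,Chanyu Zhu,Zheqi Lv,Keting Yin,Shengyu Zhang
Affiliations: Zhejiang University; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: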

Abstract:Autoregressive visual models (AVMs) based on next-scale prediction have emerged as a prominent paradigm for image and video synthesis. However, decomposing the generation process into discrete scales with varying granularities in AVM makes semantic errors difficult to identify and correct, thereby undermining the quality of the final output. Prior efforts to enhance AVM can be categorized into training-based and training-free approaches. Although training-based efforts to enhance AVM generation quality come at substantial computational cost, existing training-free methods neglect intermediate generation states, leaving semantic errors undiagnosed and allowing them to accumulate into the final output. In this paper, we focus on training-free paradigms and propose Gazer, a framework that integrates multimodal large language model feedback into the AVM sampling loop for in-generation semantic correction. Concretely, Gazer operates via two cooperating stages: the Reflective Diagnosis stage diagnoses semantic errors from intermediate states, while the Semantic Correction stage rewinds and rectifies the generation trajectory to realign with the target prompt. Experiments on compositional image and video benchmarks demonstrate that Gazer improves semantic alignment and compositional accuracy across multiple AVMs without additional training.

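[NLP-79] Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding

[Quick Read]: This paper addresses the "likelihood trap" in open-ended LLM generation, where output degenerates into repetition and dull vocabulary and diverges from human-written text. Post-hoc tail truncation (e.g., Top-p, Min-p) avoids sampling from the unreliable tail but can over-sample the uncalibrated head and drift from human lexical preferences, while fixed scalar repetition penalties ignore how the logit scale varies across inference steps and may disrupt semantic coherence. The authors propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding intervention that reshapes the probability distribution before truncation through two dynamic mechanisms: (1) Contextual Searchlight via PMI, which suppresses global stopwords while elevating context-evoked tokens; (2) Adaptive Self-Debiasing, which uses the real-time standard deviation of the logits for scale-invariant penalization. Across open-ended generation, factual QA, and mathematical reasoning, VCM consistently mitigates the likelihood trap. It adds negligible computational overhead, integrates with existing decoding strategies, and improves diversity, coherence, and, particularly at higher decoding temperatures, reasoning accuracy.

Link: https://arxiv.org/abs/2606.22511
Authors: Yuanhao Ding,Meimingwei Li,Esteban Garces Arias,Matthias Aßenmacher,Christian Heumann,Chongsheng Zhang
Affiliations: Henan University; LMU Munich; Munich Center for Machine Learning (MCML)
Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Under Review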

Abstract:In open-ended generation, LLMs frequently fall into the “likelihood trap”, marked by repetitive degeneration and vocabulary dullness, creating a discrepancy between machine-generated and human-written text. While post-hoc tail truncation (e.g., Top-p, Min-p) avoids sampling from the unreliable tail, it can over-sample from the uncalibrated head and misalign generation with human lexical preferences; fixed scalar repetition penalties likewise ignore variation in logit scale across inference steps, potentially disrupting semantic coherence. To address both limitations, we propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding intervention that reshapes the probability distribution before truncation through two dynamic mechanisms: (1) Contextual Searchlight via PMI, which suppresses global stopwords while elevating context-evoked tokens, and (2) Adaptive Self-Debiasing, which uses real-time logit standard deviation for scale-invariant penalization. Across open-ended generation, factual QA, and mathematical reasoning, VCM consistently mitigates the likelihood trap. With negligible computational overhead, VCM integrates with existing decoding strategies, improving diversity, coherence, and, particularly at higher decoding temperatures, reasoning accuracy.
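
To illustrate what scale-invariant penalization can mean, the sketch below subtracts from each already-generated token a penalty proportional to the current standard deviation of the logits. This is a simplified assumption about the idea, not VCM's actual formula; the function name and the `strength` parameter are hypothetical.

```python
import statistics


def std_scaled_penalty(logits: list[float], repeated_ids: set[int],
                       strength: float = 1.0) -> list[float]:
    """Penalize repeated tokens by `strength` times the logit std, so the
    penalty grows and shrinks with the scale of the logits."""
    sigma = statistics.pstdev(logits)
    return [x - strength * sigma if i in repeated_ids else x
            for i, x in enumerate(logits)]


logits = [2.0, 0.0, -2.0]
penalized = std_scaled_penalty(logits, repeated_ids={0})

# Multiplying every logit by 3 multiplies the penalized logits by 3 as well.
scaled = std_scaled_penalty([3.0 * x for x in logits], repeated_ids={0})
```

A fixed scalar penalty would not have this property: the same constant would be a large correction when the logits are tightly clustered and a negligible one when they are widely spread.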

[NLP-80] VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows KR2026

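[Quick Read]: This paper addresses the tension between traditional Business Process Management (BPM) systems, which adapt poorly to dynamic environments, and agentic systems based on Large Language Models (LLMs), which are limited in explainability, reliability, and scalability over large datasets. The core question is how to reason flexibly and execute efficiently on complex, dynamic tasks while keeping the decision process auditable and reproducible. The proposed solution is VADAOrchestra, a neurosymbolic framework that models workflows as evolving reasoning processes and uses a hybrid architecture to decouple high-level orchestration from symbolic inference. An LLM-based orchestrator incrementally plans and adapts the workflow, which is encoded as a logic program in a fragment of Datalog+/-: predicates correspond to tool invocations, and rules cover both predefined domain dependencies and logic constructs synthesized on demand. All logical inference is executed by a state-of-the-art Datalog+/- symbolic engine. This yields a verifiable, traceable reasoning process and, by separating high-level control from symbolic computation, eases scalability limits and supports explainable reasoning over large datasets. On real-world financial use cases, VADAOrchestra demonstrates faithfulness, scalability, and explainability compared with standard agentic architectures.

Link: https://arxiv.org/abs/2606.22485
Authors: Teodoro Baldazzi, Luigi Bellomarini, Andrea Coletta, Michela Iezzi, Carsten Maple, Alessandro Pesare, Emanuel Sallinger
Institutions: TU Wien; Banca d’Italia; University of Warwick
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Logic in Computer Science (cs.LO)
Comments: Accepted at KR 2026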

Abstract:Decision-making in real-world settings rarely follows a fixed script. Instead, it unfolds as a dynamic reasoning process in which the appropriate course of action evolves as new context and data become available. Traditional Business Process Management systems provide rigor, determinism, and auditability, yet they generally struggle to adapt their execution at runtime. Conversely, agentic systems based on Large Language Models (LLMs) bring flexibility to decision-making, but they are inherently opaque, often unreliable, and suffer from significant scalability constraints when operating over large datasets. To combine these complementary paradigms, we introduce VADAOrchestra, a neurosymbolic framework that models complex workflows as evolving reasoning processes. The framework adopts a hybrid approach: given a user query and a collection of data sources, an LLM-based orchestrator incrementally plans and adapts the workflow. This is encoded as a logic program in a fragment of Datalog+/- where predicates correspond to tool invocations and rules represent both predefined domain dependencies and logic constructs synthesized on demand to manipulate intermediate results. All logical inference tasks are then executed by a state-of-the-art Datalog+/- symbolic engine. This approach provides a verifiable reasoning trace, supporting the auditability and reproducibility of the entire process. Furthermore, by decoupling high-level orchestration from symbolic inference, it addresses scalability concerns, enabling complex reasoning over large datasets through targeted data querying. We evaluate VADAOrchestra on real-world financial use cases, demonstrating faithfulness, scalability, and explainability compared to standard agentic architectures.

[NLP-81] ROMEVA: Geometry-Preserving Vocabulary Expansion for Roman Urdu Language Models

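[Quick Read]: This paper addresses the sub-word fragmentation that multilingual pretrained models such as mBERT suffer on morphologically inconsistent languages such as Roman Urdu. Spelling variation in Roman Urdu causes severe fragmentation, averaging 1.50 sub-words per token, and makes it hard to keep the embedding space stable during vocabulary expansion. The paper proposes ROMEVA (Roman Urdu Embedding-preserving Vocabulary Adaptation), which combines sub-word-average initialization with a PCA-guided anchor loss to stabilize embeddings during vocabulary expansion. Using a 36,130-comment Roman Urdu corpus, the authors add 500 highly fragmented tokens to mBERT and compare naive fine-tuning, sub-word-aware fine-tuning, and ROMEVA. ROMEVA preserves the pretrained embedding space most effectively, yet naive fine-tuning achieves the strongest downstream sentiment classification performance. This reveals a disconnect between embedding stability and downstream performance, suggesting that stronger adaptation may be preferable to strict embedding preservation for morphologically inconsistent languages.

Link: https://arxiv.org/abs/2606.22478
Authors: Mahnoor Khan, Afsheen Asif, Milhan Afzal Khan, Seemab Latif, Mehwish Fatima
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: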

Abstract:Multilingual Language Models like mBERT are widely used for low-resource NLP, yet their adaptation to morphologically inconsistent languages such as Roman Urdu remains underexplored. Roman Urdu spelling variation causes severe sub-word fragmentation, averaging 1.50 sub-words per token. We propose ROMEVA (Roman Urdu Embedding-preserving Vocabulary Adaptation), which combines sub-word-average initialization and a PCA-guided anchor loss to stabilize embeddings during vocabulary expansion. Using a 36,130-comment Roman Urdu corpus, we add 500 highly fragmented tokens to mBERT and compare naive fine-tuning, sub-word-aware fine-tuning, and ROMEVA. While ROMEVA most effectively preserves the pretrained embedding space, naive fine-tuning achieves the strongest downstream sentiment classification performance. These findings reveal a disconnect between embedding stability and downstream performance, suggesting that stronger adaptation may be preferable to strict embedding preservation in morphologically inconsistent languages.
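Sub-word-average initialization can be illustrated with a small sketch. It assumes the common reading of the term, where a newly added token starts from the mean of the embeddings of the sub-words it previously split into. The function name, the token ids, and the 3-dimensional vectors below are all made up for illustration and are not taken from the paper.

```python
def subword_average_init(subword_ids, embedding_table):
    """Return the element-wise mean of the embeddings of a token's sub-words."""
    vectors = [embedding_table[i] for i in subword_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy embedding table: hypothetical sub-word ids mapped to made-up vectors.
embedding_table = {
    10: [1.0, 0.0, 2.0],
    11: [3.0, 2.0, 0.0],
}

# A new vocabulary entry that used to be split into sub-words 10 and 11.
new_vector = subword_average_init([10, 11], embedding_table)
```

Starting a new token at this average places it near its old sub-word pieces, so the first forward passes after expansion see a representation close to the one the model already used.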

[NLP-82] Not All Claims Are Equally Risky: FACTOR for Adaptive Verification in Factual Long-Form Generation

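[Quick Read]: This paper addresses the unsupported factual claims (hallucinations) that Large Language Models (LLMs) frequently introduce when generating long-form text. Existing verification techniques improve factuality by grounding generation in external evidence, but they generally apply one uniform verification policy and ignore the fact that hallucination risk differs from claim to claim. The paper proposes FACTOR (FACTuality-Oriented Risk-aware Verification), an inference-time adaptive verification framework that adjusts verification according to claim-level uncertainty. FACTOR combines uncertainty estimation, adaptive language inference verification, and candidate re-ranking to direct verification effort to the claims that need it most. Experiments on the FactScore benchmark show that adaptive verification improves factuality while reducing verification cost, and ablation studies are used to identify the primary driver of these gains. The results indicate that FACTOR is effective and model-agnostic for improving factuality in long-form generation.

Link: https://arxiv.org/abs/2606.22474
Authors: Areeba Hassan, Arooj Kausar, Syeda Kisaa Fatima, Gibrail Islam, Mehwish Fatima
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: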

Abstract:Large Language Models (LLMs) generate fluent long-form text but often add unsupported factual claims. Existing verification techniques improve factuality by grounding generation in external evidence. However, the same verification policy is usually applied to all claims despite differences in their hallucination risk. We propose FACTOR (FACTuality-Oriented Risk-aware Verification), an inference-time model that adapts verification criteria according to claim-level uncertainty. FACTOR combines uncertainty estimation, adaptive language inference verification, and candidate re-ranking to allocate verification effort where it is most needed. We evaluate FACTOR on the FactScore benchmark, showing that adaptive verification improves factuality while reducing verification cost. We further perform ablation studies to identify the primary driver of these gains. Our results show that FACTOR is effective and model-agnostic for improving factuality in long-form generation.

[NLP-83] Interleaved Speech Language Models Latently Work In Text

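[Quick Read]: This paper addresses the unclear interaction between the speech and text modalities in the latent space of speech language models (SLMs). A leading approach is speech-text interleaving, in which training sequences mix speech and text tokens to boost even speech-only capabilities, yet how the two modalities are combined internally is poorly understood. Using the logit lens to analyze interleaved speech-text LMs from different model families and sizes, the paper reveals an implicit transcription phase: although the models are not trained for speech recognition, the text token of the spoken word becomes decodable in intermediate layers and appears among the top candidate words for as much as 77% of the data. The models then predict the next word in text space before transforming back to the speech domain. The paper further analyzes the roles of interleaved data and of initialization from text LMs in eliciting this behavior, and how the behavior correlates with spoken knowledge abilities. The key contribution is exposing this implicit "transcribe, predict in text, map back to speech" process, which sheds light on the internal mechanisms of these models and could inform SLM optimization.

Link: https://arxiv.org/abs/2606.22473
Authors: Talia Sternberg, Gallil Maimon, Yossi Adi
Institutions: The Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint. 23 pages, 20 figures, 5 tables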

Abstract:Speech language models (SLMs) have been extensively studied, with the common paradigm incorporating text data and pre-trained text LMs. A leading approach is speech-text interleaving in which models are trained over sequences containing both speech and text tokens, aiming to boost even speech-only capabilities. Yet the way these two modalities interact in the model latent space remains unclear. In this work, we analyze interleaved speech-text LMs from different model families and sizes through the scope of the logit lens to provide such insight. We reveal that these models go through an implicit transcription phase in which the text token of the spoken word becomes decodable in intermediate layers, despite not being trained for speech recognition. The transcription of the word appears as one of the top candidate words for as much as 77% of the data. Following this stage, the models proceed to predict the next word in the text space before transforming back to the speech domain. We finally analyze the role of interleaving data, and initializing from text LMs in eliciting this behavior, as well as seeing how this correlates with spoken knowledge abilities. Our analysis sheds light on the internal mechanisms underlying the relationship between speech and text modalities and could shape SLM optimization.
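The logit lens mentioned above reads intermediate hidden states through the model's output (unembedding) matrix to see which token each layer currently favors. The sketch below is a toy version under stated assumptions: it skips the final layer normalization that real implementations usually apply, and the vocabulary, matrix, and hidden states are invented numbers, not values from any model in the paper.

```python
def logit_lens(hidden_states, unembedding, vocab):
    """Decode each layer's hidden state with the output (unembedding) matrix."""
    decoded = []
    for h in hidden_states:
        # One logit per vocabulary entry: dot product of its row with h.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


# Toy 2-dimensional setup; every value here is hypothetical.
vocab = ["<speech_unit>", "cat", "sat"]
unembedding = [
    [1.0, 0.0],   # direction for "<speech_unit>"
    [0.0, 1.0],   # direction for "cat"
    [-1.0, 0.0],  # direction for "sat"
]
hidden_states = [
    [0.9, 0.1],   # early layer
    [0.2, 0.8],   # middle layer
    [-0.7, 0.3],  # late layer
]
per_layer_tokens = logit_lens(hidden_states, unembedding, vocab)
```

Applied to a real model, the same per-layer decoding is what would let one check whether a text token for the spoken word becomes the top candidate in intermediate layers.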

[NLP-84] CASPER in the Machine: Insights into Character Variety in LLM-Generated Stories ACL

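[Quick Read]: This paper asks how characters in LLM-generated fiction differ from characters in human-written stories. Its focus is on how characters are portrayed within their stories rather than on their basic attributes. Borrowing definitions from narratology, the study analyzes eight intricate dimensions of character, such as stylization and wholeness. The key to the approach is automatically inferring character categories in both LLM-generated and human-written stories and then comparing the two sets, asking whether their characters are similar and whether LLMs generate a variety of characters. The analysis covers stories generated by popular LLMs and recently published human-written stories, and reports a number of similarities, differences, and key takeaways.

Link: https://arxiv.org/abs/2606.22454
Authors: Anneliese Brei, Abhisheik Sharma, Nicholas Sanaie, Lu Wang, Snigdha Chaturvedi
Institutions: UNC Chapel Hill; Georgia Institute of Technology; University of Michigan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Proceedings of ACL, 2026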

Abstract:As LLM-generated text is increasingly used, especially in fictional domains, we explore how much LLM-generated stories differ from human-written stories. In this work, we focus on characters. We borrow definitions from narratology to analyze eight intricate dimensions of character, such as stylization and wholeness. These dimensions consider more than just basic characteristics. They assess how characters are portrayed within their stories. After automatically inferring categories of characters within both LLM and human-written stories, we compare and contrast these two sets of stories. We consider the following overarching questions: (1) Do LLMs and human-written stories have similar characters? and (2) Do LLMs generate stories with a variety of characters? Our analysis includes research questions that focus on stories generated by popular LLMs and recently published human-written stories. We describe a number of interesting similarities, differences and key takeaways.

[NLP-85] Words as Difference Makers: How Large Language Models Determine Causal Structure in Text

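[Quick Read]: The puzzle this paper addresses is that, although the success of large language models (LLMs) at predicting text suggests they have access to a "world model" of causal and definitional structure, the dominant formalisms of modern causal inference, Judea Pearl's interventionist approach and the Neyman-Rubin potential outcomes framework, struggle to explain how LLMs learn causal structure. The paper argues that LLMs employ a specific inductive approach based on a difference-making logic, sometimes called variational induction. During training on enormous amounts of text from a wide range of contexts, LLMs identify difference-makers and indifference-makers within word sequences. The paper also analyzes architectural features such as token embeddings and self-attention to determine their roles in variational induction. This difference-making logic fundamentally parallels the experimental method, in which causal relations are derived by systematically varying individual circumstances to determine their influence on a phenomenon.

Link: https://arxiv.org/abs/2606.22430
Authors: Wolfgang Pietsch
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 36 pages, 6 figures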

Abstract:Because large language models (LLMs) are impressively successful in predicting text, it appears that they must have access to a ‘world model’ representing causal and definitional structure. However, the dominant formalisms of modern causal inference – Judea Pearl’s interventionist approach and the Neyman-Rubin potential outcomes framework – struggle to illuminate how LLMs learn causal structure. I resolve this puzzle by arguing that LLMs employ a specific inductive approach based on a difference-making logic – sometimes called variational induction. I demonstrate how central aspects of this logic are realized during training, where LLMs require enormous amounts of text data from a wide range of contexts to identify difference- and indifference-makers within word sequences. Furthermore, I analyze specific architectural features of LLMs – such as token embeddings and self-attention – to determine their roles in variational induction. The difference-making logic of LLMs fundamentally parallels the experimental method, where causal relations are derived by systematically varying individual circumstances to determine their influence on a phenomenon.

[NLP-86] Knowledge-Graph Grounding Helps LLMs Only for Out-of-Training Knowledge: A Controlled Study on Clinical Question Answering

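[Quick Read]: This paper asks whether frontier LLMs used in medicine actually benefit from structured grounding in an external knowledge graph (KG), and under what conditions grounding helps at all. Prior work reports that retrieval can hurt strong models; this study finds that the deciding factor is whether the needed fact lies inside or outside the model's training data. When the facts belong to a public KG such as PrimeKG, neither naive triple retrieval nor an agentic natural-language-to-Cypher loop improves performance on MedQA. Only when the decisive fact lies outside the model's training (private or novel facts, tested with a synthetic counterfactual KG and a hybrid benchmark mixing known and novel facts) does KG grounding lift accuracy from chance to roughly 100%. Grounding therefore pays off only for out-of-training knowledge, which matches the original study's caveat about institutional data.

Link: https://arxiv.org/abs/2606.22419
Authors: Madhulatha Mandarapu, Sandeep Kunkunuru
Institutions: Samyama
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: 9 pages. Code: this https URL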

Abstract:A recent Nature Medicine study reports that general-purpose frontier LLMs outperform specialized retrieval-augmented clinical tools on medical benchmarks, and that retrieval can hurt strong models. We ask the natural follow-up: does structured knowledge-graph (KG) grounding change this, and when does grounding help at all? We contribute two results. First, a reproduction: the study’s headline HealthBench score (~88) is the Consensus variant, not full HealthBench, where frontier models and ideal completions both score ~46-47 under a physician-calibrated grader (agreement 82.5%); we reproduce GPT-5.2 Consensus =90.9 and flag a score-deflating grader bug. Second, a knowledge-boundary result. Using a graph+vector engine (samyama-graph) over the public biomedical KG PrimeKG, neither naive triple retrieval nor an agentic natural-language-to-Cypher loop (82% successful queries) improves MedQA across a weak-to-strong model ladder (all |Delta| = 3.4). On a synthetic counterfactual KG, and on a hybrid benchmark mixing known and novel facts, the identical pipeline lifts out-of-training accuracy from chance to ~100% (+68 to +79) while adding nothing on known facts (a no-LLM arm answers both). Across three regimes (no-knowledge, graph-aided, hybrid), grounding helps only insofar as the decisive fact lies outside the model’s training – public-KG facts are redundant, private and novel data are where it pays – matching the study’s institutional-data caveat.

[NLP-87] Reinforcement learning to improve large language model-based automated code compliance systems

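[Quick Read]: This paper addresses the incorrect and hallucinated computer-processable rules that LLM-based automated code compliance (ACC) systems for building regulations are prone to generating. The core challenge is improving the accuracy and reliability of the generated intermediate representations, which take the form of high-level code skeletons. The proposed solution is P4IR, a two-stage framework: supervised fine-tuning (SFT) first instills domain knowledge in the LLM, and Group Relative Policy Optimization (GRPO) then improves the accuracy of the generated representations. Relative to the SFT baselines, the framework reduces tree edit distance by up to 23.8% and token-level Levenshtein distance by up to 38.6%. In a zero-shot setting it outperforms Claude Opus and Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 (each evaluated via few-shot prompting) in both code structure and semantics, and the GRPO stage yields a small yet statistically significant reduction in false positives. By combining SFT with GRPO to optimize directly for domain-specific objectives, the approach offers a path toward more accurate and reliable LLM-based ACC systems.

Link: https://arxiv.org/abs/2606.22402
Authors: Jack Wei Lun Shi, Minghao Dang, Wawan Solihin, Leong Hien Poh, Justin K.W. Yeoh
Institutions: National University of Singapore; Harbin Institute of Technology; NovaCITYNETS
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 22 pages, 12 figures, 1 table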

Abstract:Large language model (LLM)-based approaches for automated code compliance (ACC) of building regulations are prone to generating incorrect and hallucinated computer-processable rules. This paper introduces P4IR, a two-stage framework that uses supervised fine-tuning (SFT) to instill domain knowledge in an LLM, followed by Group Relative Policy Optimization (GRPO) to improve the accuracy of the generated intermediate representations in the form of high-level code skeletons. The framework achieved reductions of up to 23.8% and 38.6% in tree edit distance and token-level Levenshtein distance respectively, relative to the SFT baselines. Comparative analysis demonstrates that this approach in a zero-shot setting outperforms leading LLMs in both code structure and semantics, specifically Claude Opus and Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7, evaluated via few-shot prompting. Additionally, the GRPO stage produced a small yet statistically significant reduction in false positives. By combining SFT with GRPO to optimize directly for domain-specific objectives, this approach offers a path toward more accurate and reliable LLM-based ACC systems.
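Token-level Levenshtein distance, one of the two metrics reported above, counts the minimum number of token insertions, deletions, and substitutions needed to turn one token sequence into another. The sketch below uses the standard dynamic-programming formulation; the two rule strings and the whitespace tokenization are hypothetical examples, not the paper's data or tokenizer.

```python
def token_levenshtein(reference, candidate):
    """Minimum number of token insertions, deletions, and substitutions."""
    previous = list(range(len(candidate) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        current = [i]
        for j, cand_tok in enumerate(candidate, start=1):
            cost = 0 if ref_tok == cand_tok else 1
            current.append(min(
                previous[j] + 1,         # delete ref_tok
                current[j - 1] + 1,      # insert cand_tok
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


# Hypothetical code skeletons, split on whitespace for illustration.
reference = "if door.width >= 850 : pass".split()
candidate = "if door.height >= 850 : fail".split()
distance = token_levenshtein(reference, candidate)
```

Here the candidate differs from the reference in two tokens, so the distance is 2. A lower distance to the reference skeleton means the generated rule needs fewer token edits to become correct.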

[NLP-88] PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

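[Quick Read]: This paper addresses the challenge of long-horizon planning for LLM agents in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments, and where existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. It introduces PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. A key feature is an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that planning over massive tool sets remains hard: GPT-5.4 reaches 51.90% accuracy in block-free settings but collapses to 11.36% under the most severe blocking condition. Agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. The benchmark thus serves as a testbed for diagnosing agentic planning failures and highlights the need for robust adaptive planning in large, imperfect tool environments.

Link: https://arxiv.org/abs/2606.22388
Authors: Jiayu Liu, Qihan Lin, Cheng Qian, Rui Wang, Emre Can Acikgoz, Xiaocheng Yang, Jiateng Liu, Zhenhailong Wang, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür
Institutions: University of Illinois Urbana-Champaign
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: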

Abstract:LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools, invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.

[NLP-89] First-Token Broadcasters: Mechanistic Origins of Language Identity and Distributed Robustness in Transformers EMNLP2026

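[Quick Read]: This paper asks why multilingual language models sometimes generate in the wrong language and why this is so hard to fix. It introduces Language Identity Head Ablation (LIHA), a causal intervention that zeros each attention head individually and measures the resulting language switch rate (SR) on a parallel dataset of 2,700 prompt-language pairs spanning seven languages. In GPT-2, LIHA identifies a small set of "first-token broadcaster" heads, led by L6H1, that attend persistently to the first prompt token and propagate its language signal throughout generation. When heads are ablated, compensatory redistribution is statistically significant (p < 10^-5) and follows a directional, hierarchical pattern: compensation always recruits heads in layers above the ablated head, suggesting a feedforward cascade rather than global diffusion. A controlled comparison of two models with identical architecture and size but different training (Qwen2.5-1.5B-Base and Qwen2.5-1.5B-Instruct) shows that the instruction-tuned model concentrates causal influence sharply at layer 0, led by L0H5 (SR = 0.224, 8.93σ above the mean), which provides direct causal evidence that instruction tuning reorganizes language identity circuits toward early-layer localization. Extended experiments with Chinese and Russian confirm that first-token broadcasting is script-specific in GPT-2, with non-Latin languages handled at layer 0, the same locus as in the instruction-tuned model.

Link: https://arxiv.org/abs/2606.22361
Authors: Arjun Pillai, Christian Hoang, Anjelo Jann Laroza
Institutions: Irvington High School; GenAI4E; Mapua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review at BlackboxNLP (EMNLP 2026)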

Abstract:Why do multilingual language models sometimes generate in the wrong language, and why is this so hard to fix? We introduce Language Identity Head Ablation (LIHA), a causal intervention that zeros each attention head individually and measures the resulting language switch rate across a parallel dataset of 2,700 prompt-language pairs spanning seven languages. Applied to GPT-2, LIHA identifies a small set of first-token broadcaster heads - led by L6H1 (switch rate 0.32, 3.23σ above the population mean) - that attend persistently to the first prompt token, propagating its language signal throughout generation. Compensatory redistribution when heads are ablated is statistically significant (p < 10^-5) and follows a directional, hierarchical pattern: compensation always recruits heads in layers above the ablated head, suggesting a feedforward cascade rather than global diffusion. To probe how training regime shapes these circuits, we apply LIHA to a controlled pair - Qwen2.5-1.5B-Base and Qwen2.5-1.5B-Instruct - identical in architecture and size, differing only in training. The base model is nearly flat (max SR=0.016, 200/336 heads at SR=0.0); the instruct model concentrates causal influence sharply at layer 0, led by L0H5 (SR=0.224, 8.93σ above the mean), with all other layers near zero. This controlled comparison provides direct causal evidence that instruction tuning reorganizes language identity circuits toward early-layer localization. Extended experiments with Chinese and Russian confirm that first-token broadcasting is script-specific in GPT-2, with non-Latin languages handled at layer 0 - the same locus as the instruction-tuned model. Code and data will be released upon publication.
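The two quantities quoted in the abstract, the switch rate after ablating a head and how many standard deviations that rate lies above the population mean, can be computed as in the sketch below. This is an illustration under assumptions: the function names are hypothetical, the language labels and per-head rates are toy values, and the use of the population standard deviation is a guess, since the abstract does not say which estimator the authors use.

```python
import statistics


def switch_rate(expected_langs, generated_langs):
    """Fraction of prompts whose generated language differs from the expected one."""
    switches = sum(e != g for e, g in zip(expected_langs, generated_langs))
    return switches / len(expected_langs)


def sigmas_above_mean(head_rates, head):
    """How many population standard deviations a head's rate lies above the mean."""
    values = list(head_rates.values())
    return (head_rates[head] - statistics.mean(values)) / statistics.pstdev(values)


# Toy outputs after ablating one hypothetical head.
expected = ["fr", "fr", "de", "es"]
generated = ["fr", "en", "de", "en"]
rate = switch_rate(expected, generated)

# Toy per-head switch rates; head names and values are made up.
head_rates = {"L0H0": 0.0, "L0H1": 0.0, "L0H2": 0.0, "L0H3": 0.4}
z = sigmas_above_mean(head_rates, "L0H3")
```

A head whose ablation flips many outputs to another language gets a high switch rate, and a large value from `sigmas_above_mean` marks it as an outlier relative to the other heads.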

[NLP-90] ORBIT: Training-Free Multi-Attribute Behavioral Steering via Orthogonal Subspace Rotation

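[Quick Read]: This paper addresses the problem of controlling several behavioral attributes at once in language-model assistants. Existing training-free activation steering methods focus mainly on one attribute at a time; naively summing per-attribute steering vectors suffers from norm imbalance and directional cancellation, while classifier-based approaches must be retrained whenever the attribute set changes. The proposed ORBIT (Orthogonal Rotation-Based Intervention Technique) constructs a joint subspace from the per-attribute steering planes via singular value decomposition (SVD) and applies a single norm-preserving rotation within that subspace toward a combined target direction. Adaptive per-token gating identifies which attributes need correction at each position, and an optional additive boost strengthens attributes with weak initial projection. The paper also introduces TraitFactory, a multi-attribute benchmark that focuses on behavioral tendencies rather than surface-level style. On TraitFactory and ToneBank across three models (Llama-3.2-3B, Qwen-2.5-7B, Llama-3.1-8B), ORBIT achieves stronger and more balanced multi-attribute steering than existing training-free baselines while better preserving output coherence.

Link: https://arxiv.org/abs/2606.22357
Authors: Narges Ghasemi, Amir Ziashahabi, Salman Avestimehr, Jonathan May
Institutions: Information Sciences Institute, University of Southern California; Department of Electrical and Computer Engineering, University of Southern California
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: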

Abstract:Language models are widely used in assistant settings, where controlling behavioral attributes is often essential. Activation steering modifies hidden-state representations at inference time, providing a lightweight, training-free mechanism that can be toggled at runtime. Existing methods, however, have focused primarily on steering a single attribute at a time. When multiple attributes must be controlled simultaneously, naive summation of per-attribute steering vectors suffers from norm imbalance and directional cancellation, while classifier-based approaches require retraining whenever the attribute set changes. We introduce ORBIT (Orthogonal Rotation-Based Intervention Technique), a training-free extension of rotation-based steering to the multi-attribute setting. Our method constructs a joint subspace from per-attribute steering planes via singular value decomposition and applies a single norm-preserving rotation within that subspace toward a combined target direction. Adaptive per-token gating identifies which attributes need correction at each position, and an optional additive boost strengthens attributes with weak initial projection. We also introduce TraitFactory, a new multi-attribute benchmark that focuses on behavioral tendencies rather than surface-level style. We evaluate ORBIT on TraitFactory and ToneBank across three models (Llama-3.2-3B, Qwen-2.5-7B, Llama-3.1-8B) while steering multiple attributes simultaneously, showing that it achieves stronger and more balanced multi-attribute steering than existing training-free baselines while better preserving output coherence.
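A norm-preserving rotation toward a target direction can be sketched in a few lines. The example below handles only the simplest case, a single 2-D plane spanned by the hidden state and one target direction. It does not build the joint multi-attribute subspace via SVD that the abstract describes, and the function name, vectors, and angle are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def rotate_toward(hidden, target, angle):
    """Rotate `hidden` by `angle` radians toward `target`, keeping its norm.

    The rotation happens in the 2-D plane spanned by `hidden` and `target`.
    """
    h_norm = norm(hidden)
    u = [x / h_norm for x in hidden]
    # Gram-Schmidt step: keep the part of target that is orthogonal to u.
    proj = dot(target, u)
    residual = [t - proj * x for t, x in zip(target, u)]
    r_norm = norm(residual)
    if r_norm < 1e-12:  # target is parallel to hidden, so no plane is defined
        return list(hidden)
    v = [x / r_norm for x in residual]
    return [h_norm * (math.cos(angle) * a + math.sin(angle) * b) for a, b in zip(u, v)]


hidden = [3.0, 0.0, 4.0]
target = [0.0, 1.0, 0.0]
steered = rotate_toward(hidden, target, math.pi / 6)
```

Unlike adding a steering vector, the rotation leaves the length of the hidden state unchanged while increasing its projection onto the target, which is the property that avoids the norm imbalance described above.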

[NLP-91] How Does Research Evolve? Tracing Cross-Domain Trajectories in NLP ML and CV with Claim-Grounded Typed Citations

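[Quick Read]: This paper addresses the difficulty of modeling and forecasting how research evolves. Existing citation graphs collapse the varied roles a citation can play into a single homogeneous edge type. The paper proposes SciTraj, the first claim-grounded typed citation graph, in which each edge is linked to the specific claim sentence that motivates it, capturing roles such as extending prior methods, addressing known limitations, realizing proposed future directions, and disputing earlier claims. Four claim-driven relations are verified by natural language inference (NLI) entailment against in-paper context, while two similarity-only relations are gated by abstract cosine similarity and year-gap rules. SciTraj contains 32,559 papers from NLP, ML, and Vision (2015–2024), connected by 573,126 directed edges across six relation types. It also contains 287M typed trajectories of length ≥ 3, covering 72.8% of papers, and supports a temporally split typed link-prediction benchmark. Using SciTraj, the authors identify disciplinary siloing in typed citation flow and topic emergence concentrated in Vision and LLM-related work. A year-shuffle falsifiability test separates temporal structure from year-correlated content, and a 3-annotator pilot reports κ = 0.74 with 79.9% precision.

Link: https://arxiv.org/abs/2606.22342
Authors: Abdul Muntakim, Md Abdullah Al Hafiz Khan, Sadid Hasan, Yong Pei
Institutions: Kennesaw State University; Microsoft
Categories: Computation and Language (cs.CL)
Comments: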

Abstract:How does research evolve, and what substrate would let us forecast where it goes next? Scientific progress is not simply a uniform accumulation of facts: ideas extend prior methods, address known limitations, realize proposed future directions, and sometimes dispute earlier claims. Existing citation graphs usually collapse these roles into a single homogeneous edge type, limiting how we can analyze scientific progress. We address this gap by proposing the SciTraj corpus, the first claim-grounded typed citation graph in which each edge is linked to the specific claim sentence that motivates it. Claim-bearing sentences are extracted from paper sections; four claim-driven relations are verified by NLI entailment against in-paper context, while two similarity-only relations are gated by abstract cosine and year-gap rules. SciTraj contains 32,559 papers from NLP, ML, and Vision (2015–2024), connected by 573,126 directed edges across six relation types, with NLI-verified claim seeds. Using SciTraj, we identify disciplinary siloing in typed citation flow and topic emergence concentrated in Vision and LLM-related work. The corpus also contains 287M typed trajectories of length ≥ 3, covering 72.8% of papers, and supports a temporally split typed link-prediction benchmark. A year-shuffle falsifiability test separates temporal structure from year-correlated content, and a 3-annotator pilot reports κ = 0.74 with 79.9% precision.

[NLP-92] BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

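[Quick Read]: This paper addresses systematic biases in LLM-as-a-judge evaluation for NLP: position bias, verbosity bias, order inconsistency, and cross-lingual degradation in lower-resource languages. Raw accuracy hides these failure modes. The key idea is gold-labelling by degradation: starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction, so the biases can be measured without human preference labels. The method is packaged in the open-source framework BabelJudge, which can audit any judge model and extends to agentic evaluation via nine trajectory-level perturbations and three new metrics (tool accuracy, hallucination detection rate, and trajectory-length bias). Evaluating Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili, the authors find that the composite bias-penalised reliability score drops from 0.714 in Hindi to 0.550 in Swahili, and that Swahili order consistency collapses to 0.480, meaning verdicts are near-random under slot-order swaps, a failure mode invisible to accuracy alone. BabelJudge is released as a Python package supporting 11 judge backends.

Link: https://arxiv.org/abs/2606.22329
Authors: Shreyas KC
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures. Source code, benchmark toolkit, and reproduction scripts at this https URL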

Abstract:LLM-as-a-judge has become the dominant approach to scalable evaluation in NLP pipelines, yet judges themselves carry systematic biases that raw accuracy hides: they favor responses placed in slot A (position bias), they prefer longer responses regardless of quality (verbosity bias), and their reliability degrades sharply in lower-resource languages. We introduce BabelJudge, an open-source benchmark and reliability audit framework that measures all four failure modes – position bias, verbosity bias, order inconsistency, and cross-lingual degradation – on any judge model, without requiring human preference labels. The key insight is gold-labelling by degradation: starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction, eliminating annotation cost. We evaluate Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili and find that our composite bias-penalised reliability score drops from 0.714 in Hindi to 0.550 in Swahili, a gap that raw accuracy (0.835 vs. 0.660) understates. Swahili order consistency collapses to 0.480, meaning judge verdicts are near-random under slot-order swaps – a failure mode invisible to accuracy alone. We further extend the framework to agentic evaluation via nine trajectory-level perturbations (argument corruption, tool swaps, hallucinated calls, missing steps) and three new metrics: tool accuracy, hallucination detection rate, and trajectory-length bias. BabelJudge is released as a Python package supporting 11 judge backends. Code: this https URL
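Gold-labelling by degradation and an order-consistency check can be sketched as follows. This does not use the BabelJudge package or its API. The perturbation, the toy judge, the `audit` function, and the exact metric definitions are hypothetical stand-ins chosen to show why a judge that always picks slot A looks passable on accuracy but fails on consistency.

```python
def degrade(reference):
    """Toy perturbation: keep only the first half of the reference answer."""
    words = reference.split()
    return " ".join(words[: max(1, len(words) // 2)])


def slot_a_judge(response_a, response_b):
    """Toy judge with maximal position bias: it always picks slot A."""
    return "A"


def audit(judge, references):
    correct = consistent = 0
    for ref in references:
        bad = degrade(ref)
        first = judge(ref, bad)    # gold answer sits in slot A
        second = judge(bad, ref)   # gold answer sits in slot B
        correct += (first == "A") + (second == "B")
        # Consistent means the same underlying response wins in both orders.
        consistent += (first == "A") == (second == "B")
    n = len(references)
    return {"accuracy": correct / (2 * n), "order_consistency": consistent / n}


references = [
    "Paris is the capital of France",
    "Water boils at 100 degrees Celsius at sea level",
]
report = audit(slot_a_judge, references)
```

Because the degraded response is worse by construction, no human label is needed to score the judge. The slot-A judge is right exactly half the time, yet it never gives the same verdict when the two responses swap places.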

[NLP-93] Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning

[Quick Read]: This paper addresses a limitation of reinforcement learning (RL) post-training for Large Language Models (LLMs): the common reliance on uniform data sampling, which ignores both the semantic structure of the training data and the changing capability of the training policy. It proposes Adaptive Data Scheduling (ADS), a dual-level data scheduling framework. At the cluster level, ADS organizes samples by semantic pattern and maintains an adaptive inter-cluster sampling distribution to solidify current training progress. At the sample level, it performs intra-cluster scheduling that continuously selects policy-boundary samples, which provide informative relative advantages. Across three LLMs and seven reasoning benchmarks, ADS improves average accuracy by 5.2% over Group Relative Policy Optimization (GRPO), and it consistently improves RL methods with different objective designs, highlighting its potential as a general data scheduling strategy for LLM RL post-training.

Link: https://arxiv.org/abs/2606.22305
Authors: Zicheng Xu, Ruixuan Zhang, Yu-Neng Chuang, Xiuyi Lou, Hoang Anh Duy Le, Oren Gal, Alexander S. Szalay, Zhaozhuo Xu, Guanchu Wang, Vladimir Braverman
Institutions: Johns Hopkins University; Rice University; University of Haifa; Workato; University of North Carolina at Charlotte
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) achieve remarkable reasoning capabilities through reinforcement learning (RL) post-training. However, existing RL post-training commonly relies on uniform data sampling, which ignores the semantic structure of the training data and the changing capability of the training policy. To address these limitations, we propose Adaptive Data Scheduling (ADS), a dual-level data scheduling framework for pacing RL post-training that replaces uniform sampling with an adaptive distribution over semantic clusters and policy-boundary sample selection. At the cluster level, ADS organizes samples according to semantic patterns and maintains an adaptive inter-cluster distribution to solidify current training progress. At the sample level, ADS performs intra-cluster scheduling to continuously sample policy-boundary samples, which provides informative relative advantages. Experimental results across three LLMs and seven reasoning benchmarks demonstrate that ADS improves average accuracy by 5.2% over Group Relative Policy Optimization (GRPO). Notably, ADS consistently improves RL methods with different objective designs, highlighting its potential as a general data scheduling strategy for LLM RL post-training. The source code is available at: this https URL.
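One way to see why policy-boundary samples are informative is to look at group-relative advantages directly. The sketch below is an illustration only, not the ADS algorithm: it uses a simplified GRPO-style standardization without the small epsilon that practical implementations often add, and `boundary_score` is a hypothetical scoring rule introduced here for the example.

```python
import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantages: rewards standardised within one prompt's rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no relative signal
    return [(r - mean) / std for r in rewards]


def boundary_score(rewards):
    """Hypothetical score: highest when the pass rate is 0.5, zero at 0 or 1."""
    p = sum(rewards) / len(rewards)
    return p * (1 - p)


# Toy binary rewards for four rollouts per prompt; prompt names are made up.
rollouts = {
    "too_easy": [1, 1, 1, 1],
    "boundary": [1, 0, 1, 0],
    "too_hard": [0, 0, 0, 0],
}
ranked = sorted(rollouts, key=lambda name: boundary_score(rollouts[name]), reverse=True)
```

Prompts that the policy always solves or always fails produce all-zero advantages and therefore no gradient signal in this simplified setting, while the prompt with mixed outcomes ranks first.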

[NLP-94] From Speech to Text Corpora: Evaluating ASR-Based Data Acquisition for Low-Resource Fongbe and Hausa

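[Quick Read]: This paper addresses the lack of text corpora for training language models in low-resource African languages. It investigates whether automatic speech recognition (ASR) pipelines can extend text resources for two typologically distinct West African languages: Fongbe (tonal, diacritic-rich) and Hausa (non-tonal). For Fongbe, the authors fine-tune MMS-300M on a curated 12.3-hour dataset, achieving a 9.48% word error rate (WER) on the ALFFA benchmark, a 78% relative reduction from the prior 44.04% baseline, while preserving the tonal diacritics critical to the language. For Hausa, they apply an existing fine-tuned Whisper-Small model. They catalog 1,553 YouTube videos (236 hours) and process a subset of 424 videos (45.49 hours), producing 6,770 transcribed segments. Human evaluation on 50 randomly sampled segments per language gives mean quality scores of 57.4/100 for Hausa and 36.5/100 for Fongbe: Hausa transcriptions approach acceptable quality for corpus construction, whereas Fongbe transcriptions require post-processing or improved models for production use. The curated dataset, fine-tuned model, transcribed corpus, and full video catalog are released following platform terms and ethical guidelines.

Link: https://arxiv.org/abs/2606.22274
Authors: Mahounan Pericles Adjovi, Victor Olufemi, Roald Eiselen, Prasenjit Mitra
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure, 4 tables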

Abstract:Low-resource African languages lack text corpora needed for language model training. We investigate whether ASR pipelines can extend text resources for two typologically distinct West African languages: Fongbe (tonal, diacritic-rich) and Hausa (non-tonal). We fine-tune MMS-300M on a curated 12.3-hour Fongbe dataset, achieving 9.48% WER on the ALFFA benchmark - a 78% relative reduction from the prior 44.04% baseline - while preserving tonal diacritics critical to the language. For Hausa, we apply an existing fine-tuned Whisper-Small model. We catalog 1,553 YouTube videos (236 hours) and process a subset of 424 videos (45.49 hours) selected to balance domain diversity with available computational resources, producing 6,770 transcribed segments. Human evaluation on 50 randomly sampled segments per language shows mean quality scores of 57.4/100 for Hausa and 36.5/100 for Fongbe, indicating that while Hausa transcriptions approach acceptable quality for corpus construction, Fongbe transcriptions require post-processing or improved models for production use. We release the curated dataset, fine-tuned model, transcribed corpus, and full video catalog following platform terms and ethical guidelines.
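
The "78% relative reduction" can be checked directly from the two WER figures quoted in the abstract. The helper name below is illustrative; only the two numbers come from the source.

```python
def relative_reduction(baseline, new):
    """Fraction of the baseline error that was removed."""
    return (baseline - new) / baseline

# WER values quoted in the abstract (percent).
wer_drop = relative_reduction(44.04, 9.48)
print(f"{wer_drop:.1%}")  # prints 78.5%
```

The absolute drop is 34.56 WER points, so the abstract's 78% figure is the relative one, rounded down.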

[NLP-95] MixedPEFT: Combining Multiple PEFT Methods with Mixed Objectives for Unsupervised Domain Adaptation

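【Quick read】: This paper addresses the difficulty of applying pre-trained language models to new domains, where full fine-tuning is computationally expensive and prone to catastrophic forgetting. It proposes a parameter-efficient fine-tuning (PEFT) strategy for unsupervised domain adaptation (UDA). The key design is a custom union of invertible adapters and Low-Rank Adaptation (LoRA) inside one parameter-efficient framework, trained jointly on two objectives: classification on labeled source-domain data and masked language modeling (MLM) on unlabeled target-domain data. This joint optimization preserves target-domain knowledge while adapting to the source-domain task. On the Multi-Genre Natural Language Inference (MNLI) dataset across 20 domain shifts, the method uses only 7% of the model's trainable parameters and improves by 1.41 percentage points over UDapter (the parameter-efficient state of the art), 1.26 points over the fully tuned DANN baseline, and 0.86 points over DSN.

Link: https://arxiv.org/abs/2606.22272
Authors: Mohammed Rawhani,Dervis Karaboga,Ozkan Ufuk Nalbantoglu,Alper Basturk,Bahriye Akay
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 tables. Builds upon our preliminary work presented at UBMK 2024

Click to view abstract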

Abstract:Pre-trained language models struggle when applied to new domains, as full fine-tuning is computationally expensive and prone to catastrophic forgetting. This study addresses this challenge by presenting a novel parameter-efficient strategy for unsupervised domain adaptation that combines custom PEFT architectures with mixed-objective training. Our approach simultaneously optimizes classification performance on labeled source domain data and masked language modeling (MLM) on unlabeled target domain data, preserving target domain knowledge while adapting to source domain tasks. Our method employs a custom union of invertible adapters and Low-Rank Adaptation (LoRA) within a unified parameter-efficient framework. Through comprehensive evaluation on the Multi-Genre Natural Language Inference (MNLI) dataset across 20 domain shifts, our approach achieves significant improvements over existing methods: 1.41 percentage points over the current parameter-efficient state-of-the-art UDapter, 1.26 percentage points over the fully-tuned DANN baseline, and 0.86 percentage points over DSN, while utilizing only 7% of the model’s trainable parameters. These results establish new benchmarks for parameter-efficient unsupervised domain adaptation and demonstrate that carefully designed PEFT combinations with concurrent optimization can outperform both existing parameter-efficient methods and traditional fully-tuned approaches.

[NLP-96] Evaluating Large Language Models for Hausa and Fongbe Machine Translation: Benchmarks Failures and Metric Reliability

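【Quick read】: This paper evaluates how well current large language models (LLMs) translate from English into two low-resource West African languages, Hausa (Afroasiatic) and Fongbe (Niger-Congo), and whether standard automatic metrics reliably reflect native-speaker judgment. The core difficulties are that translation quality differs sharply between the two languages and that the validity of automatic metrics in this setting is uncertain. The study evaluates at progressive scales (500 to 10,000 sentences), checks the agreement of BLEU, chrF++, TER, COMET, and BERTScore with human ratings, and shows that neural metrics such as BERTScore exhibit embedding collapse in both languages, which limits their ability to separate good translations from bad ones. It also finds that performance on one low-resource African language does not predict performance on another, and that at least 2,500 sentences are needed for stable system rankings, since smaller samples produced artifact findings that reversed at scale. The authors recommend multi-metric evaluation, with particular caution when interpreting neural metrics.

Link: https://arxiv.org/abs/2606.22269
Authors: Mahounan Pericles Adjovi,Roald Eiselen,Prasenjit Mitra
Affiliations: Carnegie Mellon University Africa, Kigali, Rwanda; North-West University, South Africa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 10 tables

Click to view abstract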

Abstract:We investigate the translation quality of current large language models (LLMs) for English-to-Hausa and English-to-Fongbe - two typologically distinct West African languages from the Afroasiatic and Niger-Congo families respectively - and evaluate whether standard automatic metrics reliably reflect human judgment for these low-resource languages. We evaluate four models (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, and Qwen2.5-7B) at progressive scales (500 to 10,000 sentences) using automatic metrics (BLEU, chrF++, TER, COMET, BERTScore) validated against native-speaker judgment. Our results reveal three key findings. First, translation quality varies substantially by language: Hausa achieves acceptable quality (human scores 4.0-4.5/5) while Fongbe achieves poor quality (1.0-2.2/5), with a consistent 3x BLEU gap across all systems. Second, model rankings differ by language - Gemini leads for Fongbe while GPT-4o leads for Hausa by human evaluation - indicating that performance on one low-resource African language does not predict performance on another. Third, metric-human correlation varies dramatically: perfect rank correlation for Fongbe (rho=1.0) but weak correlation for Hausa (rho=0.5), where human evaluators preferred GPT-4o despite all automatic metrics ranking Claude first. We further show that neural metrics like BERTScore exhibit embedding collapse (within-language similarity 0.99) for both languages, limiting their ability to differentiate translation quality. Based on these findings, we recommend multi-metric evaluation for low-resource African languages, with particular caution when interpreting neural metrics. We establish that minimum sample sizes of n=2,500 sentences are required for stable system rankings, as smaller samples produced artifact findings that reversed at scale.

[NLP-97] SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models

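【Quick read】: This technical report studies the substantial forgetting that autoregressive Transformer decoders often show under sequential fine-tuning on shifting curriculum distributions, focusing on how well earlier-stage behavior is retained. It evaluates SamatNext v0.2-B, an experimental 356M-parameter hybrid sequence decoder that alternates Differential-Attention-style layers with DeltaNet-inspired simplified linear-state mixer layers, using RMS normalization and output scale calibration. In a controlled staged Python code curriculum, SamatNext v0.2-B reaches a 100.0% pass rate on the Stage 5 holdout while retaining 98.8% of adjacent Stage 3 semantic behavior and scoring 12.0% on the Stage 2E early syntax holdout. The strongest parameter-matched Transformer baseline reaches 97.6% on Stage 5 but retains only 6.0% of Stage 3 behavior. Both architectures remain weak on long-horizon early-stage retention, so the result indicates an altered retention/plasticity tradeoff in this controlled setting rather than a general solution to catastrophic forgetting.

Link: https://arxiv.org/abs/2606.22248
Authors: Samat Zharassov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 12 pages, 3 tables. Technical report. Code and reproducibility artifacts: this https URL

Click to view abstract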

Abstract:Standard autoregressive Transformer decoders can often exhibit substantial forgetting under sequential fine-tuning on shifting curriculum distributions. This technical report evaluates SamatNext v0.2-B, an experimental 356M-parameter hybrid sequence decoder that alternates Differential-Attention-style layers with DeltaNet-inspired simplified linear-state mixer layers using RMS normalization and output scale calibration. We study the model under a controlled staged Python code curriculum and compare it with a parameter-matched Transformer baseline. In this setting, SamatNext v0.2-B achieves a 100.0% pass rate on the controlled Stage 5 holdout while retaining 98.8% of adjacent Stage 3 semantic behavior and reaching 12.0% on the Stage 2E early syntax holdout. The strongest Transformer baseline reaches 97.6% on Stage 5 but retains only 6.0% of Stage 3 behavior. Both architectures remain weak on long-horizon early-stage retention, so the result should be interpreted as evidence of an altered retention/plasticity tradeoff in this controlled setting, not as a general solution to catastrophic forgetting. Code, model specifications, evaluation scripts, and result tables are provided for independent verification.
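
For readers unfamiliar with the RMS normalization mentioned above, here is a minimal sketch of the standard operation: each vector is divided by its root mean square and then scaled by a learned gain. Where SamatNext v0.2-B applies it and how its output scale calibration works are not described in the abstract, so the function below is a generic illustration with assumed names and an assumed `eps`.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Scale x so its root mean square is about 1, then apply a per-element gain."""
    if gain is None:
        gain = [1.0] * len(x)  # assumed default: identity gain
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

y = rms_norm([3.0, 4.0])
# The mean of the squared outputs is approximately 1 after normalization.
mean_square = sum(v * v for v in y) / len(y)
```

Unlike layer normalization, this variant does not subtract the mean, so the direction of the input vector is preserved.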

[NLP-98] Lexical Consensus: Grounded Word Learning and Shared Meaning in Artificial Agents

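【Quick read】: This paper asks whether an artificial agent can acquire, stabilize, and use new lexical meanings from grounded experience when its representations cannot adapt. It introduces Lexical Consensus, an experimental framework that uses frozen DINOv2 visual embeddings as a structured perceptual substrate, together with Carroll-style nonce words and interpretable lexical learners, to test word learning across concepts of varying coherence. The main finding is a perceptual-coherence gradient: native categories are easiest to learn, coherent overextensions remain learnable, mid-range disjunctive concepts degrade, and far-disjunctive concepts approach chance. A pre-registered CIFAR-100 dissociation experiment shows that the gradient is driven by perceptual distance (partial R^2 = 0.245, p < 1e-7) rather than semantic relatedness (partial R^2 = 0.002, p = 0.660). Bidirectional evaluation further shows that naming and retrieval are distinct: exemplar-based mechanisms outperform centroid prototypes in label-to-image retrieval, exposing a memory-fidelity dimension separate from naming accuracy. Falsification controls, homogeneous candidate-pool evaluations, and null results on representational restructuring indicate that frozen perceptual geometry both enables lexical grounding and limits what can be acquired without representational adaptation.

Link: https://arxiv.org/abs/2606.22207
Authors: Patricio M. Vera
Affiliations: Neurocreaciones
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 41 pages, 12 figures, 9 tables. Code and experiment artifacts available at this https URL

Click to view abstract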

Abstract:Artificial intelligence systems are commonly evaluated through task performance and behavioral imitation, but such evaluations leave open whether an artificial agent can acquire, stabilize, and use new lexical meanings from grounded experience. This paper introduces Lexical Consensus, an experimental framework for studying grounded word learning over a structured perceptual substrate. Using frozen DINOv2 visual embeddings, Carroll-style nonce words, and interpretable lexical learners plus linear baselines, we test whether agents can acquire artificial labels for visual concepts, generalize them bidirectionally, and stabilize them across controlled settings. The main result is a robust perceptual-coherence gradient: native categories are easiest to learn, coherent overextensions remain learnable, mid-range disjunctive concepts degrade, and far-disjunctive concepts approach chance. A pre-registered CIFAR-100 dissociation experiment confirms that this gradient is governed by perceptual distance rather than semantic relatedness: perceptual distance predicts acquisition accuracy (partial R^2 = 0.245, p < 1e-7), while semantic distance adds no significant explanatory power (partial R^2 = 0.002, p = 0.660). Bidirectional evaluation shows that naming and retrieval are distinct: exemplar-based mechanisms outperform centroid prototypes in label-to-image retrieval, exposing a memory-fidelity dimension separate from naming accuracy. Falsification controls, homogeneous candidate-pool evaluations, and null results on representational restructuring indicate that frozen perceptual geometry both enables lexical grounding and limits what can be acquired without representational adaptation.

[NLP-99] The Score Granularity Gap in Black-Box LLM Classification: A Comparative Study of Confidence Constructions

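【Quick read】: This paper addresses the limited thresholding resolution of confidence scores when large language models (LLMs) are deployed as black-box classifiers. Prior work asks whether LLM confidence is well calibrated or well ranked; a deployment question that has been largely overlooked is how finely the score can be thresholded to support risk control. The authors call the answer the score granularity gap. Comparing seven confidence constructions (a single verbalized number, token probabilities, multi-query aggregation, and others) across 25 model-dataset pairs, they find that single-shot verbalized confidence, even when correctly converted to a class probability, takes only a handful of distinct values, so an operator gets only a few coarse thresholds however well the score ranks. The study reports how each construction affects this gap, at what inference cost, and with what effect on ranking; notably, multi-query aggregation helps weak models but can degrade already-strong ones. These trade-offs are translated into concrete deployment guidance.

Link: https://arxiv.org/abs/2606.22179
Authors: Ao Sun,Tian Sun,Jiaxing Geng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract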

Abstract:Large language models (LLMs) are increasingly deployed as black-box classifiers in pipelines that automate confident decisions and route uncertain ones to human review. Such selective prediction needs a confidence score that an operator can threshold at a chosen risk level. Prior work asks whether LLM confidence is well calibrated or well ranked; we ask a complementary, deployment-oriented question that has been largely overlooked: at what resolution can the score be thresholded? We call the answer the score granularity gap. Through a controlled comparison of seven ways to build a confidence score, from a single verbalized number, to token probabilities, to querying the model many times and combining the answers, across 25 model-dataset pairs (9 LLMs, 3 benchmarks), we find that single-shot verbalized confidence, once correctly converted to a class probability, ranks cases surprisingly well, yet takes only a handful of distinct values. It therefore offers an operator only a few coarse thresholds, no matter how well it ranks. We show which constructions widen this gap, at what inference cost, and with what effect on ranking, notably that multi-query aggregation helps weak models but can degrade already-strong ones. We translate these trade-offs into concrete deployment guidance.
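
A rough way to picture the granularity question is to count how many distinct values a confidence score actually takes, since each distinct value is a possible cut point for an operator. The sketch below uses made-up scores purely for illustration; the function name, the rounding tolerance, and both score lists are assumptions and do not reproduce the paper's measurements or its exact definition of the gap.

```python
def distinct_thresholds(scores, ndigits=6):
    """Return the sorted distinct score values; each is a usable threshold."""
    return sorted({round(s, ndigits) for s in scores})

# Toy data (assumption): verbalized confidences cluster on a few round numbers,
# while token-probability scores spread over many values.
verbalized = [0.9, 0.95, 0.9, 0.8, 0.95, 0.9, 0.8, 0.95]
token_prob = [0.912, 0.957, 0.883, 0.801, 0.964, 0.905, 0.779, 0.941]

coarse = distinct_thresholds(verbalized)  # three distinct values
fine = distinct_thresholds(token_prob)    # eight distinct values
```

With only three distinct values, an operator can pick among at most three operating points, regardless of how well those values rank the cases.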

[NLP-100] π-RAG: Oblivious Retrieval via Semantic Quantization and Transcendental Addressing for Large Language Models

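【Quick read】: This paper targets privacy and reliability problems in conventional Retrieval-Augmented Generation (RAG) when retrieving sensitive data: raw vector embeddings are exposed to potential inversion attacks, and retrieval can fail nondeterministically. It proposes π-RAG, an architecture that uses the digits of π as a source of "transcendental entropy" to build an immutable indirection layer that decouples the large language model (LLM) from the private data store. A Semantic Quantization Layer projects user queries onto a pre-computed manifold of Canonical Intent Centroids; a cryptographic salt maps these centroids to deterministic offsets, and the resulting π-key points to a standardized payload in the datastore. The authors state that this keeps inference oblivious to the data and unifies deterministic randomness, auditability, and differential privacy, with finance and healthcare named as target high-compliance sectors.

Link: https://arxiv.org/abs/2606.22153
Authors: Aniket Wattamwar,Mrunal Kakirwar
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract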

Abstract:This paper introduces π-RAG, a novel architecture for oblivious retrieval that decouples Large Language Models (LLMs) from sensitive data storage without sacrificing semantic understanding. Traditional Retrieval-Augmented Generation (RAG) architectures expose raw vector embeddings to potential inversion attacks and nondeterministic retrieval failures. To address this, we utilize the digits of π as a source of transcendental entropy, creating an immutable indirection layer between the LLM and private records. The value π provides immutability, is uneditable and math governs it. The architecture also introduces a Semantic Quantization Layer. This layer projects user inputs onto a pre-computed manifold of Canonical Intent Centroids. RAG performs vector cosine similarity but here it maps the centroids to deterministic offsets via cryptographic salt. The resulting π-key is a pointer to standardized payload from the actual datastore. By replacing direct access to the datastore via LLM with this transcendental layer, π-RAG mathematically guarantees that the inference remains oblivious to the data. This architecture unifies deterministic randomness, auditability, and differential privacy, demonstrating high efficacy for high-compliance sectors such as finance and healthcare.

[NLP-101] BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences Structures and Language

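【Quick read】: This paper addresses a split in existing biological foundation models between native multimodality and broad entity coverage: models that fuse several modalities are usually confined to one entity type (only proteins or only molecules), while models that span several entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the modalities it can read. It presents BioMatrix, described as the first multimodal foundation model that natively unifies sequences, structures, and natural language for both molecules and proteins. A unified tokenization scheme maps molecular sequences (SMILES and SELFIES), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space, trained in a single decoder-only architecture under one next-token prediction objective, without external encoders, projection adapters, or modality-specific output heads. Built on the Qwen3 language model (1.7B and 4B parameters), BioMatrix is continually pretrained on 304.4 billion tokens covering general and domain text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link entities through molecule-protein and protein-protein interaction data. On 80 downstream tasks in 6 categories, spanning single-entity and multi-entity understanding and generation, it achieves state-of-the-art or competitive performance on 77.

Link: https://arxiv.org/abs/2606.22138
Authors: Qizhi Pei,Zhimeng Zhou,Yi Duan,Yiyang Zhao,Wei Li,Han Guo,Liang He,Chengping Li,Chang-Yu Hsieh,Conghui He,Rui Yan,Lijun Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:

Click to view abstract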

Abstract:We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective – without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories – encompassing single-entity and multi-entity understanding and generation tasks across and within modalities – BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.

[NLP-102] From Recognition to Understanding: Unlocking Cognitive Time Series Reasoning with LLMs

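【Quick read】: This paper addresses the limited gains from applying large language models (LLMs) to time series analysis. It attributes them to a mismatch between existing task formulations and LLM strengths: most settings reduce time series understanding to low-level curve-fitting prediction and ignore the semantic, contextual, and reasoning-intensive nature of real-world temporal data. The paper introduces TSCognition, a multimodal benchmark for multi-dimensional time series reasoning that combines real-world time series and text from 15 public sources into roughly 41K QA samples organized around five cognitive reasoning tasks: Decoding, Grounding, Inferring, Extrapolating, and Acting. It then proposes TSAlign, a unified framework that encodes time series into compact patch-level representations and aligns them with semantic directions in the LLM embedding space via gated residual injection and multivariate alignment. Experiments show that TSAlign outperforms existing LLM, vision-language model (VLM), and time series QA baselines on TSCognition and the public TimerBed benchmark while substantially reducing computational cost.

Link: https://arxiv.org/abs/2606.22126
Authors: Xin Qiu,Junlong Tong,Yao Zhang,Yunpu Ma,Wei Zhang,Xiaoyu Shen
Affiliations: Eastern Institute of Technology; Zhejiang University; Munich Center for Machine Learning, LMU
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract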

Abstract:Time series analysis has recently been coupled with Large Language Models (LLMs) to leverage their reasoning and world knowledge capabilities, yet gains remain limited. We attribute this to a fundamental mismatch between existing task formulations and LLM strengths: most settings reduce time series understanding to curve-fitting systems, focusing on low-level prediction while ignoring the semantic, contextual, and reasoning-intensive nature of real-world temporal data. To address these limitations, we introduce TSCognition, a multimodal benchmark for multi-dimensional time series reasoning. It collects real-world time series and textual information from 15 public sources and constructs approximately 41K QA samples around five cognitive reasoning tasks: Decoding, Grounding, Inferring, Extrapolating, and Acting. Building on this, we further propose TSAlign, a unified framework that encodes time series into compact patch-level representations and aligns them with semantic directions in the LLM embedding space via gated residual injection and multivariate alignment. Experiments show that TSAlign outperforms existing LLM, VLM, and time series QA baselines on TSCognition and the publicly available TimerBed benchmark while substantially reducing computational cost. Code is available at: this https URL

[NLP-103] Plurification in/of language technology – The integration of culture in next-generation AI

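【Quick read】: This paper examines how "culture" can be operationalised in Natural Language Processing (NLP) and what that reveals about the possibilities and limits of accounting for a plurality of cultural backgrounds in technological design. Its central claim is that cultural alignment cannot be achieved just by adding more examples of "other cultures"; it requires plural epistemologies, that is, room for multiple, locally grounded ways of knowing. The paper uses a socio-technical model of language technology (LT) design, the five layers of technological activity model, to collect and systematize approaches to culture in NLP. The analysis shows that although NLP research has moved toward more culturally sensitive systems, many approaches remain partial: they address culture mainly at the level of output or representation and leave deeper questions of power, governance, and social context unresolved. The paper concludes that operationalising culture takes more than technical adaptation and calls for a reflexive, plural socio-technical approach that navigates the potentials and limits of computational formalisation.

Link: https://arxiv.org/abs/2606.22097
Authors: Gertraud Koch,Fausto Giunchiglia
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract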

Abstract:The paper explores how “culture” can be operationalised in Natural Language Processing (NLP) and what this reveals about the possibilities and limits of considering a plurality of cultural backgrounds in technological design. It proposes that cultural alignment cannot be achieved only by adding more examples of “other cultures”, rather it requires plural epistemologies: allowing multiple, locally grounded ways of knowing. To analyze how this plurality of knowing can be addressed in NLP, the paper uses a socio-technical model of language technology (LT) design, the five layers of technological activity model, for collecting and systematizing approaches to culture in NLP. The analysis shows that while NLP research has made progress toward more culturally sensitive systems, many approaches remain partial, addressing “culture” primarily at the level of output or representation while leaving deeper questions of power, governance, and social context unresolved. The paper concludes that operationalising culture requires much more than technical adaptation; it suggests a reflexive and plural socio-technical approach that navigates potentials and limits of computational formalisation for accounting multiple linguistic and socio-cultural backgrounds.

[NLP-104] Can Reasoning Models Detect Changes to their Chains of Thought?

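【Quick read】: This paper asks whether reasoning models notice interventions on their chain of thought (CoT), for example when the CoT is prefilled with reasoning from a stronger model or edited to remove steps that may yield unsafe outputs. The study evaluates recent reasoning models' ability to detect such modifications under several conditions: during reasoning and after it, and when prefilled with their own CoTs or with those of other models. The findings are that (i) models exhibit only very modest detection accuracy; (ii) models struggle to identify how their CoT was modified; and (iii) models are about as good at detecting changes to their own CoTs as to those of other models. This suggests current models are largely blind to CoT interventions, which bears on whether such edits can succeed unnoticed.

Link: https://arxiv.org/abs/2606.22085
Authors: Sathvik Napa,Utkarsh Singh,Chengyuan Xue,Miriam Wanner,William Walden
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract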

Abstract:There are many reasons one may want to edit a model’s chain of thought (CoT) – e.g., to prefill it with reasoning from a stronger model or to remove steps that may yield unsafe outputs. The success of these interventions plausibly depends on a model’s inability to notice them, as the model may alter its behavior if it suspects tampering. In this work, we study whether recent reasoning models are able to detect such interventions on their CoTs under a variety of conditions: both during reasoning and after it, and when prefilled both with their own CoTs and with those of other models. Broadly, we find that (i) models exhibit only very modest detection accuracy; (ii) models struggle to identify how their CoT was modified; and (iii) models are about as good at detecting changes to their own CoTs as to those of other models.

[NLP-105] Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining

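【Quick read】: This paper addresses the fact that encoders for dense-terminology domains such as medicine are pretrained on small, manually curated corpora, which limits scalability and writing-style diversity, a bottleneck that is more severe in non-English clinical settings. It introduces two complementary levers. Medical-term density filtering keeps documents rich in medical terms. Signal-amplifying rephrasing uses a large language model to rewrite web documents into denser variants with broader entity contexts. On French medical NLP, the term-density filter outperforms the widely used educational quality filter on downstream medical tasks, and the two filters complement each other; rephrasing alone improves on raw web data, and mixing it with filtered web data gives the largest gain. The recipe yields FineMed, a French medical pretraining corpus, and DoctoBERT, a family of French medical encoders reported as state of the art on the public DrBenchmark and on a proprietary clinical named entity recognition (NER) task.

Link: https://arxiv.org/abs/2606.22079
Authors: Bofeng Huang,Jacques Sun,Diane Bouchacourt,Nicolas Barascud,Fajwel Fogel
Affiliations: Doctolib
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code, models, and data: this https URL

Click to view abstract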

Abstract:Web data curation has been widely studied for decoder Large Language Model (LLM) pretraining. Encoders for dense-terminology domains such as medicine, by contrast, are pretrained on small, manually-curated corpora that limit scalability and writing style diversity, a bottleneck even more severe in non-English clinical settings. Whether web-scale data curation also benefits encoder Masked Language Modeling (MLM) in a dense-terminology domain remains an open question. To address this, we introduce two complementary levers. Medical-term density filtering selects documents rich in medical terms. Signal-amplifying rephrasing uses an LLM to rewrite documents into denser variants with broader entity contexts. We instantiate the recipe on French medical NLP. The medical-term density filter outperforms the widely-used educational quality filter on downstream medical tasks, and the two complement each other. Signal-amplifying rephrasing alone improves on raw web data, and mixing it with filtered web data produces the largest gain. The recipe yields FineMed, a French medical pretraining corpus, and DoctoBERT, a state-of-the-art French medical encoder family evaluated on both the public benchmark DrBenchmark and a proprietary clinical Named Entity Recognition (NER) task.
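
The term-density filter described above could look roughly like the following sketch: score each document by the share of its tokens found in a medical lexicon and keep documents above a cutoff. The tokenizer, the four-word lexicon, the 0.1 threshold, and both example sentences are assumptions made for illustration; the paper's actual lexicon, matching rules, and threshold are not given in the abstract.

```python
import re

def medical_term_density(text, lexicon):
    """Share of word tokens that appear in the lexicon (single-token matches only)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)

def keep_document(text, lexicon, threshold=0.1):
    """Keep a document when its term density reaches the (assumed) threshold."""
    return medical_term_density(text, lexicon) >= threshold

# Toy French lexicon and documents (assumptions, not from the paper).
LEXICON = {"diabète", "insuline", "glycémie", "hypertension"}
doc_med = "Le diabète se traite par insuline et contrôle de la glycémie"
doc_other = "Le match de football commence à vingt heures ce soir"

kept = [d for d in (doc_med, doc_other) if keep_document(d, LEXICON)]
```

A production filter would also need multi-word terms and inflected forms, which this single-token sketch ignores.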

[NLP-106] NL2Scratch: An Executable Benchmark and Evaluation for Block-Based Programming

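【Quick read】: This paper addresses a gap in natural-language-to-code (NL2Code) research, which has focused on text-based languages and does not fit block-based environments such as Scratch, whose programs are event-driven, visually compositional, and distributed across concurrent scripts. Conventional evaluation based on surface overlap also fails to measure semantic agreement between a description and the generated program. The authors introduce NL2Scratch, an executable benchmark of 311,648 parser-valid NL-program pairs whose programs are extracted from real Scratch projects and paired with semantically aligned descriptions. They propose Semantic Alignment Consistency (SAC), an interpretable slot-level metric of agreement between descriptions and programs on slots such as actions, conditions, and numeric arguments. Using SAC, they build a semantically validated pool of 23,594 examples and a slot-balanced 800-example diagnostic benchmark. Experiments show that models with token-level F1 above 0.93 often fail to attain perfect SAC, particularly on longer examples, with errors concentrated in operational slots, a failure mode largely invisible under conventional metrics.

Link: https://arxiv.org/abs/2606.22061
Authors: Heejin Do,Alexandre Ballenghien,Yang Wu,April Yi Wang
Affiliations: ETH Zurich; ETH AI Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract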

Abstract:Block-based programming environments such as Scratch are widely used in early programming education, yet natural-language-to-code (NL2Code) research has focused primarily on text-based languages. Scratch programs are event-driven, visually compositional, and distributed across concurrent scripts, making conventional NL2Code assumptions and evaluation insufficient. We introduce NL2Scratch, an executable benchmark for natural-language-to-Scratch generation comprising 311,648 parser-valid NL–program pairs, whose program side is extracted from real Scratch projects and paired with semantically aligned NL descriptions. For reliable evaluation beyond surface overlap, we propose Semantic Alignment Consistency (SAC), an interpretable slot-level metric for measuring semantic agreement between descriptions and programs. With SAC, we construct a semantically validated pool of 23,594 examples, and a slot-balanced 800 diagnostic benchmark. Experiments across instruction-tuned and fine-tuned LLMs reveal a notable gap between lexical similarity and semantic alignment: models achieving token-level F1 above 0.93 often fail to attain perfect SAC, particularly on longer examples. Errors concentrate on operational slots like actions, conditions, and numeric arguments, exposing failure modes largely invisible under conventional metrics.
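
The gap between lexical similarity and semantic alignment can be illustrated with a toy case: a generated program that differs from the reference only in one numeric argument still scores a high token-level F1. The sketch below is not the paper's SAC metric, whose definition is not given in the abstract; the textual program rendering, the function name, and the crude numeric-slot check are all assumptions for illustration.

```python
from collections import Counter

def token_f1(reference, candidate):
    """Token-level F1 from multiset overlap of whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def numeric_slots(program):
    """Crude stand-in for a numeric-argument slot: every all-digit token, in order."""
    return [t for t in program.split() if t.isdigit()]

# Hypothetical textual renderings of two Scratch scripts.
reference = "when flag clicked move 10 steps"
candidate = "when flag clicked move 100 steps"

f1 = token_f1(reference, candidate)
numeric_slot_matches = numeric_slots(reference) == numeric_slots(candidate)
```

Here five of six tokens overlap, so F1 is 5/6, yet the sprite would move ten times too far. A slot-level check flags exactly the kind of error that overlap metrics smooth over.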

[NLP-107] Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study INTERSPEECH2026

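【Quick read】: This paper addresses accurate and robust grapheme-to-phoneme (G2P) conversion for Japanese in support of controllable, high-quality text-to-speech (TTS). The approach exploits the broad linguistic knowledge of large language models (LLMs) and compares two prompting strategies: a parse mode, in which the LLM performs morphological analysis and rule-based conversion then produces kana, and a direct mode, in which the LLM predicts kana readings directly. Model size, version, and Japanese-specialized training turn out to be the key factors; the best LLMs achieve a kana character error rate below 0.52%, against 1.03% for the best conventional morphological analyzer. Parse mode beats direct mode for most models because rule-based post-processing relieves the LLM of handling complex pronunciation rules. Feeding LLM-predicted kana into a kana-input TTS system also yields better pronunciation than end-to-end TTS.

Link: https://arxiv.org/abs/2606.22009
Authors: Tomoki Koriyama
Affiliations: CyberAgent (Japan)
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: accepted to Interspeech 2026

Click to view abstract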

Abstract:Grapheme-to-phoneme (G2P) conversion is essential for controllable and robust text-to-speech, and large language models (LLMs), with broad linguistic knowledge, offer a promising approach. We benchmarked over 30 LLMs on Japanese G2P, comparing them with conventional morphological analyzers on 3000 manually annotated sentences. We evaluated two prompting strategies: a parse mode, where the LLM performs morphological analysis followed by rule-based kana conversion, and a direct mode, where the LLM directly predicts kana readings. The results show that model size, version, and Japanese-specialized training are key factors, with the best LLMs achieving kana character error rate below 0.52% vs. the best conventional tool (1.03%). Parse mode outperforms direct mode for most models, as rule-based post-processing relieves the LLM of handling complex pronunciation rules. We also show that feeding LLM-predicted kana into a kana-input TTS yields better pronunciation than end-to-end TTS.
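
The kana character error rate quoted above is, in its usual form, the total character-level edit distance between predicted and reference readings divided by the total number of reference characters. The sketch below assumes that standard definition; the paper's exact normalization and scoring script are not described in the abstract, and the example strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_error_rate(refs, hyps):
    """Corpus-level CER: total edits over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total

# One wrong kana out of five reference characters.
cer = char_error_rate(["コンニチワ"], ["コンニチハ"])  # 0.2
```

At the reported 0.52% level, a system makes roughly one kana error per 190 reference characters.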

[NLP-108] CFAgent Bench: A Reproducible Environment and Benchmark for Autonomous Construction-Finance Agents DATE ACL

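【Quick read】: This paper addresses how to evaluate autonomous construction-finance agents that must act across a real, multi-system software stack. It introduces CFAgentBench, a reproducible, self-hostable environment and benchmark covering the systems a US construction finance team runs: ERP, project management, email, documents, pay applications, payroll, certified payroll, lien waivers, and bank/treasury portals. The benchmark contains 1,014 machine-gradeable task specifications, each grounded in a real source; a self-validated subset of 40 tasks (54 with a project-management extension) is compiled into oracle-validated executable evaluators. Tasks run on an executable environment rather than static traces and are graded by functional correctness (a state diff, forbidden-side-effect checks, and required-output regexes), with an LLM judge used only for reply quality and never as reward. A money-movement guard applies to 278 instances that embed a payment, payroll, e-signature, or e-filing step: the correct behavior is to stop and stage for human approval, and executing even the correct transaction fails the task. In a first three-model open-weight sweep, the strongest agent reaches pass^1 = 0.67 but only pass^5 = 0.38, a 43% loss, indicating that single-attempt accuracy overstates deployable competence and that results vary sharply across domains.

Link: https://arxiv.org/abs/2606.22000
Authors: Rishi Srivastava
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 28 pages, 2 figures, 13 tables. Benchmark, environment spec, and app contract released. First open-weight three-model sweep (k=5) on a 40-task oracle-validated executable suite; frontier-model leaderboard committed in the roadmap

Click to view abstract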

Abstract:We introduce CFAgentBench, a reproducible, self-hostable environment and benchmark for autonomous construction-finance agents: a CFO/controller-class agent operating across the real software stack a US construction finance team runs - ERP, project management, email, documents, pay applications, payroll, certified payroll, lien waivers, and bank/treasury portals. It contains 1,014 machine-gradeable task specifications across 8 domains and 77 families, every family grounded in a real source; a self-validated subset of 40 tasks (54 with a project-management extension) is compiled into oracle-validated executable evaluators, the runnable suite reported here. Following WebArena, the benchmark runs on an executable environment rather than static traces: 35 mock applications (31 reconciled to one company book, plus 4 PM platforms) over 9 archetypes, each implementing a uniform self-hostable app contract, so every task is graded by functional correctness - a state diff plus forbidden-side-effect checks plus required-output regexes - with an LLM judge used only for reply quality, never as reward. A distinguishing principle is a money-movement guard: 278 instances embed a payment, payroll, e-signature, or e-filing step where the correct behavior is to stop and stage for human approval, and executing even the correct transaction fails the task. The public split (n=711) is sized for a 95% Wilson half-width of +/-4.1%; a private, contamination-protected split (n=303) is reserved for remote scoring. In a first three-model open-weight sweep (k=5), the strongest agent reaches pass^1 = 0.67 but only pass^5 = 0.38 - losing 43% of its successes when required to repeat them under temperature-0 decoding. The within-model pass^1 to pass^5 collapse and sharp per-domain heterogeneity are clear evidence that single-attempt accuracy overstates deployable construction-finance competence.
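
The pass^1 versus pass^5 comparison can be sketched as follows, assuming pass^k means the probability that k independent attempts at a task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and averaged over tasks. That estimator is an assumption borrowed from common agent-benchmark practice; the abstract does not spell out the paper's formula. The per-task success counts are toy data; only 0.67 and 0.38 come from the source.

```python
from math import comb

def pass_hat_k(successes_per_task, n_trials, k):
    """Average over tasks of the chance that k sampled trials all succeeded."""
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Toy data (assumption): successes out of 5 trials for six tasks.
successes = [5, 5, 3, 4, 0, 5]
p1 = pass_hat_k(successes, n_trials=5, k=1)  # mean single-attempt success rate
p5 = pass_hat_k(successes, n_trials=5, k=5)  # share of tasks solved in all 5 trials

# Relative loss implied by the figures quoted in the abstract.
relative_loss = (0.67 - 0.38) / 0.67
```

With the abstract's numbers, the relative loss is about 0.43, matching the stated 43% of successes lost when the agent must repeat them.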

[NLP-109] Adding Robust Code-Switching Capabilities to High Performance Multilingual ASR INTERSPEECH2026

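Quick read: This paper addresses code-switching (CSW), which remains a challenge for large multilingual automatic speech recognition (ASR) systems in real-world deployment. Fine-tuning on synthetic code-switched data improves CSW handling but generally degrades strong monolingual baselines. To avoid this, the paper proposes Bayesian factorized adaptation, whose key idea is to integrate switching-relevant knowledge into a pretrained model efficiently and non-destructively, without overwriting existing language capabilities. With only a small amount of synthetic data, the method reduces transcription errors on code-switched words by 32.87% and improves overall word error rate (WER) by 5.31% while maintaining monolingual performance. The results suggest that effective CSW adaptation depends more on how knowledge is integrated than on data complexity.

Link: https://arxiv.org/abs/2606.21990
Authors: Enes Yavuz Ugan, Alexander Waibel
Institutions: Interactive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany; InterACT, Carnegie Mellon University (CMU), USA
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2026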

Abstract:Code-switching (CSW) remains challenging for large multi-lingual ASR systems in real-world deployment. While fine-tuning on synthetic CSW data is possible, it generally degrades strong monolingual baselines. Our goal is to preserve these capabilities while extending models to handle complex code-switching, including morphological variations across languages. We propose Bayesian factorized adaptation, which learns to efficiently integrate switching-relevant knowledge into strong pretrained models without overwriting existing capabilities. Requiring only a small amount of synthetic data, our approach reduces transcription errors by 32.87% on code-switched words while improving overall WER by 5.31%, all while maintaining mono-lingual performance. Our results demonstrate that effective CSW adaptation depends more on knowledge integration than data complexity.

[NLP-110] Can LLMs Control Readability? A Multi-Dimensional Evaluation Framework for CEFR-Controlled Arabic Generation LREC2026

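Quick read: This paper examines whether large language models (LLMs) can reliably control readability when generating Arabic text. Current models produce fluent Arabic, but how precisely they can control linguistic complexity is unclear, which limits their use in adaptive language-learning systems. The paper proposes a multi-dimensional framework for evaluating instruction-following LLMs on Arabic generation constrained by the Common European Framework of Reference for Languages (CEFR). The framework combines controlled prompting, automatic readability prediction with a validated Taha-19 model, lexical constraint validation, and syntactic complexity profiling. CEFR-guided prompting with lexical constraints achieves the highest conformity to reference linguistic profiles (0.91 cosine similarity) and near-perfect agreement with predicted readability levels (0.99), whereas unconstrained prompting shows weak control. These findings provide an empirical basis for integrating readability-aware Arabic generation into adaptive educational systems.

Link: https://arxiv.org/abs/2606.21981
Authors: Nour Rabih, Chatrine Qwaider, Ted Briscoe
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, READIxTSAR Workshop, LREC 2026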

Abstract:While Large Language Models (LLMs) can generate fluent Arabic text, their ability to reliably control readability levels remains unclear. We propose a multi-dimensional evaluation framework for Common European Framework of Reference for Languages (CEFR)-controlled Arabic text generation, assessing whether instruction-following LLMs can serve as reliable generators for adaptive language learning. Our framework integrates controlled prompting, automatic readability prediction using a validated Taha-19 model, lexical constraint validation, and syntactic complexity profiling. Results show that structured prompting substantially improves CEFR alignment. In particular, CEFR-guided prompting with lexical constraints achieves the highest conformity to reference linguistic profiles (0.91 cosine similarity) and near-perfect agreement with predicted readability levels (0.99), while unconstrained prompting exhibits weak control. These findings establish an empirical foundation for integrating readability-aware Arabic text generation into adaptive educational systems.

[NLP-111] Look Before You Zoom: Adaptive Routing for the Resolution-Context Trade-off in Visual RAG

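Quick read: This paper addresses the drop in vision-language model (VLM) performance as query-relevant objects become smaller. Existing training-free methods dynamically retrieve and zoom into local image regions, but applying retrieval indiscriminately ignores the resolution-context trade-off. Patch-based zooming recovers detail for small targets but can split large objects and destroy global spatial context; attention-based retrieval preserves large objects better but is less reliable on tiny details; and global perception is often fastest when retrieval is unnecessary. The paper proposes ViRGo (Visual Retrieval or Global Perception), a lightweight framework that treats visual retrieval as an adaptive routing problem. During the initial forward pass, ViRGo estimates object scale from the VLM's intrinsic localization heads and combines it with semantic token confidence to choose among global perception, patch-based retrieval, and attention-based retrieval with minimal extra computation. Across multiple visual question answering (VQA) benchmarks and object-size groups, ViRGo improves the accuracy-efficiency trade-off: it matches patch retrieval on small details, uses attention-based retrieval for larger objects, and reduces inference time by routing to the global baseline when zooming is unnecessary.

Link: https://arxiv.org/abs/2606.21968
Authors: Oanh N. Tran, Thanh Quoc Hung Le, Oscar Chew, Kuan-Hao Huang, Khoa D. Doan
Institutions: VinUni-Illinois Smart Health Center, VinUniversity; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: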

Abstract:Vision-Language Models (VLMs) struggle as query-relevant objects become smaller. To address this, recent training-free approaches dynamically retrieve and zoom into local image regions. However, we show that indiscriminately applying retrieval ignores a critical vulnerability: the resolution-context trade-off. Patch-based zooming recovers details for small targets, but can split large objects and destroy global spatial context; attention-based retrieval better preserves large objects, but remains less reliable on tiny details; and global perception is often fastest when retrieval is unnecessary. Motivated by these failure modes, we introduce ViRGo (Visual Retrieval or Global Perception), a lightweight framework that formulates visual retrieval as an adaptive routing problem. ViRGo estimates object scale from the VLM’s intrinsic localization heads during the initial forward pass and combines it with semantic token confidence to select between global perception, patch-based retrieval, and attention-based retrieval with minimal additional computation. Experiments across multiple VQA benchmarks and object-size groups show that ViRGo improves the accuracy-efficiency trade-off: it matches patch retrieval on small details, leverages attention-based retrieval for larger objects, and reduces inference time by routing to the global baseline when zooming is unnecessary.
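The routing idea can be illustrated with a toy decision rule. This is only a sketch under stated assumptions: the abstract does not give ViRGo's actual rule, so the function name, the two inputs (an estimated object-to-image area ratio and a token confidence in [0, 1]), and both thresholds are hypothetical.

```python
def route(object_area_ratio: float, token_confidence: float,
          small_object: float = 0.05, confident: float = 0.8) -> str:
    """Pick a perception path from two signals (hypothetical thresholds).

    object_area_ratio: estimated fraction of the image the target covers.
    token_confidence: confidence of the model's first-pass answer tokens.
    """
    if object_area_ratio < small_object:
        # Tiny targets need the extra resolution of patch-based zooming.
        return "patch_retrieval"
    if token_confidence >= confident:
        # Large target and a confident first pass: skip retrieval entirely.
        return "global_perception"
    # Large target but low confidence: retrieve without splitting the object.
    return "attention_retrieval"

decisions = [route(0.01, 0.9), route(0.4, 0.95), route(0.4, 0.3)]
```

The ordering of the checks reflects the trade-off described in the abstract: zooming is reserved for small targets, and the cheapest path is taken whenever the first pass already looks reliable.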

[NLP-112] OpenBioRQ: Unsolved Biomedical Research Questions for Agents

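Quick read: This paper addresses unsupported citations in agentic AI: a link may resolve while the cited paper does not actually support the claim. Existing benchmarks miss this failure mode, because when a question has a fixed answer key a model can reproduce the expected source from the key rather than independently verify that the source supports the claim. The author introduces OpenBioRQ, a retrieval-grounded agentic benchmark of 12,553 unsolved biomedical research questions across 12 domains. It is described as the first biomedical benchmark to combine an agentic setting requiring multiple tool calls with unsolved questions that have no answer key, and openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is anchored empirically on questions that three open-weight reference models fail to answer, not on subjective labels. On the hardest subset, held-out models from the same lineage as the anchors solve only about 17%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span 29-60%, so the benchmark is non-saturating (the best agent still leaves about 33-40% unsolved) and discriminates across capability tiers. The study also observes "agentic collapse" on the hardest questions, where agents stop using their tools; for the most collapse-prone model, blocking tool access barely changes its score, so tools fail exactly where they are most needed. A frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82.

Link: https://arxiv.org/abs/2606.21959
Authors: Minbyul Jeong
Institutions: Upstage AI
Categories: Computation and Language (cs.CL)
Comments: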

Abstract:A working citation looks like proof, but the fact that a link resolves does not mean the cited paper supports the claim. I find that current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% link to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. I introduce OpenBioRQ, a retrieval-grounded agentic benchmark of 12,553 unsolved biomedical research questions across 12 domains that treats open questions as a faithfulness-and-abstention probe. To my knowledge, this is the first biomedical benchmark to combine an agentic setting, where the model must issue multiple tool calls, with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model’s parametric knowledge. Difficulty is empirical: I anchor it on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only ~17%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span a wide 29-60% range. The benchmark is thus hard, non-saturating (the best agent still leaves ~33-40% unsolved), and discriminating across capability tiers. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access entirely barely changes its score, so tools stop paying off exactly where they are needed most. A frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82.

[NLP-113] Are Multilingual Models Actually Improving? Isolating True Cross-Lingual Transfer

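Quick read: This paper addresses a core flaw in how cross-lingual transfer is measured: existing metrics of transfer strength conflate gains in transfer with general accuracy gains in the source language, which distorts evaluation. The authors propose a more reliable metric, the Hardness Adjusted Transfer (HAT) score, which corrects for hardness to separate source-language improvement from genuine change in transfer, and so more accurately measures how well a model generalizes from well-represented source languages to under-represented target languages. An analysis of twenty diverse language models on three popular multilingual benchmarks yields three findings: (1) transfer in small models is not broken; (2) transfer improves more slowly with model size than expected; and (3) cross-lingual transfer has made clear progress over time.

Link: https://arxiv.org/abs/2606.21954
Authors: Prasoon Bajpai, Eleftheria Briakou, Colin Cherry, Preethi Jyothi, Vihari Piratla
Institutions: Google DeepMind; Indian Institute of Technology Bombay
Categories: Computation and Language (cs.CL)
Comments: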

Abstract:Cross-lingual transfer is a model’s ability to generalize capabilities from well-represented source languages to under-represented target languages. Existing measures of a model’s transfer strength conflate improvements in transfer with general improvements to accuracy in the source language. We advocate for an alternate metric that reliably captures transfer strength called Hardness Adjusted Transfer (HAT) Score, and use it to derive multiple insights on factors influencing transfer strength. Our analysis across twenty diverse language models and three popular mainstream multilingual benchmarks argues that 1) transfer in small models is not broken, 2) we are making slower than expected progress in cross-lingual transfer with model size, and 3) we have made clear progress over time.

[NLP-114] CapRiCorn-1K: A Comprehensive Benchmark for Video Captioning and Subject Referential Consistency Across Temporal Scales

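Quick read: This paper addresses the lack of benchmarks that objectively and comprehensively evaluate video caption quality and subject referential consistency across diverse durations and scenarios. It proposes CapRiCorn-1K, a benchmark covering long temporal horizons and diverse video domains that supports both audiovisual and visual-only settings. Experiments show that current models generally struggle to produce accurate, comprehensive captions with consistent subject references, and that both overall caption quality and referential consistency decline as video duration increases. The benchmark's metrics correlate strongly with the performance of downstream understanding and generation tasks conditioned on the generated captions, which supports their validity.

Link: https://arxiv.org/abs/2606.21949
Authors: Xinlong Chen, Jiafu Tang, Yue Ding, Yizhuo Jia, Bozhou Li, Bohan Zeng, Yang Shi, Shihao Li, Yiyan Ji, Qiang Liu, Weihong Lin, Yuanxing Zhang, Pengfei Wan, Liang Wang, Tieniu Tan
Institutions: NLPR, CASIA (Institute of Automation, Chinese Academy of Sciences); UCAS (University of Chinese Academy of Sciences); Kling Team (Kuaishou Technology); NJU (Nanjing University); FDU (Fudan University); PKU (Peking University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: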

Abstract:Accurate and comprehensive video captions with consistent subject references are critical for downstream understanding and generation tasks. However, few existing benchmarks can objectively and comprehensively evaluate these properties across diverse durations and scenarios, thereby hindering the advancement of video captioning models. To bridge this gap, we propose CapRiCorn-1K, a comprehensive benchmark designed to evaluate both video captioning quality and subject referential consistency across long temporal horizons and diverse video domains. To accommodate varied evaluation needs, our benchmark supports both audiovisual and visual-only settings. Extensive experiments on CapRiCorn-1K reveal that current models generally struggle to generate accurate and comprehensive captions while maintaining consistent subject references. Moreover, as video duration increases, both the overall caption quality and subject referential consistency decline. Notably, our evaluation metrics exhibit strong correlations with the performance of downstream understanding and generation tasks conditioned on the generated captions, further validating their effectiveness. The project is available at this https URL .

[NLP-115] Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts ACL2026

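Quick read: This paper addresses a limitation of current evaluations of large language models (LLMs) in settings that require complex moral reasoning and value trade-offs: item-level behavioral metrics do not capture how a model structurally prioritizes competing values as a system. The proposed solution is a symmetric human-LLM evaluation framework grounded in Q methodology, in which humans and models sort the same 140 moral statements into a shared nine-column forced distribution so that value-structure alignment can be quantified. A human reference sample (N=35) yields a stable three-factor reference geometry. Twelve LLMs from four model families are evaluated through 240 replicated Q-sorts at two temperature settings, with structural alignment measured by Procrustes similarity (φ) and RSA-based Spearman correlation (ρ). The results show significant cross-family heterogeneity, model-specific sensitivity to generation stochasticity, and localized misalignment, indicating that favorable global scores can hide regional distortions. Rank-based and bucket-based analyses remain highly consistent, but prompt phrasing introduces notable variance. The study concludes that assessing value-structure alignment is a structural complement to traditional itemwise moral benchmarks.

Link: https://arxiv.org/abs/2606.21939
Authors: Jingting Zheng, Yuqi Ren, Linhao Yu, Yongqi Leng, Deyi Xiong (TJUNLP Lab, School of Computer Science and Technology, Tianjin University, Tianjin, China)
Institutions: TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China
Categories: Computation and Language (cs.CL)
Comments: 32 pages, 8 figures, 16 tables; accepted to ACL 2026 Main Conference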

Abstract:Large Language Models (LLMs) are increasingly deployed in contexts requiring complex moral reasoning and value trade-offs. However, existing evaluations typically rely on item-level behavioral metrics, which fail to capture how models structurally prioritize competing values as a cohesive system. To address this, we propose a symmetric human-LLM evaluation framework, grounded in Q methodology, to measure value-structure alignment. Under our protocol, humans and models sort an identical 140-item moral statement set into a shared nine-column forced distribution; for LLMs, we elicit strict rankings and deterministically map them to Q-sort buckets. Using a human reference sample (N=35), we establish a stable three-factor reference geometry specific to this instrument and sample. We evaluate 12 LLMs across four model families via 240 replicated Q-sorts at two temperature settings, quantifying structural alignment via Procrustes similarity (φ) and RSA-based Spearman correlation (ρ). Our results reveal significant cross-family heterogeneity, model-specific sensitivity to generation stochasticity and localized misalignment, which demonstrate that favorable global scores can obscure underlying regional distortions. While rank- and bucket-based analyses remain highly consistent, prompt phrasing introduces notable variance. Ultimately, assessing value-structure alignment provides a crucial structural complement to traditional itemwise moral benchmarks.
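The step of deterministically mapping a strict ranking to Q-sort buckets can be sketched as follows. The column sizes are an assumption chosen only so that nine columns sum to 140 items; the abstract does not give the actual forced distribution, and the function and variable names are hypothetical.

```python
# Hypothetical forced distribution: nine columns (-4 .. +4) summing to 140.
COLUMN_SIZES = [6, 10, 16, 22, 32, 22, 16, 10, 6]

def ranking_to_qsort(ranking, column_sizes):
    """Map a strict ranking (least endorsed first) to Q-sort column scores.

    The first column_sizes[0] items go to the lowest column, the next
    column_sizes[1] items to the second column, and so on.
    """
    if sum(column_sizes) != len(ranking):
        raise ValueError("column sizes must sum to the number of items")
    offset = len(column_sizes) // 2  # centers the scores on zero
    scores = {}
    position = 0
    for column, size in enumerate(column_sizes):
        for item in ranking[position:position + size]:
            scores[item] = column - offset
        position += size
    return scores

# Toy ranking of 140 statement ids, least endorsed first.
columns = ranking_to_qsort([f"s{i}" for i in range(140)], COLUMN_SIZES)
```

Because the mapping depends only on rank position, two runs that produce the same ranking always produce the same Q-sort, which is what makes rank-based and bucket-based analyses comparable.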

[NLP-116] Latent Confidence Alignment for LLM Self-Assessment

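Quick read: This paper addresses the fact that confidence-calibration evaluation for large language models (LLMs) does not model item difficulty. Without it, discrepancies between confidence and accuracy are hard to interpret, and it is unclear whether a model's confidence reflects genuine metacognitive self-assessment or is merely a byproduct of generation. The solution adopts a Rasch model-based latent ability framework and a metacognitive perspective, and proposes Latent Confidence Alignment Error (LCAE), which measures the consistency between a model's self-assessment and the latent error probability implied by model ability and item difficulty. The method further supplies item difficulty as an external signal combined with a reasoning mechanism. Experiments on a medical-domain dataset with 20 models show improved self-assessment quality without affecting model ability, and reveal an association between reliability and inference cost.

Link: https://arxiv.org/abs/2606.21937
Authors: Ting-Yu Chen, Tingting Yu, Pei-Cing Huang, Chan Hsu, Ming-Yen Lin, Yihuang Kang
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 2026 IEEE 27th International Conference on Information Reuse and Integration for Data Science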

Abstract:Confidence calibration in large language models (LLMs) is commonly evaluated by comparing predicted confidence with observed accuracy. However, such approaches do not model item difficulty, making it difficult to interpret discrepancies and to determine whether model confidence reflects genuine self-assessment or is merely a byproduct of the response generation process. To address this, we adopt a Rasch model-based latent ability framework and a metacognitive perspective, and propose Latent Confidence Alignment Error (LCAE) to measure the consistency between model self-assessment and the latent error probability implied by model ability and item difficulty. We further incorporate item difficulty as an external signal with a reasoning mechanism. Experiments on a medical-domain dataset with 20 models show that the proposed approach improves self-assessment quality without affecting model ability, and reveals an association between reliability and inference cost.
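A minimal sketch of the idea, assuming the standard one-parameter Rasch model, where the probability of a correct answer is the logistic function of ability minus difficulty. The aggregation used for LCAE here (mean absolute gap between the stated and latent error probabilities) is an assumption for illustration; the abstract does not give the paper's exact formula, and the function names are hypothetical.

```python
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def lcae(confidences, ability, difficulties):
    """Assumed form of LCAE: mean |stated error - latent error| over items."""
    gaps = []
    for confidence, difficulty in zip(confidences, difficulties):
        latent_error = 1.0 - rasch_p_correct(ability, difficulty)
        stated_error = 1.0 - confidence
        gaps.append(abs(stated_error - latent_error))
    return sum(gaps) / len(gaps)

# An item whose difficulty equals the model's ability has a 50% latent
# error probability, so 90% stated confidence is off by 0.4.
overconfident = lcae([0.9], ability=0.0, difficulties=[0.0])
aligned = lcae([0.5], ability=0.0, difficulties=[0.0])
```

Unlike a plain confidence-versus-accuracy comparison, the reference point here moves with item difficulty, so high confidence on an easy item and the same confidence on a hard item are scored differently.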

[NLP-117] MindTailor: Personalized Emotional Support via Post History-Grounded Case Formulation and Collaborative Refinement

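Quick read: This paper addresses a gap in personalized emotional support systems: they capture the seeker's current emotional state, persona, and situational context but overlook the formative experiences that shape present concerns. It proposes MindTailor, a framework that constructs a case formulation from the seeker's post history and iteratively refines responses through collaborative critique among counselor agents grounded in distinct counseling strategies. To support this history-aware task, the authors build ReddiSupp, a dataset of 798 Reddit posts paired with seekers' prior post histories. LLM-as-a-Judge evaluation, expert human evaluation, and a user study with seekers show that MindTailor outperforms baselines, improving empathy, personalization, and understanding and achieving the highest overall preference.

Link: https://arxiv.org/abs/2606.21930
Authors: Suhyun Han, Kyunghyun Cho, JinYeong Bak
Institutions: Sungkyunkwan University; New York University
Categories: Computation and Language (cs.CL)
Comments: 45 pages, 21 figures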

Abstract:As mental health concerns continue to rise globally, social media has emerged as a vital space where individuals seek emotional support. While prior work on personalized emotional support has leveraged seekers’ emotional states, personas, and situational context, these approaches primarily capture the seeker’s current state, overlooking the formative experiences that shape present concerns. In this work, we propose MindTailor, a framework that generates personalized emotional support responses by constructing a case formulation from the seeker’s post history and iteratively refining responses through collaborative critique among counselor agents grounded in distinct counseling strategies. To enable research on this history-aware task, we construct ReddiSupp, a dataset of 798 Reddit posts paired with seekers’ prior post histories. Through LLM-as-a-Judge evaluation, expert human evaluation, and a user study with seekers, we demonstrate that MindTailor outperforms baselines across these evaluations, improving empathy, personalization, understanding, and achieving the highest overall preference.

[NLP-118] Pre-Generation Hallucination Detection in Large Language Models via Soft-Target Attention Probing

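Quick read: This paper addresses the detection of hallucination risk before generation, which enables abstention, retrieval augmentation, and routing decisions without incurring the cost of decoding. Existing approaches treat the problem as binary classification over a single decoded output, ignoring the stochastic nature of generation. The paper instead formulates it as a risk-estimation problem and introduces soft-target supervision based on the empirical answer error rate over stochastically sampled outputs, an estimator the authors prove to be the unique unbiased minimum-variance estimator of the model's per-prompt error probability under its sampling distribution. It also adapts attention probing to the pre-generation setting, so that the detector can selectively aggregate hallucination-relevant prompt representations. Across three question-answering benchmarks and five models, attention probing outperforms linear probing on short-answer tasks, and replacing binary labels with soft targets consistently improves detection quality further.

Link: https://arxiv.org/abs/2606.21917
Authors: Amina Miftakhova, Alexey Zaytsev
Institutions: Applied AI Institute
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: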

Abstract:Detecting hallucination risk before generation enables abstention, retrieval augmentation, and routing decisions without incurring the cost of decoding. While prior work has shown that such risk can be estimated from a model’s internal representations, existing approaches treat this as binary classification over a single decoded output. We instead formulate it as a risk-estimation problem. Under this formulation, we introduce soft-target supervision based on the empirical answer error rate over stochastically sampled outputs - an estimator we prove to be the unique unbiased minimum-variance estimator of the model’s per-prompt error probability under its sampling distribution. We further adapt attention probing to the pre-generation setting, enabling the detector to selectively aggregate hallucination-relevant prompt representations. Across three question-answering benchmarks and five models, attention probing outperforms linear probing on short-answer tasks. Replacing binary labels with soft-target supervision further and consistently improves detection quality.
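The soft target described above is the fraction of sampled answers that are wrong. A minimal sketch, assuming each prompt comes with k sampled answers already graded as correct or incorrect; the function names are hypothetical, and the cross-entropy loss is a common choice for training against soft labels rather than something the abstract specifies.

```python
import math

def soft_target(sample_correct):
    """Empirical error rate over k stochastically sampled answers."""
    return 1.0 - sum(sample_correct) / len(sample_correct)

def soft_cross_entropy(predicted_risk, target, eps=1e-7):
    """Binary cross-entropy against a soft label (assumed training loss)."""
    p = min(max(predicted_risk, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Eight sampled answers for one prompt: five correct, three wrong.
graded = [True, False, False, True, True, False, True, True]
target = soft_target(graded)

# The loss is smallest when the probe predicts the soft target itself.
loss_at_target = soft_cross_entropy(target, target)
loss_elsewhere = soft_cross_entropy(0.5, target)
```

A binary label from a single decoded answer would mark this prompt as either 0 or 1; the soft target of 0.375 keeps the information that the model is right more often than not.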

[NLP-119] Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

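Quick read: This paper revisits the convention that autoregressive generation in large language models (LLMs) decodes from the final layer on the assumption that deeper representations give more reliable next-token predictions. It identifies a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. The solution is Confident Decoding, a training-free strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. Layer selection is formulated as an optimal stopping problem: under bounded projection noise and dominant late-stage alignment perturbation, the search rule filters the perturbation while bounding the loss relative to the oracle refinement layer. Experiments on dense and Mixture-of-Experts LLMs show consistent gains on challenging reasoning benchmarks (GPQA-Diamond, Omni-MATH, and HLE) with zero memory overhead and less than 2% added latency.

Link: https://arxiv.org/abs/2606.21906
Authors: Xuanming Zhang, Sining Zhoubian, Yuxuan Chen, Tianyi Tang, An Yang, Sean Du, Chujie Zheng, Fei Huang, Dayiheng Liu, Gao Huang, Jingren Zhou
Institutions: Qwen Team, Alibaba Inc.; Tsinghua University; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: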

Abstract:Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
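An entropy-guided backward search over near-final layers might look like the sketch below. It is an illustration under assumptions, not the paper's rule: it supposes each layer's hidden state has already been projected to a next-token distribution, and the window size, the entropy margin, and the stop-at-first-non-improvement policy are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_layer(layer_probs, window=3, margin=0.1):
    """Walk back from the final layer while entropy keeps dropping.

    layer_probs: one next-token distribution per layer, earliest first.
    window: how many layers before the final one may be considered.
    margin: minimum entropy reduction required to move one layer back.
    """
    final = len(layer_probs) - 1
    best, best_entropy = final, entropy(layer_probs[final])
    for index in range(final - 1, max(-1, final - window - 1), -1):
        candidate = entropy(layer_probs[index])
        if candidate < best_entropy - margin:
            best, best_entropy = index, candidate
        else:
            break  # conservative: stop at the first non-improving layer
    return best

# Toy three-layer example: coarse guess, confident refinement, perturbed final.
toy_layers = [
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
]
chosen = select_layer(toy_layers)
```

In the toy example the search moves from the final layer to the middle one and then stops, because the first layer's uniform guess has higher entropy than the refined prediction.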

[NLP-120] Olfactory-Inspired Sparse Combinatorial Coding for Low-Resource Named Entity Recognition

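Quick read: This paper addresses named entity recognition (NER) in low-resource languages, where labeled data is scarce and high-quality pretrained embeddings are lacking. It proposes a receptor-glomerular bottleneck, inspired by biological olfaction, placed between standard token embeddings and a BiLSTM-CRF sequence model. The architecture mimics the sparse combinatorial coding of olfactory receptors and glomeruli, and the bottleneck acts mainly as a strong regularizer under severe data scarcity. On six multilingual datasets trained entirely from scratch without pretrained embeddings, at least one olfactory-inspired configuration achieves the highest mean F1 across all six datasets in the strict 1,000-sentence setting. For most languages these gains are near-ties with generic bottleneck controls, but for languages such as Bangla the olfactory architecture gives a significant advantage (+6.23% F1 over the standard baseline and +8.47% F1 over the best control baseline), where generic bottlenecks degrade performance. Improvements also appear for ultra-low-resource Telugu at full scale (+4.43% F1), and sparse specialization emerges naturally in the receptor layer. The results suggest that structured sparse coding inspired by olfactory networks is an effective inductive bias and regularizer when representations must be learned from limited or noisy supervision.

Link: https://arxiv.org/abs/2606.21895
Authors: Bhushan Deshpande
Institutions: Independent Researcher, India
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 8 figures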

Abstract:Named Entity Recognition (NER) in low-resource languages suffers from limited supervision and a lack of high-quality pretrained embeddings. Biological olfaction, which relies on sparse combinatorial coding through receptor and glomerular organization, offers a compelling paradigm for learning robust representations under uncertainty. In this paper, we introduce a receptor-glomerular bottleneck - a novel, biologically-inspired olfactory architecture - between standard token embeddings and a BiLSTM-CRF sequence model. We evaluate our architecture across six multilingual datasets trained entirely from scratch (without pre-trained embeddings) under varied data-scale conditions, including a strict 1k-sentence low-resource control. Our results demonstrate that introducing a representation bottleneck yields F1 score improvements under severe data scarcity, primarily by acting as a powerful regularizer. Under the 1k capped training condition, at least one olfactory-inspired configuration achieves the highest mean F1 score across all six datasets. While these improvements represent near-ties with generic bottleneck controls for most languages, the olfactory architecture provides a significant advantage in languages like Bangla (+6.23% F1 over standard baseline and +8.47% F1 over the best control baseline) where generic bottlenecks degrade performance. We also observe improvements in the ultra-low-resource Telugu setting (+4.43% F1) at full-scale, and find that sparse specialization naturally emerges within the receptor layer. Our findings suggest that structured sparse coding inspired by olfactory networks serves as an effective inductive bias and regularizer when representations must be learned from limited or noisy supervision.

[NLP-121] Learning the ARTS of Search for Automated Discovery

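Quick read: This paper addresses two weaknesses of hypothesis-and-experiment search for automated scientific discovery. Heuristics such as Monte Carlo tree search (MCTS) conflate the merit of a hypothesis with the quality of its experimental execution, so a promising hypothesis with a preliminary execution is ranked below a modest one whose execution is refined. In addition, prior methods prune the search logs as the accumulated history outgrows the context window, which discards information. The proposed Agentic Reasoning for Tree Search (ARTS) deploys a reasoning language model that inspects prior execution logs, diagnoses whether earlier failures came from faulty implementations or bad hypotheses, and selects the hypothesis to build on next. To mitigate context-length limits, ARTS uses test-time training to instill knowledge of the search tree in the model weights. Across 22 tasks from MLGym and MLEBench, ARTS outperforms leading algorithms with over 15.3% relative improvement in normalized score. With test-time training, a Qwen3-4B agent matches closed-source frontier models such as Gemini-3 Pro and GPT o3-reasoning at up to 5x lower inference cost. On partially observable reinforcement learning tasks, the test-time-trained Qwen3-4B scientist surpasses ARTS with the o3 scientist by rediscovering the human-best recurrent-memory solution that heuristic methods prune away.

Link: https://arxiv.org/abs/2606.21891
Authors: Gurusha Juneja, Arnav Kumar Jain, Deepak Nathani, William Yang Wang, Xin Eric Wang
Institutions: University of California, Santa Barbara; Université de Montréal and Mila - Quebec AI Institute
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: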

Abstract:Scientific discovery can be formulated as an iterative search process over the space of hypotheses and experiments. Contemporary methods navigate this space using heuristics such as MCTS. These algorithms conflate the merit of a hypothesis with the quality of its experimental execution. A promising hypothesis with preliminary execution is therefore ranked below a modest hypothesis whose execution is refined. Moreover, prior methods prune the search logs as the search progresses because the accumulated history outgrows the context window. We propose Agentic Reasoning for Tree Search (ARTS), where we deploy a reasoning language model to navigate this space. The model inspects prior execution logs, diagnoses whether earlier failures arose from faulty implementations or bad hypotheses, and selects the hypothesis to build on next. To mitigate challenges with context length, ARTS uses test-time training to instill the knowledge of the search tree in the model weights. Across 22 tasks from MLGym and MLEBench, we show that ARTS outperforms leading algorithms, with over 15.3% relative improvement in the normalized score. With test-time training we show that a Qwen3-4B agent can match the performance of closed-source frontier models like Gemini-3 Pro and GPT o3-reasoning with up to 5x lower inference cost. We further observe that on partially observable RL tasks, the test-time trained Qwen3-4B scientist surpasses ARTS with the o3 scientist by rediscovering the human-best recurrent-memory solution that heuristic methods prune away.

[NLP-122] Scaling Performance and Low-Resource Annotation with Many-Shot In-Context Learning for Named Entity Recognition ACL2026

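Quick read: This paper addresses the gap between in-context learning (ICL) with large language models (LLMs) and fully supervised models such as fine-tuned BERT on named entity recognition (NER). Prior ICL studies for NER focus mainly on few-shot settings, and scaling to hundreds of demonstrations has not been thoroughly investigated. The paper systematically evaluates many-shot ICL for NER across multiple LLMs and domains, and tests it as a data annotation framework for low-resource settings. The key findings are that (1) scaling to hundreds of in-context examples lets LLMs match or even surpass fully supervised BERT models, and (2) with about one hundred human-labeled examples as demonstrations, many-shot in-context annotation generates high-quality labeled data that, when used to fine-tune BERT on low-resource NER, yields approximately 10% absolute F1 improvement over existing state-of-the-art approaches.

Link: https://arxiv.org/abs/2606.21890
Authors: Qi Zhang, Fangping Lan, Cornelia Caragea, Longin Jan Latecki, Eduard Dragut
Institutions: Temple University; University of Illinois Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2026 Findings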

Abstract:In-context learning (ICL) with large language models (LLMs) has emerged as a powerful alternative to fine-tuning for Named Entity Recognition (NER), achieving strong performance with minimal annotation and no additional training. However, prior work has shown that despite their adaptability, LLMs still lag behind fully supervised models such as fine-tuned BERT in structured tasks like NER. While existing studies on ICL for NER have mainly explored few-shot settings, the potential of scaling to hundreds of demonstrations has not been thoroughly investigated. To address this gap, we conduct a comprehensive investigation of many-shot ICL for NER and further explore its effectiveness in annotating and refining data for low-resource NER tasks. Specifically, we evaluate various LLMs across multiple domains using hundreds of ICL examples and then assess the feasibility of using many-shot ICL as a data annotation framework. Our experiments demonstrate that: (1) scaling to hundreds of in-context examples enables LLMs to match or even surpass the performance of fully supervised BERT models; and (2) using about one hundred human-labeled examples as demonstrations, many-shot in-context annotation can generate high-quality labeled data, leading to approximately 10% absolute F1 improvement over existing state-of-the-art approaches when used to fine-tune BERT on low-resource NER.

[NLP-123] A Verifiable Search Is Not a Learnable Chain-of-Thought

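Quick read: This paper examines why some reasoning procedures cannot be taught to a model as a standard chain-of-thought (CoT) even though the model has the capability to carry out each computation and judgment. When a task depends on backtracking search over information-free structure, the model can do each arithmetic step correctly and rank the correct solution among its top candidates, yet it cannot carry the search forward as a left-to-right derivation. The underlying issue is that such tasks are global search rather than locally verifiable stepwise reasoning, so no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search: precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification. A controlled intervention supports this: revealing the cipher key, which turns the derivation forward, lifts the same cryptarithm instances from 0.03 to 0.57. The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting. What distills is memorization and verification, not search.

Link: https://arxiv.org/abs/2606.21884
Authors: Harsh Patel
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 31 pages, 6 figures, 16 tables; Interactive walkthrough: this https URL ; Code, solvers, and per-row eval data: this https URL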

Abstract:It is tempting to assume any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them as chain-of-thought, and distill into a rank-32 LoRA over a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (= 0.99 and 0.68). Cryptarithm does not: distilling its backtracking search holds at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time (“verdict-as-token”). The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure’s only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.
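To see why cryptarithm search resists a left-to-right derivation, a plain depth-first backtracking solver is enough. This is a generic textbook sketch, not the author's solver; the function name and the small example puzzle are chosen here for illustration.

```python
def solve_cryptarithm(addends, total):
    """Find a digit assignment such that the addends sum to the total.

    Each letter maps to a distinct digit and no word has a leading zero.
    Returns a dict of letter -> digit, or None if no assignment works.
    """
    words = addends + [total]
    letters = sorted(set("".join(words)))
    leading = {word[0] for word in words}
    assignment = {}
    used = set()

    def value(word):
        return int("".join(str(assignment[ch]) for ch in word))

    def backtrack(index):
        if index == len(letters):
            # Only a complete assignment can be checked against the sum.
            return sum(value(w) for w in addends) == value(total)
        letter = letters[index]
        for digit in range(10):
            if digit in used or (digit == 0 and letter in leading):
                continue
            assignment[letter] = digit
            used.add(digit)
            if backtrack(index + 1):
                return True
            # Undo the tentative choice and try the next digit.
            used.discard(digit)
            del assignment[letter]
        return False

    return dict(assignment) if backtrack(0) else None

solution = solve_cryptarithm(["TO", "GO"], "OUT")
```

In this simple solver a tentative digit carries no evidence of being right until the whole assignment is checked, so most of the trace consists of guesses that are later undone. That is the kind of step a forward chain-of-thought has no faithful way to write down.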

[NLP-124] The Language-Energy Divide: Measuring Energy Costs of Multilingual LLM Inference

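【Quick Read】: This paper examines the large differences in energy consumption that arise when multilingual large language models (LLMs) are deployed, focusing on how unevenly inference energy is distributed across languages. Energy per output token differs by up to 8.3x across languages, and total energy for the same set of requests differs by up to 179x between English (17.6 kJ) and Pashto (3,147 kJ). The key contribution is identifying and quantifying two compounding factors: languages written in complex or rare scripts cost more energy per token, and low-resource languages lead the model to generate more tokens. High-energy languages also tend to have lower task accuracy, a double penalty in cost and performance. The pattern holds across models, hardware, and tasks, which suggests the inequity is systemic. The authors recommend treating energy as a first-class evaluation axis, adding energy metrics to model cards and reporting checklists, and adopting deployment-side mitigations.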

Link: https://arxiv.org/abs/2606.21869
Authors: Naihao Deng,Alissa Shen,Yiming Feng,Joan Nwatu,Jae-Won Chung,Mosharaf Chowdhury,Yulong Chen,Rada Mihalcea
Affiliations: University of Michigan; University of Cambridge; University of Aberdeen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly deployed in multilingual settings, yet the energy costs of serving these models across different languages remain poorly understood. We present a systematic study of inference energy consumption across languages with this http URL framework (Chung et al., 2026). We find striking disparities: energy consumption per output token varies by up to 8.3 times across languages, while total energy for a fixed set of requests varies by up to 179 times between the cheapest (English, 17.6 kJ) and the most expensive (Pashto, 3,147 kJ) languages. Our analysis shows that this disparity is driven by two compounding factors: (1) higher per-token energy costs for languages using complex or rare scripts, and (2) more tokens generated for low-resource languages. Moreover, we find a double cost + performance penalty: languages with the highest energy footprints also tend to achieve the lowest task accuracy. We reveal that the energy divide persists across models, hardware, and tasks, suggesting a systemic energy inequity in multilingual LLM deployment. Finally, we recommend that the community treat energy as a first-class evaluation axis, extend reporting checklists and model cards to include it, and adopt deployment-side mitigations for better energy efficiency.
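
The headline ratio can be checked from the two totals quoted in the abstract. The snippet below is only a back-of-the-envelope sketch: the variable names are illustrative, and the second step assumes (the abstract does not say this) that total energy is roughly per-output-token energy times output tokens and that the 8.3x per-token bound applies to this language pair.

```python
# Totals quoted in the abstract for the same fixed set of requests, in kJ.
english_kj = 17.6
pashto_kj = 3147.0

# Ratio between the most and least expensive language.
total_ratio = pashto_kj / english_kj
print(f"total energy ratio: {total_ratio:.1f}x")

# ASSUMPTION (not from the paper): total ~= per-token energy * output tokens,
# with the per-token gap capped at the reported 8.3x. Under that assumption,
# the remaining factor must come from generating more tokens.
max_per_token_ratio = 8.3
implied_token_ratio = total_ratio / max_per_token_ratio
print(f"implied minimum token-count ratio: {implied_token_ratio:.1f}x")
```

Under these assumptions, per-token cost alone cannot account for the gap, which is consistent with the abstract's second factor (more tokens generated for low-resource languages).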

[NLP-125] ForEx: A Formal Verification Framework for Explainable Reasoning in Logical Fallacy Detection and Annotation

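【Quick Read】: Current evaluations of large language models (LLMs) on logical fallacy detection look only at predicted labels and ignore whether the explanation a model generates actually supports its prediction, which can misjudge the model's real reasoning ability. The paper proposes ForEx (Formal Verification for Explainable Reasoning), a framework that translates LLM-generated natural-language explanations into Lean4 and checks whether the translated rationale is derivable under the encoded premises; it does not assess the logical validity of the original natural-language argument. To separate prediction outcomes from the formal status of the supporting reasoning, the authors introduce the LLM Argument Verification Matrix, which decouples label consistency from formal verification status. On LOGIC-Climate, over 90% of LLM outputs can be formalized and pass verification, yet agreement with human annotations is only about 20%. This exposes a systematic gap between formal derivability and label agreement that accuracy-based metrics cannot see. ForEx thus moves LLM evaluation from label correctness toward machine-checkable analysis of formalized reasoning chains.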

Link: https://arxiv.org/abs/2606.21867
Authors: Pei-Cing Huang,Chienyu Liu,Chan Hsu,Ci-Siang Chen,Pei-Ju Lee,Yihuang Kang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: 2026 IEEE 27th International Conference on Information Reuse and Integration for Data Science

Click to view abstract

Abstract:Current evaluations of Large Language Models (LLMs) on logical fallacy detection focus on predicted labels, but do not establish whether those labels are supported by the reasoning the models provide. We propose ForEx (Formal Verification for Explainable Reasoning), a framework that translates LLM-generated explanations into Lean4 and verifies whether the translated rationale is derivable under encoded premises, not the logical validity of the original natural language argument. To distinguish prediction outcomes from the formal status of the supporting reasoning, we introduce the LLM Argument Verification Matrix, which separates label consistency from formal verification status. Experiments on LOGIC-Climate show that over 90% of LLM outputs can be translated into formal reasoning chains that pass verification, while agreement with human annotations remains around 20%. These results expose a systematic gap between formal derivability and label agreement, a distinction invisible to prediction-based metrics. ForEx moves LLM evaluation beyond label correctness toward machine-checkable analysis of formalized reasoning chains.

[NLP-126] TALAS: Teacher-Anchored Layer Alignment with Adaptive Sharpness-Aware Minimization for Embedding Distillation ACL2026

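【Quick Read】: This paper targets two problems in using Knowledge Distillation (KD) to compress large pre-trained language models: existing methods force the student to strictly mimic the teacher's sentence embeddings or internal features, which is computationally expensive, and the inherent capacity gap between student and teacher limits the gains. The proposed unified framework, TALAS (Teacher-Anchored Layer Alignment with Sharpness-aware minimization), rests on three mechanisms. First, a Teacher-Anchored mechanism distills only the teacher's final sentence embeddings into the student's upper layers, which reduces overhead and respects capacity constraints. Second, Layer-Aligned Self-Distillation propagates knowledge top-down through geometric relational constraints in the embedding space, bridging the semantic gap in lower layers. Third, Adaptive Sharpness-Aware Minimization (ASAM) steers optimization toward flat minima so the student does not overfit to teacher noise, improving generalization. On standard sentence embedding benchmarks, TALAS consistently outperforms strong distillation baselines while being more efficient to train in both compute and memory.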

Link: https://arxiv.org/abs/2606.21851
Authors: Quoc Phong Dao,Hoang Son Nguyen,Pham Khanh Chi,Linh Ngo Van,Nguyen Thi Ngoc Diep,Thien Huu Nguyen,Trung Le
Affiliations: Hanoi University of Science and Technology; VNU University of Engineering and Technology; University of Oregon; Monash University
Categories: Computation and Language (cs.CL)
Comments: ACL 2026

Click to view abstract

Abstract:Knowledge Distillation (KD) has established itself as a pivotal technique for compressing large pre-trained language models. However, existing methods that force a student to strictly mimic the teacher’s sentence embeddings or internal features often incur prohibitive computational costs and yield suboptimal performance due to the inherent capacity gap. To address these challenges, we propose TALAS (Teacher-Anchored Layer Alignment with Sharpness-aware minimization), a unified framework that synergizes hierarchical (multi-layer) alignment with robust optimization. First, we introduce a Teacher-Anchored mechanism that selectively distills final sentence embeddings only into the student’s upper layers, thereby reducing overhead while respecting capacity constraints. Second, we bridge the semantic gap in lower layers via Layer-Aligned Self-Distillation, which propagates knowledge top-down using internal geometric relational constraints in the embedding space. Finally, to prevent the student from memorizing point-wise teacher noise, we integrate Adaptive Sharpness-Aware Minimization (ASAM) into the training objective, guiding the model towards flat minima for enhanced generalization. Empirical results on standard sentence embedding benchmarks demonstrate that TALAS consistently outperforms strong distillation baselines while achieving superior training efficiency in terms of computational cost and memory footprint.
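
To illustrate the third ingredient, here is a minimal sketch of one ASAM-style update on a toy quadratic loss. It follows the commonly cited element-wise ASAM formulation, in which the perturbation is rescaled by the weight magnitudes; whether TALAS uses exactly this variant, and how it combines it with the distillation losses, is an assumption. `asam_step`, `grad_fn`, and all hyperparameter values are hypothetical names and numbers chosen for the example.

```python
import math

def asam_step(w, grad_fn, lr=0.1, rho=0.5, eta=0.01):
    """One ASAM-style step: ascend to a scale-adaptive worst-case point, then descend."""
    g = grad_fn(w)
    # Element-wise normalization operator T_w = |w| + eta (eta avoids zero scaling).
    t = [abs(wi) + eta for wi in w]
    tg = [ti * gi for ti, gi in zip(t, g)]
    norm = math.sqrt(sum(x * x for x in tg)) or 1e-12
    # Adaptive perturbation: eps = rho * T_w^2 * g / ||T_w * g||.
    eps = [rho * ti * x / norm for ti, x in zip(t, tg)]
    # Gradient at the perturbed weights drives the actual update.
    g_adv = grad_fn([wi + ei for wi, ei in zip(w, eps)])
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy objective (illustrative only): L(w) = sum(w_i^2), so grad = 2 * w.
def toy_loss(w):
    return sum(wi * wi for wi in w)

def toy_grad(w):
    return [2.0 * wi for wi in w]

w0 = [1.0, -2.0]
w1 = asam_step(w0, toy_grad)
print(toy_loss(w0), toy_loss(w1))
```

In a real distillation setup the gradient would come from the combined training objective rather than a closed-form toy loss.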

[NLP-127] Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers

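【Quick Read】: This paper addresses the memory footprint and access overhead of the key-value (KV) cache in standard attention, which takes up a large share of accelerator memory and limits throughput during large-model inference. The proposed Keyless Attention removes the key projection entirely and computes attention from queries and values only, giving a Value-Only Cache that cuts KV cache memory and access overhead by 50%. The core idea is Depth-m Attention Factorization: at m=3, a value-space routing matrix replaces the key projection, so the number of projection matrices matches standard attention while routing and retrieval become coupled. Across five models and four architectures, Keyless Attention matches or beats standard attention on perplexity in 4 of 5 models, and in zero-shot evaluation it wins on 4 of 5 commonsense reasoning benchmarks, keeping the 50% cache reduction throughout.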

Link: https://arxiv.org/abs/2606.21848
Authors: Xin Gao
Affiliations: York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures

Click to view abstract

Abstract:We propose Keyless Attention, an attention mechanism that eliminates the key projection entirely, operating over queries and values only. This yields a Value-Only Cache that reduces KV cache memory and access overhead by exactly 50% over standard attention, while matching or exceeding standard attention’s decode throughput. Beyond efficiency, we introduce Depth-m Attention Factorization: standard attention computes a depth-2 factorization of the attention bilinear form, while Keyless Attention realizes a depth-m instance of this family. At m=3, Keyless Attention matches the projection matrix count of standard attention via a value-space routing matrix that replaces the key projection and introduces a coupling between routing and retrieval. Experiments across five models and four architectures (GPT-2 280M, GPT-2 557M, Pythia 410M, Qwen2 1.5B, and Llama 3.2 1B) show that Keyless Attention matches or outperforms standard QKV attention on perplexity in 4 out of 5 models. On downstream zero-shot evaluation (GPT-2 557M), Keyless Attention outperforms on 4 out of 5 commonsense reasoning benchmarks, while achieving 50% KV cache reduction throughout.
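
The "exactly 50%" figure follows from simple cache accounting: a standard cache stores two tensors (K and V) per layer and token, while a value-only cache stores one. The sketch below works through that arithmetic. The function name and the model shape are illustrative assumptions, not values taken from the paper.

```python
def cache_bytes(layers, kv_heads, head_dim, seq_len,
                tensors_per_token, bytes_per_elem=2):
    """Bytes needed to cache `tensors_per_token` tensors per layer for one sequence."""
    return layers * kv_heads * head_dim * seq_len * tensors_per_token * bytes_per_elem

# Illustrative shape (hypothetical): 16 layers, 8 cached heads of width 64,
# a 4096-token context, 16-bit elements.
shape = dict(layers=16, kv_heads=8, head_dim=64, seq_len=4096)

kv_cache = cache_bytes(**shape, tensors_per_token=2)     # keys and values
value_only = cache_bytes(**shape, tensors_per_token=1)   # values only

print(f"KV cache:         {kv_cache / 2**20:.0f} MiB")
print(f"Value-only cache: {value_only / 2**20:.0f} MiB")
print(f"Reduction:        {1 - value_only / kv_cache:.0%}")
```

The saving is independent of the shape chosen, since every factor other than the number of cached tensors cancels in the ratio.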

[NLP-128] Inverse Turing Bench: Evaluating Language Models as Judges of Human vs. AI Dialogue

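【Quick Read】: As generative AI becomes harder to tell apart from humans in online conversation, this paper proposes Inverse Turing Bench, a benchmark that measures how well large language models (LLMs) and other models distinguish human-human dialogue from human-AI dialogue in multi-turn text. The benchmark is a set of paired transcripts: one dialogue in each pair is between two humans and the other is between a human and an AI, and the model must say which is which. In a preliminary evaluation, GPTZero, Claude Opus-4.6, and GPT-5.5 score highest, at 89.41%, 77.92%, and 75.94% accuracy respectively. The results point to clear limits in current methods: statistical detectors have semantic blind spots, while semantic approaches are susceptible to persona-prompting. The authors frame the task as a probe of LLM theory of mind and argue that human-AI differentiation should be a core capability of future AI systems.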

Link: https://arxiv.org/abs/2606.21844
Authors: William Hager,Ishika Rathi,Masum Hasan,Cameron Jones
Affiliations: University of Rochester; Stony Brook University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:As AI systems integrate into online spaces, differentiating them from humans in conversations is increasingly important. We present Inverse Turing Bench, a benchmark that evaluates LLMs and other models on their ability to differentiate humans and AI in multi-turn text. The benchmark provides a collection of paired dialogue transcripts, wherein one dialogue is between two humans and the other is between a human and an AI. The task is to correctly identify which dialogue is human-only vs. human-AI. We evaluated a preliminary set of models against this benchmark, and found that GPTZero, Claude Opus-4.6, and GPT-5.5 achieve the highest accuracy: 89.41%, 77.92%, and 75.94% respectively. Our results suggest that statistical approaches to detection have semantic blind spots, but semantic approaches are susceptible to persona-prompting. Our work speaks to the Inverse Turing Test as a probe of LLM theory of mind, and motivates human-AI differentiation as a critical capability for AI systems. Our live benchmark can be found at this https URL (anonymity preserved).

[NLP-129] Measuring What Persists: Conditioning Mechanisms and a Geometric Framework for AI Agent Identity

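【Quick Read】: AI agents in long-context applications gradually drift away from their specified identity, and existing methods detect this only after behavioral quality has visibly degraded. The paper builds a geometric framework on $\sqrt{\mathrm{JSD}}$ metric spaces (the square root of the Jensen-Shannon divergence) and magnitude homology from enriched category theory, in which identity is non-geodesic structure and drift is the relaxation of that structure toward the geodesic. The strongest empirical finding is a two-mechanism conditioning structure: cross-condition distances reveal an identity-vacuum cluster, where the identity specification fills a behavioral void, and a safety-basin cluster, where identity is displaced from post-training attractors. An equilateral probe baseline confirms that the identity specification produces measurable behavioral richness (55 unique response patterns versus 1 for the base model at maximum probe separation). A first-order perturbation theory for equilateral configurations predicts that magnitude changes depend on perimeter changes alone, because shape perturbations cancel at first order under the $S_n$ symmetry; the formula is self-consistent at the observed perturbation amplitudes. A later drift experiment showed that the observed magnitude decrease came from repetitive-padding artifacts rather than genuine context-length drift: diverse padding produces no measurable deformation through 150K tokens. The framework's full diagnostic promise, detecting anisotropic contraction and structural collapse via homological simplification, is grounded in the perturbation theory and selection rules but remains empirically unconfirmed.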

Link: https://arxiv.org/abs/2606.21843
Authors: Andrew Tanner
Affiliations: Anisotrope AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 29 pages, 6 figures, 8 tables

Click to view abstract

Abstract:AI agents in long-context applications drift from their specified identity. Current methods detect this only after qualitative degradation is visible. We present a geometric framework for measuring identity structure using $\sqrt{\mathrm{JSD}}$ metric spaces and magnitude homology from enriched category theory, where identity is non-geodesic structure and drift is its relaxation toward the geodesic. Validated on a persistent AI agent, the framework’s strongest empirical finding is a two-mechanism conditioning structure: cross-condition distances reveal an identity-vacuum cluster where the identity specification fills a behavioral void, and a safety-basin cluster where it displaces from post-training attractors. An equilateral probe baseline confirms that the identity specification creates measurable behavioral richness (55 unique response patterns vs. 1 for the base model) at maximum probe separation. A first-order perturbation theory for equilateral configurations predicts magnitude changes from perimeter changes alone, with shape perturbations first-order cancelled by the $S_n$ symmetry; the formula is self-consistent at the observed perturbation amplitudes. A drift experiment measuring magnitude decrease under context pressure was subsequently found to reflect repetitive-padding artifacts rather than genuine context-length drift; diverse padding produces no measurable deformation through 150K tokens. The magnitude homology framework’s full diagnostic promise – detecting anisotropic contraction and structural collapse via homological simplification – is architecturally grounded in the perturbation theory and selection rules but remains empirically unconfirmed.
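
For readers unfamiliar with the distance the framework is built on, the sketch below computes the square root of the Jensen-Shannon divergence between two discrete distributions, which is a true metric (it is symmetric and satisfies the triangle inequality). The function names and the example distributions are illustrative; how the paper turns agent responses into distributions is not specified here and is not modeled.

```python
import math

def kl_bits(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, so the range is [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding errors

# Hypothetical response-pattern distributions for three conditions.
a = [0.7, 0.2, 0.1]
b = [0.1, 0.3, 0.6]
c = [0.4, 0.4, 0.2]

print(js_distance(a, b), js_distance(a, c), js_distance(c, b))
```

Pairwise distances of this kind form the finite metric space on which a quantity such as magnitude can then be defined.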

[NLP-130] Local Causal Attribution of Chain-of-Thought Reasoning ICML2026

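【Quick Read】: This paper addresses the opacity of the causal structure inside a large language model's (LLM's) chain-of-thought (CoT), with the aim of improving interpretability and safety. The central question is how the individual components of a reasoning trace, called units, causally relate to one another and to the probability of generating later output. The authors propose AttriCoT, a black-box method that builds a structural causal model (SCM) over the units and estimates each unit's importance parameter using only $O(U)$ forward passes, where $U$ is the number of units. Perturbation-curve evaluations on 5 datasets and 4 reasoning models show that its attributions are more faithful to model behavior than alternative methods, and the results also reveal notable differences in thought structure across models and domains.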

Link: https://arxiv.org/abs/2606.21821
Authors: Dennis Wei,Yannis Belkhiter,Erik Miehling,Radu Marinescu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Camera-ready version for the Mechanistic Interpretability Workshop at ICML 2026. 37 pages, 18 figures

Click to view abstract

Abstract:Understanding the causal structure of a language model’s thought process is a problem of significant importance for both transparency and safety. In this work, we take a local approach toward this goal by analyzing the causal relationships among individual components, termed units, of a given, specific chain-of-thought trace. We construct a structural causal model on these units and relate each unit to the log probability of generating (subsequent) output units. Our algorithm, termed AttriCoT, is a black-box method that performs attribution by estimating importance parameters in the structural causal model using O(U) forward passes through the model, where U is the number of units. Evaluation of perturbation curves across 5 datasets and 4 reasoning models shows that AttriCoT produces attributions that are more faithful to the model’s behavior than alternative methods. The attribution results also reveal notable differences in thought structure between models and domains.
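
To make the O(U) cost concrete, the sketch below uses plain leave-one-out ablation over units. This is a simpler stand-in, not AttriCoT's structural-causal-model estimator: it only shows how a black-box attribution can get by with a number of scoring calls linear in U. `log_prob` is a hypothetical callable standing in for a forward pass that returns the log probability of the target output given a list of units.

```python
def leave_one_out(units, log_prob):
    """Score each unit by the drop in log probability when it is removed.

    Makes U + 1 calls to `log_prob` for U units.
    """
    base = log_prob(units)
    scores = []
    for i in range(len(units)):
        ablated = units[:i] + units[i + 1:]
        scores.append(base - log_prob(ablated))
    return scores

# Toy scorer (illustrative): the answer is likely only if the key step is present.
calls = {"n": 0}

def toy_log_prob(units):
    calls["n"] += 1
    return 0.0 if "2 + 2 = 4" in units else -3.0

trace = ["restate the question", "2 + 2 = 4", "so the answer is 4"]
scores = leave_one_out(trace, toy_log_prob)
print(scores, "calls:", calls["n"])
```

Leave-one-out ignores interactions between units, which is one reason a structured causal model over the units can give more faithful attributions.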

[NLP-131] Generating Public Health Responses using Survey-Augmented Large Language Models

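【Quick Read】: Epidemiological models rely on repeated large-scale surveys to capture individual health decisions (such as whether to vaccinate or adopt protective behaviors), but such surveys are costly, slow, and cover a limited range of scenarios. This paper asks whether large language models (LLMs) can generate synthetic survey responses with realistic population characteristics to supplement or replace survey data. The approach first uses clustering to identify groups with broadly positive or negative attitudes toward vaccination, then uses cluster-informed prompting to have LLMs generate synthetic responses across multiple epidemic waves. Overall, the synthetic data reproduce the univariate distributions of demographics, vaccination beliefs, risk perceptions, and health behaviors reasonably well, but are weaker at capturing how these factors vary together within individuals. Some models simulate group-level vaccination trends more reliably than others, though performance varies across waves. A trained classifier can still distinguish real from synthetic records, so the generated data carry identifiable synthetic traces. LLM-generated survey data may therefore serve as a tool for exploratory data augmentation in support of agent-based epidemic modeling, but they are not yet a substitute for real survey data without further methodological improvement and validation.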

Link: https://arxiv.org/abs/2606.21820
Authors: Leonardo Marciaga,Thuyen Pham,Julia Rezvani,Alina Hyk,Chunyang Liao,Konstantinos Mitsopoulos,Raffaele Vardavas
Affiliations: Illinois Institute of Technology; University of Massachusetts Amherst; Portland State University; Oregon State University; University of California Los Angeles; Johns Hopkins University; Causal Paths Analytics LLC
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 24 pages, 6 figures

Click to view abstract

Abstract:Epidemiological models often rely on survey data to represent how individuals make health-related decisions, such as whether to vaccinate or adopt protective behaviors. However, repeated large-scale surveys are costly, time-consuming, and limited in the range of scenarios they can capture. In this work, we investigate whether large language models (LLMs) can generate synthetic survey responses that reproduce patterns observed in real populations. Using longitudinal data from the FluPaths surveys, we first identify groups associated with broadly positive or negative attitudes toward vaccination through clustering analysis. We then evaluate several LLMs using a cluster-informed prompting approach to generate synthetic survey responses across multiple epidemic waves. Across models, the synthetic data generally reproduce the distributions of demographic characteristics, vaccination-related beliefs, risk perceptions, and health behaviors observed in the survey data. However, they are less successful at capturing how these factors vary together within respondents. Some models reproduce group-level vaccination trends more reliably than others, although performance varies across waves. We also trained a classifier to distinguish real from synthetic records and found that the generated responses remained identifiable as synthetic. Overall, our findings suggest that LLM-generated survey data may provide a useful tool for exploratory data augmentation and we hope that it could support agent-based epidemic modeling approaches. However, the generated data should not be treated as a substitute for human survey data without further methodological improvements and validation.

[NLP-132] Fixed RAG Compression Collapses Measured Reader Scaling

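【Quick Read】: This paper identifies a bias in how Retrieval-Augmented Generation (RAG) compression is evaluated: studies typically test a compressor on a handful of readers and assume the compressed evidence layer is evaluation-neutral. That assumption does not hold. A fixed compression strategy can raise average accuracy while hiding the gains from upgrading the reader and even reversing model rankings. Across 20 readers and ten domain-method settings over four QA benchmarks and one summarization benchmark, compression gain decreases as reader baseline rises (significant in nine of ten settings, p < 0.05). Generic summarization flips 31% of pairwise model rankings on LongMemEval-S, and a fixed HotpotQA compressor hides 80% of the raw gain from upgrading Qwen 7B to GPT-4.1-mini. Two opposing forces explain this: compression helps weak readers by removing noise they cannot filter, and hurts strong readers by dropping details they would have used. The pattern appears across structured compilation, generic summarization, three families of trained compressors, query-focused summarization, and an external audit of nine published compression papers. The authors release ragscale, a toolkit built on 177,000 row-level compression transitions, so that any compression study can audit reader scaling with three readers in one day.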

Link: https://arxiv.org/abs/2606.21807
Authors: Sugam Panthi,Rabab Abdelfattah
Affiliations: The University of Southern Mississippi
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) compression papers often evaluate a compressor on one to three readers and treat the compressed evidence layer as evaluation-neutral. We show this assumption is false: fixed compression can raise average accuracy while hiding reader upgrades and reversing model rankings. Across 20 readers and ten domain-method settings over four QA benchmarks and one summarization benchmark, compression gain decreases with reader baseline (nine of ten settings significant, p < 0.05). Generic summarization flips 31% of pairwise model rankings on LongMemEval-S, and a fixed HotpotQA compressor hides 80% of the raw upgrade from Qwen 7B to GPT-4.1-mini. Two opposing forces explain this paradox: compression rescues weak readers by removing noise they cannot filter, and harms strong readers by dropping details they would have used. The pattern appears across structured compilation, generic summarization, three trained compressor families, query-focused summarization, and an external audit of nine published compression papers. We release ragscale, a toolkit built on 177,000 row-level compression transitions, so any compression paper can audit reader scaling with three readers in one day.
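
The two headline statistics can be illustrated with a small sketch. The definitions below are assumptions made for the example (the abstract does not define either metric, and this is not the ragscale API): a "flip" is a reader pair whose accuracy ordering reverses under compression, and the "hidden" fraction is the share of a raw upgrade that disappears once both readers see compressed evidence. All reader names and scores are made up.

```python
from itertools import combinations

def pairwise_flip_rate(raw, compressed):
    """Fraction of reader pairs whose accuracy ordering reverses under compression."""
    flips = total = 0
    for a, b in combinations(sorted(raw), 2):
        d_raw = raw[a] - raw[b]
        d_cmp = compressed[a] - compressed[b]
        if d_raw == 0 or d_cmp == 0:
            continue  # ties carry no ordering to flip
        total += 1
        if (d_raw > 0) != (d_cmp > 0):
            flips += 1
    return flips / total if total else 0.0

def hidden_upgrade_fraction(raw_gain, compressed_gain):
    """Share of the raw weak-to-strong gain that is no longer visible after compression."""
    return 1.0 - compressed_gain / raw_gain

# Hypothetical accuracies for three readers on raw vs. compressed evidence.
raw = {"weak": 0.40, "mid": 0.55, "strong": 0.70}
compressed = {"weak": 0.58, "mid": 0.60, "strong": 0.59}

print(pairwise_flip_rate(raw, compressed))
print(hidden_upgrade_fraction(raw_gain=0.25, compressed_gain=0.05))
```

In the toy numbers, compression lifts the weak reader and slightly lowers the strong one, which is the pattern the abstract attributes to the two opposing forces.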

[NLP-133] Is Agent Code Less Maintainable Than Human Code?

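【Quick Read】: This paper studies the maintainability of code written by AI coding agents, and in particular the compounding problems that may arise when later agents build on code produced by an earlier agent. Coding agents perform well on single-issue tasks, but how maintainable their code is over longer-running development has not been adequately evaluated. The authors propose CodeThread, a framework that constructs controlled experiments from repository-level coding benchmarks to compare human code and agent code in maintenance settings. Agents resolve fewer follow-up tasks when building on agent code than on human code, with resolve rate dropping by up to 13.1%, and many traditional maintainability metrics do not explain the gap. The clearest signals are subtler behavioral differences, such as changes to input validation and error handling, along with differences in downstream code size and task difficulty. The findings suggest that coding agents should be evaluated on code maintainability as well as immediate task resolution, and they point to possible sources of downstream errors introduced by agent code.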

Link: https://arxiv.org/abs/2606.21804
Authors: Shaswat Patel,Betty Li Hou,Arun Purohit,Kai Xu,Jane Pan,He He,Valerie Chen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Maintainability is a core dimension of software engineering, shaping how code is written, reviewed, and developed over time. While coding agents have demonstrated strong performance on single-issue tasks, it remains unclear how maintainable their code is when future agents build on top of it, potentially leading to compounding downstream effects. We investigate how agent code compares to human code in these maintenance settings, presenting CodeThread, a framework to construct controlled experiments from repository-level coding benchmarks. Applying CodeThread to four frontier coding agents and four benchmarks, we find that agents are less effective at resolving tasks when building on agent code compared to human code, with task resolve rate drops of up to 13.1%. Regression analysis reveals that many traditional software engineering maintainability metrics do not explain this difference. Instead, the clearest signals are subtler behavioral differences in agent code, such as changes to input validation and error handling, along with differences in downstream code size and task difficulty. These findings highlight the need to evaluate these systems not only by immediate task resolution but also by code maintainability, and point to potential sources of downstream errors introduced by agent code.

[NLP-134] Test-Time Training with Next-Token Prediction

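【Quick Read】: This paper asks how to design the inner-loop objective for test-time training (TTT) in pretrained long-context language models. Recent in-place methods make fast-weight adaptation possible for released checkpoints without redesigning the backbone, but leave a central question open: what should each test-time write store? Existing recipes supervise the fast weight with a learned local value proxy that is not directly tied to the self-supervised next-token prediction (NTP) signal. The proposed TTT-NTP instead supervises each local write with the model's own next contextual hidden state, so that fast-weight updates follow the same causal computation that supports NTP: the value target is a pointwise linear projection of a single next-position contextual state. On RULER Full-13, TTT-NTP is the only method that consistently improves the released backbone across four models spanning three families and a 0.6-8B size range; on the real-world long-document QA benchmark LongBench-v2 it also improves over the base model while preserving commonsense and knowledge performance.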

Link: https://arxiv.org/abs/2606.21803
Authors: Xuan Ouyang,Zefan Cai,Junjie Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 2 figures, 7 tables. Preprint

Click to view abstract

Abstract:Next-token prediction is the self-supervised signal that trains language models, and every observed prompt token provides the same signal at test time. We study whether this signal can define the inner-loop objective for test-time training (TTT) in pretrained long-context language models. Many TTT architectures require models to be trained with test-time adaptation in mind, limiting their direct applicability to released LLM checkpoints. While recent in-place TTT methods make fast-weight adaptation possible for pretrained LLMs without redesigning the backbone, they leave a central question unresolved: what should each test-time write store? Existing recipes train the fast weight to match a learned local value proxy but they are not directly tied to the self-supervised next-token prediction signal. We introduce Test-Time Training with Next-Token Prediction (TTT-NTP), a drop-in fast-weight adaptation method for pretrained LLMs that instead supervises updates using the model’s own next contextual hidden state. This makes each local write follow the same causal computation that supports next-token prediction: the value target is a pointwise linear projection of a single next-position contextual state. On RULER Full-13 (averaged over 4k, 8k, 16k, and 32k context lengths), TTT-NTP is the only method that consistently improves the released backbone across four models spanning three families and a 0.6–8B size range: Llama-3.1-8B (+3.9), Mistral-7B-v0.3 (+3.0), and the Qwen3 series (Qwen3-4B +4.1, Qwen3-0.6B +2.9). On the real-world LongBench-v2 long-document QA benchmark, TTT-NTP improves over the base model on both Llama-3.1-8B (+5.6) and Mistral-7B-v0.3 (+3.7), while preserving commonsense and knowledge performance.

[NLP-135] When to Plan When to Polish: Noise Level as a Granularity Axis for Diffusion Language Models

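【Quick Read】: Standard tokenwise diffusion language models keep both training corruption and inference commitment at token granularity throughout denoising. At high noise this leaves scattered local fragments rather than a coherent initial skeleton, which is exactly what planning-sensitive generation needs. Hierarchical planning methods add coarse stages to separate planning from wording, but they usually require extra planners, block latents, or two-stage designs. This paper proposes Noise Dependent Granularity Control (NDGC), a single-level diffusion method that uses the noise level as a granularity cue: high-noise steps operate on coherent token groups to support early commitment to meaning, and low-noise steps return to token-level refinement. Training exposure and inference commitment are thereby aligned with denoising progress, giving planning-like coarse-to-fine denoising without an explicit planner or hierarchical architecture. Across controlled tests, ablations, and WritingPrompts, NDGC shows earlier skeleton formation, better-ordered recovery, and healthier outputs.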

Link: https://arxiv.org/abs/2606.21802
Authors: Peihong Li,Yuanjie Shi,Yan Yan
Affiliations: Washington State University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Standard tokenwise diffusion LMs keep training corruption and inference commitment at token granularity throughout denoising. At high noise, this leaves scattered local fragments rather than coherent evidence, making it hard to form early coarse structure, exactly what planning-sensitive generation requires. Hierarchical planning methods add coarse stages to separate planning from wording, but they need extra planners, block latents, or two stage designs. We propose Noise Dependent Granularity Control (NDGC), a single-level diffusion method that uses the noise level as a granularity cue. NDGC aligns training exposure and inference commitment with denoising progress. High noise steps use coherent token groups to support early meaning commitment, while low noise steps return to token level refinement. This creates planning like coarse to fine denoising without an explicit planner or hierarchical architecture. Across controlled tests, ablations, and WritingPrompts, NDGC shows earlier skeleton formation, better ordered recovery, and healthier outputs.

[NLP-136] CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

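【Quick Read】: In knowledge-intensive question answering, large language model (LLM) agents act without reliable knowledge of whether their current answer is uncertain, unsupported, or already complete. This leads to two failure modes: committing to confident but unsupported answers, which hurts accuracy, and continuing to retrieve when the evidence in hand already suffices, which wastes compute. The paper introduces Calibrated Verifier Telemetry (CalVerT), which augments the agent's state with two signals, a calibrated self-confidence score and a grounding verifier score, so the agent has a fuller picture of the state it is in. CalVerT helps in both training-free and training-based settings: on four QA benchmarks it raises F1 by triggering retrieval where agents over-rely on parametric knowledge and by cutting redundant retrieval where the context already supports the answer. It can be added to existing QA frameworks without training, and an agent trained with reinforcement learning on the telemetry-augmented state outperforms an identically trained agent without it.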

Link: https://arxiv.org/abs/2606.21777
Authors: Ashwin Vinod,Ying Ding,Elias Stengel-Eskin
Affiliations: The University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code: this https URL

Click to view abstract

Abstract:LLM agents in knowledge intensive question answering take retrieval and reasoning actions with incomplete knowledge about whether their current answer is uncertain, unsupported, or already complete. This produces two failure modes: committing to confident but unsupported answers, which hurts accuracy, and over-retrieving when the evidence in hand already suffices, resulting in wasted compute. To give agents a more complete picture of the state space they are operating in, we introduce calibrated verifier telemetry (CalVerT), which augments the agent’s state with additional telemetry: a calibrated self-confidence score and a grounding verifier score. We show that CalVerT can improve agents in both training-free and training-based settings. On four QA benchmarks, we find that CalVerT raises F1 by triggering retrieval in cases where agents over-rely on parametric knowledge, while cutting redundant retrieval in cases where agents have sufficient context to answer. We show that CalVerT can augment existing QA frameworks without training. Moreover, CalVerT also improves trained systems: by simply augmenting an agent’s state with telemetry, we observe improvements after reinforcement learning, as compared to an agent with identical training but no CalVerT telemetry.

[NLP-137] Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning

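【Quick Read】: Large language models often produce fluent but incorrect multi-step reasoning, and naive correction methods risk damaging answers that were already correct. The paper proposes Denoising Iterative Self-Correction (DISC), a test-time procedure that treats the outputs of verification questions as noisy measurements of where a solution may be corrupted and reduces errors over multiple verify-judge-correct passes, by analogy with iterative denoising. A binary judgment gate controls correction precision by blocking rewrites that would damage already-correct answers, while the verifier and corrector together repair errors. The trade-off is measured with two paired diagnostics: an improvement-to-degradation ratio (precision) and a repair rate (recall). Across three benchmarks (BIG-Bench Mistake, HotpotQA, GPQA Diamond) and four models, DISC outperforms Chain-of-Verification and Self-Refine, reaching 81.6% accuracy on BIG-Bench Mistake (Sonnet 4.5) with 13x more improvements per degradation than Chain-of-Verification and 5x more than Self-Refine. Assigning verification and judgment to a model different from the generator mitigates self-confirmation bias.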

Link: https://arxiv.org/abs/2606.21724
Authors: Shen Yin,David Ken,Joel Stremmel
Affiliations: Thomson Reuters Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models produce fluent but often incorrect multi-step reasoning, and naive correction methods risk degrading already-correct answers. We introduce Denoising Iterative Self-Correction (DISC), a test-time procedure that treats verification question outputs as noisy measurements of where a solution may be corrupted. Using these signals, DISC progressively reduces errors across multiple verify-judge-correct passes, analogous to traditional iterative denoising. A binary judgment gate controls correction precision by blocking rewrites that would damage already-correct answers while the verifier and corrector together repair errors. We evaluate this trade-off using two paired diagnostics: an improvement-to-degradation ratio (precision) and a repair rate (recall). Across three benchmarks (BIG-Bench Mistake, HotpotQA, GPQA Diamond) and four models, DISC dominates Chain-of-Verification and Self-Refine on the precision-recall trade-off, reaching 81.6% accuracy with 13x more improvements per degradation than Chain-of-Verification and 5x more than Self-Refine on BIG-Bench Mistake (Sonnet 4.5). On GPQA Diamond, we identify a capability floor below which judges acknowledge contradictions in evidence but cannot translate that recognition into a correction. We further show that cross-model role allocation – assigning verification and judgment to a model different from the generator – mitigates self-confirmation bias.
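
The control flow of a verify-judge-correct loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `disc`, `verify`, `judge`, and `correct` are hypothetical names, and in practice each role would be an LLM call with its own prompt. The toy stand-ins below compare a list of step results against a known key only so that the example runs.

```python
def disc(answer, verify, judge, correct, max_passes=3):
    """Run up to `max_passes` verify-judge-correct passes over an answer."""
    for _ in range(max_passes):
        findings = verify(answer)           # noisy hints about where errors may be
        if not judge(answer, findings):     # binary gate: leave the answer alone
            break
        answer = correct(answer, findings)  # targeted rewrite guided by the findings
    return answer

# Toy stand-ins (illustrative only): an "answer" is a list of step results.
KEY = [2, 4, 6]

def toy_verify(answer):
    return [i for i, (a, k) in enumerate(zip(answer, KEY)) if a != k]

def toy_judge(answer, findings):
    return bool(findings)

def toy_correct(answer, findings):
    fixed = list(answer)
    fixed[findings[0]] = KEY[findings[0]]  # repair one flagged step per pass
    return fixed

print(disc([2, 5, 7], toy_verify, toy_judge, toy_correct))
print(disc([2, 4, 6], toy_verify, toy_judge, toy_correct))
```

The gate is what protects already-correct answers: when it returns false the corrector is never called, so a correct answer passes through unchanged.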

[NLP-138] Leveraging LaBSE with Progressive Curriculum Learning for Multicultural Polarization ACL2026 SEMEVAL

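【Quick Read】: This paper addresses the detection of online polarization in multilingual, multicultural settings, where data scarcity in low-resource languages makes the task especially hard. The key idea is to use LaBSE (Language-agnostic BERT Sentence Embeddings), an unconventional choice normally reserved for retrieval tasks, to obtain strong cross-lingual learning, which improves macro F1 in low-resource languages by up to 0.2. The authors also provide a comprehensive ablation study of different encoder models from the Qwen family within a retrieval-based prompting framework.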

Link: https://arxiv.org/abs/2606.21718
Authors: Sachin Sundar,Sandeep Kumar,Mothish M
Affiliations: Indian Institute of Technology, Kharagpur; Indian Institute of Technology, Madras
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at Semeval, ACL 2026

Click to view abstract

Abstract:Detecting online polarization remains a critical challenge, particularly in multilingual and multicultural contexts where intergroup hostility is prevalent. The problem is particularly challenging due to the data scarcity for these tasks in the low-resource languages. Identifying such phenomena has become an active area of research and is addressed in SemEval-2026 Task 9: Multilingual, Multicultural Online Polarization Detection. To address this problem we propose an architecture that leverages LaBSE embeddings - an unconventional choice typically reserved for retrieval tasks, to obtain strong cross-lingual learning which enhances scores in low-resource language by a score up to 0.2 macro F1. Furthermore, we provide a comprehensive ablation study evaluating the performance of diverse encoder models in the Qwen model family within a retrieval-based prompting framework. Our code will be soon available at this https URL.

[NLP-139] When Compression Helps and When It Hurts: Condition-Aware Analysis of Chain-of-Thought Distillation

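[Quick Read]: This paper targets the high training and inference cost in Chain-of-Thought (CoT) distillation caused by verbose reasoning traces from teacher models. Existing methods fall into two families, selective pruning and generative rewriting, but prior studies left key factors entangled: granularity is confounded with the importance criterion in pruning, the degree of restructuring in rewriting is rarely analyzed in isolation, and compression budgets are not systematically evaluated across domains and reasoning regimes. The paper therefore recasts CoT compression along three dimensions (importance criterion, restructuring level, and compression budget) and runs systematic experiments across Math and General domains and Long-/Short-CoT regimes. The findings are: (i) the usefulness of an importance criterion depends strictly on granularity: step-level criteria converge on a shared reasoning backbone, while token-level pruning needs symbol-aware signals to preserve the logical core; (ii) restructuring level shows opposite trends across domains: Math is sensitive to structural disruption and degrades monotonically with the degree of restructuring, whereas aggressive rewriting acts as a denoiser on General tasks; (iii) training-time compression does not necessarily yield inference-time savings: Long-CoT students keep verbose habits even under concise supervision, so the training compression ratio is only an optimistic lower bound on deployment cost. These findings give condition-aware guidance for matching a compression strategy to the deployment scenario.

Link: https://arxiv.org/abs/2606.21704
Authors: Siyang Lyu, Zhijing Sun, Xinghao Chen, Tong Liu, Dawei Zhu, Xiaoyu Shen
Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Viterbi School of Engineering, University of Southern California; The Hong Kong Polytechnic University; LMU Munich; Saarland University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract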

Abstract:Chain-of-Thought (CoT) distillation transfers multi-step reasoning from large reasoning models to smaller students, but verbose teacher traces inflate both training and inference cost. Existing CoT compression methods fall into two families, selective pruning and generative rewriting, yet prior studies have left key factors entangled: granularity is confounded with importance criteria in pruning, restructuring level is rarely isolated in rewriting, and compression budgets are not systematically evaluated across domains or regimes. We recast CoT compression along three dimensions: importance criterion, restructuring level, and compression budget. Sweeping these across two model families, Math and General domains, and Long-/Short-CoT regimes, we find that (i) importance criterion utility is strictly governed by granularity: step-level criteria converge on a shared reasoning backbone, while token-level pruning requires symbol-aware signals to preserve the logical core; (ii) restructuring level inverts across domains: Math degrades monotonically with structural disruption, while aggressive rewriting acts as a denoiser on General tasks; (iii) training-time compression does not necessarily translate to inference-time savings: Long-CoT students retain verbose habits despite concise supervision, making the training ratio an optimistic lower bound on deployment cost. These findings yield condition-aware guidelines for matching compression to deployment context.
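
To make the interplay of importance criterion and compression budget concrete, here is a minimal sketch of step-level selective pruning. It assumes that importance scores are already given per reasoning step and that tokens can be approximated by whitespace-separated words; both are simplifications for illustration, and the function name is hypothetical rather than taken from the paper.

```python
def prune_steps(steps, scores, budget_ratio):
    """Keep the highest-scoring reasoning steps that fit a token budget.

    steps: list of step strings; scores: one importance score per step
    (assumed to come from some external criterion); budget_ratio: fraction
    of the original token count that may be kept.
    """
    # Approximate token counts by whitespace splitting (an assumption).
    token_counts = [len(step.split()) for step in steps]
    budget = budget_ratio * sum(token_counts)

    # Visit steps from most to least important, keeping those that still fit.
    by_importance = sorted(range(len(steps)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in by_importance:
        if used + token_counts[i] <= budget:
            kept.add(i)
            used += token_counts[i]

    # Emit surviving steps in their original order so the trace stays readable.
    return [steps[i] for i in sorted(kept)]


trace = ["Let x be 3", "Note that the sky is blue today", "So x squared is 9"]
print(prune_steps(trace, scores=[0.9, 0.1, 0.8], budget_ratio=0.6))
```

Token-level pruning would apply the same budget logic to individual tokens, which is where the abstract reports that symbol-aware signals become necessary.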

[NLP-140] A Hybrid Multi-Layered Pipeline for Phishing and Threat Classification: Independently Validated URL and NLP Engines with a Calibrated Multi-Channel Fusion Stage

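[Quick Read]: This paper addresses multi-modal threat detection for phishing, that is, how to fuse heterogeneous signals from different modalities (URL, email headers, body text) to improve overall detection. The core solution is a staged hybrid pipeline. First, a dedicated engine is built for each modality: a four-stage URL analysis stack (Domain Guard, lexical model, threat intelligence, and an asymmetric L2 fusion sidecar); a generalization-hardened DistilBERT NLP classifier whose recall on held-out real phishing rises from 0.8% to 87.3%; and a threat-intelligence synchronizer with end-to-end OpenTelemetry observability that confirms 1:1 message conservation. Then a decision-level fusion stage, evaluated on a whole-system benchmark of 10,677 emails, reaches F1 = 0.914 using a calibrated probabilistic-OR over the URL, header, and phishing-probability channels, while holding the false-positive rate on held-out real spam to 3.6%. The authors note that the binding constraint for deployable detection is generalization to out-of-distribution data rather than same-distribution accuracy.

Link: https://arxiv.org/abs/2606.21690
Authors: Saifelden M. Ismail, Aser O. Ibrahim, Omar A. Mahmoud
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Graduation project, Zewail City of Science and Technology. Code and documentation: this https URL. Whole-system fusion results use proxy URL and header channels; treat integrated metrics as preliminary

Click to view abstract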

Abstract:Phishing is a multi-modal threat. We present a hybrid pipeline that scores each modality with its own engine and fuses the results. Three engines are built, deployed, and independently benchmarked: a four-stage URL stack (Domain Guard, lexical model, threat intelligence, and an asymmetric L2 fusion sidecar); a generalization-hardened DistilBERT NLP classifier whose held-out real-phishing recall rises from 0.8% to 87.3%; and a threat-intelligence synchronizer with end-to-end OpenTelemetry instrumentation confirming 1:1 message conservation. A decision-level fusion stage, characterized on a 10,677-email whole-system benchmark, reaches F1 = 0.914 with a calibrated probabilistic-OR over URL, header, and phishing-probability channels while cutting held-out real-spam false positives to 3.6%. Because that benchmark uses proxy URL and header channels and an operating point still needing recalibration, we present it as a preliminary integrated result. The binding constraint for deployable detection is generalization rather than same-distribution accuracy.
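
The probabilistic-OR fusion mentioned in the abstract can be sketched as follows. This is the textbook noisy-OR combination, which treats the channels as independent; the calibration step the authors describe is not reproduced here, and the channel values and function name are made up for illustration.

```python
def probabilistic_or(channel_probs):
    """Combine per-channel threat probabilities with a probabilistic OR.

    Returns 1 - prod(1 - p_i): the probability that at least one channel
    fires, under the simplifying assumption that channels are independent.
    """
    prob_no_channel_fires = 1.0
    for p in channel_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        prob_no_channel_fires *= 1.0 - p
    return 1.0 - prob_no_channel_fires


# Hypothetical scores for the URL, header, and phishing-probability channels.
fused = probabilistic_or([0.2, 0.1, 0.6])
print(round(fused, 3))
```

Because the fused score is never lower than the largest single channel, this rule favors recall; in practice the decision threshold would then need calibrating to keep false positives in check, which is consistent with the abstract describing the operating point as still needing recalibration.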

[NLP-141] Clinical Term Extraction using Open-Source Small Language Models

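[Quick Read]: This paper addresses the problem that, in amyotrophic lateral sclerosis (ALS) care, much key information is scattered across unstructured clinical notes, which hinders downstream analysis. The core challenge is detecting 17 categories of ALS-relevant clinical terms (functional scores, respiratory measures, medications, and related clinical and non-clinical attributes) in unstructured text. The approach uses few-shot prompting without task-specific training data: 26 open-source small language models (SLMs) are run on content normalized from JSON-encoded discharge summaries, with a prompt template that produces structured JSON outputs. In manual validation, a regex rule baseline had higher overall micro-F1 and lower Hamming loss than any single SLM or the TF-IDF baseline, while Qwen3-4B-Instruct-2507 was the best SLM by micro-F1. Performance varied by metric and label category: some SLMs showed higher precision but lower recall, the TF-IDF baseline showed high recall but low precision, and Hammer2.1-7b performed strongly on ALSFRS-R subscore detection. These results support targeted hybrid extraction workflows rather than replacing existing rule-based methods.

Link: https://arxiv.org/abs/2606.21689
Authors: Noah Marchal, William E. Janes, Mihail Popescu, Xing Song
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract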

Abstract:Clinical information for amyotrophic lateral sclerosis (ALS) care documented in unstructured clinical notes limits downstream analysis without extraction into structured formats. Open-source small language models with few-shot prompting for detecting the presence of ALS-relevant clinical terms in patient documentation were evaluated without task-specific training data. The detection task targeted 17 categories spanning functional scores, respiratory measures, medications, and related clinical and non-clinical attributes. Clinical note content was normalized from JSON-encoded discharge summaries and processed with a prompt template having structured JSON outputs. We compared 26 open-source models using aggregate, label-level, and manual-validation multilabel classification metrics. Manual validation showed that a regex rule baseline had higher overall micro-F1 and lower Hamming loss than any single SLM or TF-IDF baseline, while Qwen3-4B-Instruct-2507 was the highest-performing SLM by micro-F1. Model rankings varied by metric and label category, with the TF-IDF baseline showing high recall but low precision, some SLMs showing higher precision but lower recall, and Hammer2.1-7b showing strong performance for ALSFRS-R subscore detection. These findings support targeted hybrid extraction workflows rather than replacement of existing rule-based methods.
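
The two headline metrics in the abstract, micro-F1 and Hamming loss, are standard for multilabel classification. The sketch below computes both from binary indicator matrices using their usual definitions; the tiny label matrices are invented for illustration and do not come from the paper.

```python
def micro_f1_and_hamming(y_true, y_pred):
    """Compute micro-averaged F1 and Hamming loss for multilabel predictions.

    y_true, y_pred: lists of rows, each row a list of 0/1 label indicators
    (one column per category, e.g. 17 columns in the paper's setting).
    """
    tp = fp = fn = mismatches = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            if t != p:
                mismatches += 1

    # Micro-F1 pools counts over all labels before computing F1.
    denom = 2 * tp + fp + fn
    micro_f1 = 2 * tp / denom if denom else 0.0
    # Hamming loss is the fraction of individual label decisions that are wrong.
    hamming = mismatches / total if total else 0.0
    return micro_f1, hamming


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
print(micro_f1_and_hamming(y_true, y_pred))
```

Micro-F1 ignores true negatives while Hamming loss counts every label decision, which is one reason model rankings can differ by metric, as the abstract reports.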

[NLP-142] TACO: Task-Aware Column Description Generation Using LLMs

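[Quick Read]: This paper addresses missing or cryptic column descriptions in real-world tabular data, a problem that arises in enterprises, domain sciences, and government data portals, where abbreviated or jargon-heavy column names make downstream NLP tasks (NL2SQL, table question answering, entity linking) hard to carry out. Existing approaches mostly rely on single-prompt large language models (LLMs) and have three weaknesses: (i) inconsistent or incorrect handling of abbreviations; (ii) hallucinated or incomplete descriptions; (iii) redundant or vague descriptions that hurt downstream performance. The paper proposes TACO, a task-aware framework for automatic column description generation built around a three-step pipeline: (1) abbreviation expansion, which standardizes column names; (2) description generation, which produces initial semantic descriptions enriched with synonyms and search-oriented keywords; and (3) description revision, which refines the outputs using simulated downstream tasks. The study also explores human-in-the-loop extensions and releases new evaluation datasets for entity linking and schema enrichment. Extensive experiments on public and proprietary datasets show that TACO consistently outperforms existing methods, improving downstream task performance by up to 32%.

Link: https://arxiv.org/abs/2606.21685
Authors: Ting Cai, Rakesh R. Menon, Yiru Chen, Zifan Liu, Yuan Tian, Fei Wu, Anudeep Chimakurthi, Prashanthi Ramamurthy, Sunav Choudhary, Kun Qian, Yunyao Li
Affiliations: University of Wisconsin-Madison; Purdue University; Adobe
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 15 pages, 11 figures, 9 tables

Click to view abstract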

Abstract:Generating accurate and informative column descriptions (e.g. “membership status of customers” for the column name “cust_mem”) is essential for a wide range of downstream NLP tasks on tabular data, including NL2SQL, table question answering, and entity linking. This problem arises in enterprises, domain sciences, government data portals, and so on. Despite its importance, most real-world datasets suffer from missing or cryptic documentation, often due to abbreviated column names or domain-specific jargon. Existing approaches largely rely on single-prompt large language models (LLMs), which struggle with three key issues: (i) inconsistent or incorrect handling of abbreviations, (ii) hallucinated or incomplete descriptions, and (iii) redundancy or vagueness that hinders downstream performance. We present TACO, a task-aware framework for automatic column description generation using LLMs. TACO introduces a three-step pipeline: (1) abbreviation expansion, which standardizes column names; (2) description generation, which produces initial semantic descriptions enriched with synonyms and search-oriented keywords; and (3) description revision, which refines these outputs using simulated downstream tasks. In addition, we investigate human-in-the-loop extensions and release new evaluation datasets for entity linking and schema enrichment. Extensive experiments across public and proprietary datasets show that TACO consistently outperforms existing methods, improving downstream task performance by up to 32%.

[NLP-143] Decodable but Not Faithful: Coupling Natural-Language Rationales to Programmatic Verifiers ICML2026

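[Quick Read]: This paper addresses the faithfulness problem in rationales produced by generative language models: explanations may look plausible without reflecting the model's internal decision process. The proposed framework, verifier-coupled reasoning, inserts inline claims into reasoning traces and trains an auxiliary consistency head to predict programmatic verifier outputs from the hidden states of rationale spans, so that verifiable information is encoded in the rationale representations. The consistency loss reliably makes verifier information decodable, but decodability does not guarantee faithful explanations: in a code setting the model produces fluent prose with correct structured claims that nevertheless describes unrelated algorithms. Experiments show that this gap is not caused by insufficient model capacity, and activation patching confirms that the verifier signal has a causal influence on reasoning. Consistency losses are therefore effective diagnostics and representation-shaping tools, but they do not by themselves ensure faithful reasoning.

Link: https://arxiv.org/abs/2606.21678
Authors: Vatsal Ananthula, Adarsh Kumarappan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to the ICML 2026 AI4Math Workshop as a poster

Click to view abstract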

Abstract:Language models can generate plausible rationales for their predictions, but these explanations may not faithfully represent the model’s internal reasoning. We propose verifier-coupled reasoning, a framework that inserts inline claims into reasoning traces and trains an auxiliary consistency head to predict programmatic verifier outputs from rationale-span hidden states. The central finding is a gap between decodability and faithfulness: consistency training reliably makes verifier information decodable from rationale representations, but decodability does not guarantee faithful generation. In LeanCheck (formal theorem proving), rationale-only and proof-only pooling achieve perfect directional separation under counterfactual conflict. In KataGo (Go engine), commentary spans encode 10-way win-rate buckets at 81% accuracy. Yet in a code setting, the model achieves 98.6% coupling while its generated explanations remain unfaithful: fluent prose with correct structured claims, but describing unrelated algorithms; a controlled pretrained-vs-from-scratch comparison shows the gap is not capacity-driven. Synthetic activation patching confirms causal influence (73-89% vs. 31% baseline), FEVER reveals that evidence-only pooling isolates genuine evidence sensitivity at the cost of raw accuracy, and per-claim analysis shows that consistency loss disproportionately benefits fine-grained claims over binary ones. These results establish that consistency losses are effective diagnostics and representation-shaping tools, but not sufficient conditions for faithful reasoning.

[NLP-144] Chehre: An Emoji-Prompted Video Dataset for Perceptually Diverse Facial Expression Recognition

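[Quick Read]: This paper addresses the limits of existing facial expression recognition datasets in dynamics, expressive range, and modeling of differences in human perception: most rely on static images, cover only basic emotion categories, or provide a single deterministic annotation. The core contribution is Chehre, an emoji-prompted video dataset. Participants performed and recorded dynamic facial expressions for 40 emojis, and their facial motion was then transferred onto synthetic faces to preserve privacy. A separate group of annotators labeled the anonymized videos with emoji and label annotations, yielding 2,111 high-quality videos from 203 performers validated by 902 annotators. Two benchmark tasks are defined: dominant expression recognition, which tests whether models recover the top human-rated labels, and distributional expression recognition, which tests whether models capture the diversity of human responses. Multiple predictions per video are generated with random sampling and persona prompting. Current vision-language models do poorly on both tasks: the best model reaches only 32.5% Top-1 accuracy on dominant expression recognition, and its Spread Ratio on distributional recognition is well below the human reference. Chehre thus provides a benchmark for diverse, dynamic, and distributional facial expression recognition.

Link: https://arxiv.org/abs/2606.21657
Authors: Bita Azari, Zoe Stanley, Avneet Batra, Poorvi Bhatia, Hali Kil, Manolis Savva, Angelica Lim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 16 pages, 8 images

Click to view abstract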

Abstract:Facial expressions are nonverbal social signals used in human interaction, but facial expression recognition datasets often focus on static images, basic emotion categories, or single deterministic annotations. We introduce Chehre, an emoji-prompted video dataset for analyzing dynamic facial expressions across a wide range of expressions and for exploring inter-individual perceptual diversity. In Chehre, participants were prompted to express and record 40 facial emojis. Later, their facial motions were transferred onto synthetic faces to preserve privacy. A separate group of annotators analyzed the anonymized videos using emoji and label annotations, resulting in 2,111 high-quality videos collected from 203 performers and validated by 902 annotators. We define two benchmark tasks: dominant expression recognition, which tests whether models recover the top human-rated labels, and distributional expression recognition, which tests whether models capture the diversity of human responses. We benchmark recent vision-language models using random sampling and persona prompting to generate multiple predictions per video. Results show that both tasks are challenging: among the models evaluated, the best-performing model achieves only 32.5% Top-1 accuracy on dominant expression recognition and a Spread Ratio well below the human reference on distributional recognition. Chehre provides a benchmark for evaluating diverse, dynamic, and distributional facial expression recognition.

[NLP-145] ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks

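[Quick Read]: This paper addresses a gap in how computer use agents are evaluated: existing evaluation focuses almost entirely on atomic desktop tasks, while realistic desktop work requires sustaining state across multiple objectives. The authors propose ChainWorld, which composes atomic OSWorld tasks into long-horizon desktop workloads through directional compatibility search while preserving the source evaluators. The workload contains 347 chains of length two to four and compares two renderings of the same task sequence: single-turn evaluation, where all tasks are presented together in one prompt, and multi-turn evaluation, where tasks are revealed one at a time. Across four current computer use agents, the best chain completion rate is only 31%. Multi-turn evaluation improves completion for three models, but both protocols remain challenging. The two protocols also expose different failure profiles: single-turn failures concentrate on artifact precision, while multi-turn failures more often reflect session management problems such as fragmented progress and disengagement in later turns. The contribution is thus a long-horizon evaluation framework that approximates real desktop workflows and reveals protocol-dependent bottlenecks in agent capability.

Link: https://arxiv.org/abs/2606.21654
Authors: Vincent Siu, Manasi Sharma, Dawn Song, Daniel Yue Zhang, Chenguang Wang
Affiliations: Scale AI; University of California, Berkeley; University of California, Santa Cruz
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract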

Abstract:Computer use agents are evaluated almost exclusively on atomic desktop tasks, but realistic desktop work requires sustaining state across multiple objectives. We study this gap with ChainWorld, which composes atomic OSWorld tasks into long horizon desktop workloads through directional compatibility search while preserving the source evaluators. The resulting workload contains 347 chains of length two to four and compares two renderings of the same task sequence. In single turn evaluation, all tasks are presented together in one prompt. In multi turn evaluation, tasks are revealed one at a time. Across four current computer use agents, maximum chain completion is 31%. Multi turn evaluation improves completion for three models, but both protocols remain challenging. The two protocols also expose different failure profiles. Single turn failures concentrate on artifact precision, while multi turn failures more often reflect session management problems such as fragmented progress and later turn disengagement.
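
A chain-level completion metric can be sketched as below. The assumption here, which the abstract does not spell out, is that a chain counts as completed only when every atomic task in it passes its original evaluator; the function name and sample outcomes are hypothetical.

```python
def chain_completion_rate(chain_results):
    """Fraction of chains in which every atomic task passed.

    chain_results: list of chains, each a list of booleans giving the
    pass/fail outcome of each atomic task's evaluator, in order.
    Assumption: a chain is complete only if all of its tasks pass.
    """
    if not chain_results:
        return 0.0
    completed = sum(1 for tasks in chain_results if tasks and all(tasks))
    return completed / len(chain_results)


# Three hypothetical chains of length two, three, and four.
results = [
    [True, True],
    [True, False, True],
    [True, True, True, True],
]
print(chain_completion_rate(results))
```

Under an all-or-nothing rule like this, per-task success compounds: an agent that passes each task independently with probability 0.7 would complete a four-task chain only about 24% of the time.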

[NLP-146] EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory

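[Quick Read]: This paper addresses the inherently static nature of existing embedding models, which encode text segments in isolation and ignore surrounding context and temporal order, so they cannot capture how information evolves in long-context scenarios. It proposes EvoEmbedding, an embedding model that generates evolvable representations for dynamic, sequential long-context retrieval tasks that require continuous state tracking. The core idea is that, while processing inputs sequentially, the model maintains a continuously updated latent memory and uses it together with the raw content to generate context-aware evolvable embeddings. The same query can therefore yield different representations under different evolving contexts, enabling retrieval beyond static semantic search. To support this design, the authors build the EvoTrain-180K dataset for joint optimization of latent memory and retrieval, introduce a memory queue to prevent representation collapse during recurrent encoding, and use segment batching to handle large length variance, which accelerates training by 3.8×. Experiments show that EvoEmbedding outperforms larger specialist models (e.g., Qwen3-Embedding-8B and KaLM-Embedding-Gemma3-12B) on several long-context retrieval benchmarks, generalizes to downstream tasks such as personalization with contexts 10× longer than its training window, and integrates into agentic workflows: for example, a naive RAG pipeline equipped with the model surpasses dedicated agentic memory systems.

Link: https://arxiv.org/abs/2606.21649
Authors: Chang Nie, Chaoyou Fu, Junlan Feng, Caifeng Shan
Affiliations: Nanjing University
Categories: Computation and Language (cs.CL)
Comments: Project Page: this https URL

Click to view abstract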

Abstract:Existing embedding models are inherently static: they encode text segments in isolation, ignoring their surrounding context and temporal order. This paper introduces EvoEmbedding, a novel embedding model that generates evolvable representations for retrieval. It is tailored for long-context scenarios, where information is dynamic, sequential, and requires continuous state tracking. Our design is simple: EvoEmbedding maintains a continuously updated latent memory as it sequentially processes inputs, and uses it alongside the raw content to jointly generate evolvable embeddings. Consequently, for the same query, our model adapts its representation to retrieve distinct targets based on the evolving context, going beyond static semantic search. To equip the model with this capability, we construct EvoTrain-180K, a diverse dataset for the joint optimization of latent memory and retrieval. Furthermore, we introduce a memory queue to prevent representation collapse during recurrent encoding, alongside segment-batching techniques that tackle significant length variance and accelerate training by 3.8×. Extensive experiments show that our model not only outperforms larger-scale specialists (e.g., Qwen3-Embedding-8B and KaLM-Embedding-Gemma3-12B) across a range of long-context retrieval benchmarks, but also generalizes well to downstream tasks (e.g., personalization) with contexts 10× longer than its training window. Notably, EvoEmbedding seamlessly integrates into agentic workflows to boost performance. For instance, a naive RAG pipeline equipped with our model surpasses dedicated agentic memory systems. Project Page: this https URL.

[NLP-147] Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models

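[Quick Read]: This paper examines how well large language models (LLMs) model gradient frequency distributions in language use, focusing on the fine line between conventionality and statistical preference. It studies linguistic binomials such as "men and women", where both orders are grammatical but differ in conventionality across languages. Binomial ordering is formalized as a distributional alignment problem, and a multilingual dataset of 600 binomial pairs across 8 languages is constructed. Using categorical and distributional metrics, the study compares corpus-derived preferences with the ordering probabilities induced by 6 open-weight LLMs. Models often recover the dominant corpus-preferred order, especially for strongly conventionalized pairs, but align less well with the exact preference distributions, suggesting that apparent directional agreement overstates how faithfully LLMs capture the statistical details of language use. Sparse probing further shows that preference strength is partially encoded in middle-to-late layers, and steering along probe-derived directions changes the ordering distributions the model produces, demonstrating that statistical preferences in LLMs can be measured and manipulated mechanistically through internal representations.

Link: https://arxiv.org/abs/2606.21645
Authors: Zhiqing Yang, Yilun Liu, Yunpu Ma, Volker Tresp, Hinrich Schütze
Affiliations: Ludwig Maximilian University of Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code and data are publicly available at this https URL

Click to view abstract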

Abstract:Large language models (LLMs) can readily reproduce conventional expressions, yet their ability to model gradient frequency distributions remains underexplored. We investigate this using linguistic binomials, such as men and women, where both word permutations are grammatically valid but exhibit distinct, cross-linguistic variations in conventionality. We formalize binomial ordering as a distributional alignment problem, and construct a multilingual dataset of 600 binomial pairs across 8 languages. With categorical and distributional metrics, we measure and compare the corpus-derived preferences with model-induced ordering probabilities of 6 open-weight LLMs. While models often behaviorally recover the dominant corpus-preferred order, particularly for strongly conventionalized pairs, they align less well with the exact corpus preference distributions. This suggests that apparent directional order overstates how faithfully LLMs capture the statistical nuances of language use. Sparse probing verifies that the concept of preference strength is partially encoded among middle-to-late layers, and steering along probe-derived directions alters model-induced ordering distributions, demonstrating that the statistical behavioral preference of LLMs can be mechanistically measured and manipulated via internal representations.
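
The difference between recovering the dominant order and matching the preference distribution can be shown with a small sketch. It assumes the model-induced preference is obtained by normalizing the probabilities a model assigns to the two orderings, and that the corpus preference is a relative frequency; these are illustrative choices with hypothetical names and made-up numbers, not necessarily the paper's metrics.

```python
import math


def ordering_preference(logp_ab, logp_ba):
    """Normalize two sequence log-probabilities into P(order "A and B")."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logp_ab, logp_ba)
    e_ab, e_ba = math.exp(logp_ab - m), math.exp(logp_ba - m)
    return e_ab / (e_ab + e_ba)


def compare_with_corpus(count_ab, count_ba, logp_ab, logp_ba):
    """Contrast a categorical check (same dominant order) with a
    distributional one (absolute gap between preference strengths)."""
    corpus_pref = count_ab / (count_ab + count_ba)
    model_pref = ordering_preference(logp_ab, logp_ba)
    return {
        "corpus_pref": corpus_pref,
        "model_pref": model_pref,
        "same_direction": (corpus_pref > 0.5) == (model_pref > 0.5),
        "gap": abs(corpus_pref - model_pref),
    }


# Hypothetical numbers: the corpus prefers "A and B" 90% of the time,
# while the model assigns relative probabilities of 0.6 and 0.4.
print(compare_with_corpus(90, 10, math.log(0.6), math.log(0.4)))
```

In this example the categorical check passes while the distributional gap is 0.3, which mirrors the abstract's point that directional agreement can overstate how well a model matches corpus statistics.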

[NLP-148] Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

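[Quick Read]: This paper addresses the difficulty of controlling access to sensitive capabilities in open-weight large language models (LLMs), which otherwise enable scientific progress and broad deployment. Current practice either suppresses dangerous capabilities before release, which over-restricts all users, or mediates access through closed services with specialized model variants, input/output monitors, and API permissions, which is fundamentally at odds with open-weight release. The paper proposes Tiered Language Models (TLMs), in which a single set of released weights supports multiple capability levels. In its default public configuration a TLM behaves like a conventional LLM; a compact secret key specifies a permutation over a small parameter subset, inducing an alternative computation graph over the same weights that unlocks additional capabilities. The training protocol jointly pretrains both configurations from scratch, then fine-tunes the keyed configuration on private data with regularization to preserve the public model's behavior. Experiments with 180M- and 650M-parameter TLMs show that the keyed configuration can acquire a new language, gain instruction-following ability, and memorize private factual knowledge, while the public configuration shows none of these capabilities. The approach extends naturally to multiple hierarchical tiers. Because authorization operates on the weight structure rather than in the input space, the mechanism resists fine-tuning-based extraction and partial key compromise. Overall, TLMs take a step toward reconciling open-weight release with selective capability control.

Link: https://arxiv.org/abs/2606.21638
Authors: Charbel El Feghali, Arkil Patel, Nicholas Meade, Spandana Gella, Verna Dankers, Siva Reddy
Affiliations: Mila and McGill University; ServiceNow Research
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Preprint. 28 pages

Click to view abstract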

Abstract:Open-weight Large Language Models (LLMs) enable scientific progress and broad deployment. However, they make it difficult to control access to sensitive capabilities. Current practice either suppresses dangerous capabilities before release or mediates access through closed services that use specialized model variants, input/output monitors, and API permissions. The former is susceptible to jailbreaks while sacrificing capability for all users to mitigate the risks posed by a few, and the latter is fundamentally incompatible with open-weight release. In this paper, we propose Tiered Language Models (TLMs), where a single set of released weights supports multiple capability levels. In its default public configuration, a TLM behaves as a conventional LLM. A compact secret key specifies a permutation over a small parameter subset, inducing an alternative computation graph over the same weights that exposes additional capabilities. We develop a training protocol that jointly pretrains both configurations from scratch, then fine-tunes the keyed configuration on private data with regularization to preserve the public model’s behavior. We pretrain 180M- and 650M-parameter TLMs and demonstrate that the keyed configuration can acquire a new language, gain instruction-following ability, and memorize private factual knowledge, whereas the public configuration exhibits none of these capabilities. Moreover, we show that our approach extends naturally to multiple hierarchical tiers. Because authorization operates on the model’s weight structure rather than in the input space, the mechanism resists fine-tuning-based extraction and partial key compromise. In general, TLMs take a step toward reconciling open-weight release with selective capability control.
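
The idea of a key that selects an alternative computation graph over the same weights can be illustrated with a deliberately tiny toy. Everything here is an assumption for illustration: the key is used as a random seed, the permutation is applied to the rows of one small weight matrix, and no training is involved. The paper's actual parameter subset, key format, and training protocol are not reproduced.

```python
import random


def permutation_from_key(key, n):
    """Derive a deterministic permutation of range(n) from an integer key."""
    rng = random.Random(key)  # toy key handling; not a cryptographic scheme
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def linear(weights, x):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def keyed_linear(weights, x, key):
    """Apply the same stored weights with rows reordered according to the key."""
    perm = permutation_from_key(key, len(weights))
    return linear([weights[i] for i in perm], x)


weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
x = [3.0, 5.0]
print("public:", linear(weights, x))
print("keyed: ", keyed_linear(weights, x, key=1234))
```

In this toy the keyed output is just a reordering of the public output. What would make the two configurations behave differently in a real model is that later layers are trained to expect one ordering or the other, which is the role of the joint training protocol described in the abstract.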

[NLP-149] Time-Frequency Weighted Losses for Phoneme Reconstruction in DNN-Based Speech Enhancement INTERSPEECH2026

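[Quick Read]: This paper addresses the fact that conventional training losses for speech enhancement based on the signal-to-distortion ratio (SDR) treat all time-frequency (TF) regions uniformly, overlooking the fine-grained spectral cues that matter for the intelligibility of specific phonemes. The key to the solution is a TF weighting framework that modulates the SDR objective using three factors, local speech presence, speech-to-interference ratio (SIR), and spectral flux, combined into a differentiable objective. The weighting emphasizes TF bins where speech and noise compete strongly and accounts for transient cues such as consonant bursts. The method improves objective frequency-weighted enhancement metrics and phoneme recognition accuracy, particularly for consonants, and spectral analysis shows better reconstruction of mid-frequency structures at less adverse SIR.

Link: https://arxiv.org/abs/2606.21635
Authors: Nasser-Eddine Monir, Paul Magron, Romain Serizel
Affiliations: Université de Lorraine; CNRS; Inria; LORIA
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2026

Click to view abstract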

Abstract:Conventional training losses for speech enhancement based on the signal-to-distortion ratio (SDR) treat all time-frequency (TF) regions uniformly, overlooking the fine-grained spectral cues that are relevant to specific phoneme intelligibility. We propose a TF weighting framework that modulates the SDR objective based on local speech presence, speech-to-interference ratio (SIR), and spectral flux. By integrating these factors into a differentiable objective, the framework emphasizes TF bins with high speech-noise competition while also accounting for transient cues such as consonant bursts. Experimental results show that our approach improves objective frequency-weighted enhancement metrics, as well as phoneme recognition accuracy, particularly for consonants. Spectral analysis shows better reconstruction of mid-frequency structures at less adverse SIR.

[NLP-150] CuratorKIT: Data Curation and Synthetic Data Generation for LLM Post-Training

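[Quick Read]: This paper addresses fragmentation in data curation for large language model (LLM) post-training pipelines: existing tools treat ingestion, deduplication, synthetic generation, and quality filtering as separate stages, which makes it hard to audit pipeline decisions or understand why a given sample was rejected. The proposed solution is CuratorKIT, an open-source Python library that covers the full lifecycle from ingestion to training export in a single configurable pipeline. The framework provides six source format readers with automatic schema detection, a pre-generation data hygiene layer for credentials, personally identifiable information (PII), and toxic content, eight LLM-powered generation tasks, three complementary quality gates with provenance-exact hallucination verification, structured adaptive recovery, and five training-ready export formats compatible with TRL, Unsloth, and AlignTune. Every decision is recorded in an append-only per-sample provenance chain, and rejected samples carry structured failure reasons rather than being silently discarded. CuratorKIT supports 100+ LLM providers through LiteLLM, exposes both a Python API and a YAML-driven CLI, and is aimed at practitioners who need reproducible, auditable data pipelines at scale.

Link: https://arxiv.org/abs/2606.21631
Authors: Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract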

Abstract:Data curation is a critical part of post-training pipelines for large language models, yet existing tools often treat ingestion, deduplication, synthetic generation, and quality filtering as separate stages. This fragmentation makes it difficult to audit pipeline decisions or understand why individual samples are rejected. CuratorKIT is an open-source Python library that covers this full lifecycle in a single configurable pipeline. The framework is composed of six source format readers and automatic schema detection, a pre-generation data hygiene layer for credentials, PII, and toxic content, eight LLM-powered generation tasks, three complementary quality gates with provenance-exact hallucination verification, structured adaptive recovery, and five training-ready export formats compatible with TRL, Unsloth, and AlignTune. Every pipeline decision is recorded in an append-only per-sample provenance chain, and rejected samples carry structured failure reasons rather than being silently discarded. CuratorKIT supports 100+ LLM providers through LiteLLM, exposes both a Python API and a YAML-driven CLI, and is designed for practitioners who need reproducible, auditable data pipelines at scale.
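
The idea of an append-only per-sample provenance chain with structured rejection reasons can be sketched generically. The classes and the `length_gate` function below are hypothetical and are not CuratorKIT's API; they only illustrate the pattern the abstract describes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProvenanceEvent:
    """One immutable record of a pipeline decision (hypothetical schema)."""
    stage: str
    accepted: bool
    reason: str = ""


@dataclass
class Sample:
    text: str
    _events: list = field(default_factory=list)

    def record(self, stage, accepted, reason=""):
        # Events are only ever appended, never edited or removed.
        self._events.append(ProvenanceEvent(stage, accepted, reason))

    @property
    def provenance(self):
        # Expose a read-only snapshot so callers cannot rewrite history.
        return tuple(self._events)

    @property
    def rejected(self):
        return any(not event.accepted for event in self._events)


def length_gate(sample, min_chars=20):
    """Toy quality gate: reject very short samples and say why."""
    ok = len(sample.text) >= min_chars
    sample.record("length_gate", ok, "" if ok else f"text shorter than {min_chars} chars")
    return ok


sample = Sample("too short")
length_gate(sample)
print(sample.rejected, sample.provenance)
```

Keeping the reason alongside the decision is what makes a rejected sample auditable later, instead of disappearing from the dataset without a trace.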

[NLP-151] Evaluating Document-Tuned Transformer Representations for Person-level Mental Health Assessment

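[Quick Read]: This paper addresses how to aggregate meaning across many messages from the same individual for person-level psychological assessment, a task that document-level training objectives were not explicitly designed for. The study systematically compares two architecturally matched model types under otherwise identical conditions: base transformers and document-tuned transformers (further contrastively fine-tuned at the document level, often called "sentence transformers"). On longitudinal mental health and psychological datasets, document-tuned models performed consistently better (a 13.4% increase in Pearson r, p = .015) and remained more accurate under word deletion, synonym replacement, typo injection, and back translation. Hedged language (e.g., "usually") was more characteristic of outcomes in document-tuned embeddings, while abundance terms (e.g., "lot") were more characteristic of base transformers, suggesting that document-tuned models may better capture uncertainty. The results indicate that the choice of representation affects mental health prediction, with document-tuned models usually performing better.

Link: https://arxiv.org/abs/2606.21622
Authors: Aaron Marker, Oscar Kjell, Vasudha Varadarajan, H. Andrew Schwartz
Affiliations: Vanderbilt University; Lund University; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract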

Abstract:Person-level psychological assessment requires aggregating meaning across many messages from the same individual, a task that document-level training objectives were not explicitly designed for. We present a systematic, empirical comparison between architecturally matched traditional (a) base-transformers and (b) document-tuned-transformers (further contrastively fine-tuned at the document-level, sometimes referred to as "sentence transformers") under otherwise identical conditions. Comparing layer-wise and overall performance across two longitudinal mental health and psychological datasets, we find document-tuned models demonstrated a consistent improvement over base representations (increase in Pearson r of 13.4%, p=.015). Robustness analyses revealed document-tuned models remained more accurate under perturbations to word deletion, synonym replacement, typo injection, and back translation. Further, hedged language (e.g., "usually") was more characteristic of outcomes in document-tuned embeddings while abundance (e.g., "lot") was more characteristic of base-transformers, suggesting document-tuned models may better capture uncertainty. These results suggest representation choice impacts mental health prediction, with document-tuned models often performing better.
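
Person-level aggregation and the Pearson r used to report results can be sketched as follows. Mean pooling over message embeddings is an assumption made for illustration (the abstract does not state the aggregation method), the embeddings are placeholder vectors rather than real model outputs, and the function names are hypothetical.

```python
import math


def mean_pool(message_embeddings):
    """Average a person's message embeddings into one person-level vector."""
    n = len(message_embeddings)
    dim = len(message_embeddings[0])
    return [sum(vec[d] for vec in message_embeddings) / n for d in range(dim)]


def pearson_r(xs, ys):
    """Pearson correlation between predicted and observed scores."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant inputs
    return cov / math.sqrt(var_x * var_y)


# Placeholder 2-dimensional embeddings for two messages from one person.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
# Toy predicted vs. observed assessment scores for three people.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A person-level vector produced this way would typically feed a downstream regressor, and Pearson r between its predictions and the assessment scores is the kind of figure the abstract compares across base and document-tuned models.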

[NLP-152] CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage

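[Quick Read]: This paper addresses a weakness in how Multimodal Large Language Models (MLLMs) are evaluated on Chinese Cultural Heritage (CCH): existing benchmarks emphasize final-answer accuracy and neglect fine-grained assessment of the reasoning process, which must integrate visual, textual, stylistic, and historical clues. The authors introduce CulMind and CulMind-R: a high-quality multimodal CCH benchmark covering 50 tasks drawn from the collections of more than 100 museums, and a 24-task reasoning subset that adaptively defines task-specific dimensions for evaluating the reasoning process. The core of the solution is ReaScore, a task-adaptive metric that automatically weights reasoning dimensions by their relevance to the task. Experiments on 14 leading MLLMs reveal a substantial gap between answer quality and reasoning quality, especially on challenging tasks, and further analysis shows that task-adaptive dimension selection and weighting bring evaluation results closer to expert judgments. Overall, the work offers a more expert-aligned approach to assessing CCH understanding that may transfer to other cultural domains. The data, code, and evaluation scripts are publicly released to support reproducible research.

Link: https://arxiv.org/abs/2606.21618
Authors: Zhangwei Cao, Shuhan Fan, Yuting Wei, Jiajun Zhang, Yihang Peng, Qi Meng, Yangfu Zhu, Liangbin Yang
Affiliations: University of International Relations; Kedge Business School; Peking University; Tsinghua University; Beijing University of Posts and Telecommunications; Capital Normal University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract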

Abstract:Evaluating Multimodal Large Language Models (MLLMs) in Chinese Cultural Heritage (CCH) requires fine-grained reasoning over visual, textual, stylistic, and historical clues. However, existing CCH benchmarks mainly emphasize final-answer accuracy, while the accuracy and completeness of reasoning processes remain underexplored. To address this gap, we introduce CulMind and CulMind-R: a high-quality benchmark for multimodal CCH covering 50 tasks from collections of more than 100 museums, and a 24-task reasoning subset that adaptively defines task-specific dimensions for reasoning process evaluation. To evaluate reasoning quality, we propose ReaScore, a task-adaptive metric that evaluates reasoning by automatically weighting task-relevant dimensions. Experiments on 14 leading MLLMs reveal a substantial gap between answers and reasoning, especially on challenging tasks. Further analysis shows that task-adaptive dimension selection and weighting better align evaluation results with expert judgments. Overall, our benchmark and metric support a more expert-aligned assessment of CCH understanding and offer a transferable reference for broader evaluations of cultural heritage. We publicly release the data, code, and evaluation scripts at this https URL to facilitate reproducible research.
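
One simple way to picture a task-adaptive metric that weights task-relevant dimensions is a relevance-weighted average of per-dimension scores. The sketch below is only that: the dimension names, weights, and formula are assumptions for illustration and should not be read as the definition of ReaScore.

```python
def weighted_reasoning_score(dimension_scores, relevance_weights):
    """Relevance-weighted average of per-dimension reasoning scores.

    dimension_scores: dict mapping dimension name -> score in [0, 1].
    relevance_weights: dict mapping dimension name -> non-negative weight,
    standing in for how relevant each dimension is to the task at hand.
    """
    total_weight = sum(relevance_weights[d] for d in dimension_scores)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    weighted_sum = sum(
        dimension_scores[d] * relevance_weights[d] for d in dimension_scores
    )
    return weighted_sum / total_weight


# Hypothetical task where visual clues matter three times as much as history.
scores = {"visual": 0.8, "historical": 0.4}
weights = {"visual": 3.0, "historical": 1.0}
print(weighted_reasoning_score(scores, weights))
```

With uniform weights the same scores would average to 0.6, so the choice of weights can change a model's ranking, which is why the abstract checks the adaptive weighting against expert judgments.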

[NLP-153] LLM and Human Modes of Representation

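[Quick Read]: This paper examines how generative AI differs from human cognition at the level of cognitive foundations, focusing on how Large Language Models (LLMs) represent linguistic knowledge and how they perform on real-world reasoning and planning. Its central observations are that LLMs reach high levels of fluency and performance on linguistic tasks while handling linguistic content in ways that differ markedly from human processing, and that on reasoning tasks requiring learning and generalization they are, for the most part, less efficient than humans. The paper's approach is to review recent research on the points of convergence and divergence between the two kinds of system in these two domains.

Link: https://arxiv.org/abs/2606.21616
Authors: Shalom Lappin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract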

Abstract:Much work on the cognitive foundations of AI has focussed on comparisons between the ways in which Large Language Models (LLMs) and humans process information and represent it. One aspect of this comparison involves determining the extent to which LLMs can achieve or surpass human performance on a variety of cognitively interesting tasks. A second explores points of convergence and divergence between LLM and human systems for processing information. Here, I consider some recent research that has addressed both issues in two informational domains. The first is the representation of linguistic knowledge. The second is real world reasoning and planning. While LLMs frequently achieve impressive levels of performance and fluency on linguistic applications, they tend to handle linguistic content in ways that are distinct from human processing. They are also, for the most part, less efficient than humans in learning and generalisation for reasoning tasks.

[NLP-154] Rubric-as-Experts: Case-Specific MQM Rubrics for Translation Quality Evaluation

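Quick read: This paper addresses the weak performance of large language models (LLMs) in fine-grained translation quality evaluation (QE) when a fixed, predefined MQM (Multidimensional Quality Metrics) rubric configuration is shared across all samples. A single static rubric cannot adapt to differences between translation samples in error complexity, ambiguity, and required evaluation granularity, which hurts the accuracy of span-level error localization. The core solution is a case-specific dynamic rubric framework that stays consistent with the predefined MQM taxonomy while dynamically selecting, for each translation instance, a suitable subtype space and evaluation granularity. Experiments on WMT span-level QE benchmarks across multiple model scales show consistently higher Matthews correlation coefficient (MCC) and cleaner span-level error localization than static rubric settings, supporting the combination of structured MQM rubrics with case-specific adaptive allocation.

Link: https://arxiv.org/abs/2606.21559
Authors: Weilu Xu, Yunzhi Shen, Xinye Wang, Ranfei Dang, Shujian Huang
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University
Categories: Computation and Language (cs.CL)
Comments: 18 pages including appendix, 6 figures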

Abstract:Large language models (LLMs) have shown strong potential in fine-grained translation quality evaluation (QE), yet existing MQM-based approaches typically rely on fixed rubric configurations shared across all translation samples. However, translation instances often differ substantially in error complexity, ambiguity, and required evaluation granularity, making static rubric allocation suboptimal for span-level error detection. We find that larger MQM subtype spaces improve error coverage but also introduce more false positives, while different translation instances prefer different rubric granularities, suggesting that evaluation spaces should be allocated dynamically for each case. Motivated by these observations, we propose a case-specific dynamic rubric framework that adaptively constructs MQM evaluation spaces for individual translation instances. Unlike fully free-form rubric generation methods, our framework remains grounded in the predefined MQM taxonomy while dynamically selecting suitable subtype spaces and evaluation granularity for different cases. Experiments on WMT span-level QE benchmarks across multiple model scales demonstrate that the proposed framework consistently improves MCC and produces cleaner span-level error localization compared with static rubric settings. Our results suggest that combining structured MQM rubrics with case-specific adaptive allocation is an effective strategy for fine-grained LLM-based translation evaluation.
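As a rough illustration of the MCC metric named in the abstract above, the following sketch computes the Matthews correlation coefficient over token-level error labels. The function name `token_mcc` and the 0/1 token labelling are assumptions made for this example; the paper's exact mapping from MQM spans to labels is not described here.

```python
import math


def token_mcc(gold, pred):
    """Matthews correlation coefficient for two equal-length 0/1 label lists.

    1 marks a token inside an error span, 0 a token outside one.
    (Hypothetical labelling scheme, chosen only for illustration.)
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Because MCC uses all four cells of the confusion matrix, extra false positives from a larger subtype space lower the score even when error coverage improves, which matches the trade-off the abstract describes.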

[NLP-155] PeerMathDial: A Middle School Dialogue Dataset for Student Collaborative Math Problem Solving ACL2026

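Quick read: This paper addresses the lack of dialogue datasets capturing student-student collaborative problem solving (CPS) in authentic educational settings. Existing educational dialogue datasets mostly cover teacher-student interaction (classroom instruction or tutoring), while high-quality data on small-group peer interaction collected in real classrooms is scarce, which limits research on how students communicate, coordinate, and solve problems together. The authors introduce PeerMathDial, the first dataset of peer CPS dialogues collected from authentic middle school math classrooms, containing 55 dialogues from 27 students and 6,406 turns in total. The key elements are (1) a corpus-grounded dialogue act taxonomy, built with LLM assistance, for fine-grained annotation, and (2) three demonstrated use cases: tracking how dialogues evolve over time and measuring the effect of teacher interventions, relating dialogue acts to self-reported student traits (e.g., confidence, leadership), and evaluating LLMs on dialogue act prediction to probe their potential for student simulation. The work provides a data and methodology basis for studying collaborative learning processes.

Link: https://arxiv.org/abs/2606.21557
Authors: Murong Yue, Desmond Alexander Mcglone, Emily Slutz, Wenhan Lyu, Yixuan Zhang, Jennifer Suh, Ziyu Yao
Affiliations: George Mason University; William & Mary
Categories: Computation and Language (cs.CL)
Comments: 17 pages. Project website (dataset and source code): this https URL. Accepted to the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA), co-located with ACL 2026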

Abstract:Collaborative Problem Solving (CPS) is a core skill in education, where the process of peer interaction is highly important. However, existing educational dialogue datasets mostly focus on classroom instruction or tutoring (i.e., teacher/tutor-student interaction), yet datasets centering small-group, student-student interaction are limited. This thus leaves research with limited resources for studying how students interact, coordinate, and solve problems together in real educational settings. To address this, we introduce PeerMathDial, the first dataset of peer CPS dialogues collected from authentic middle school math classrooms. It contains 55 dialogues from 27 students, totaling 6,406 turns. To facilitate research on CPS discourse analysis, we further build a corpus-grounded dialogue act taxonomy assisted by LLMs. Using the dataset and the dialogue act taxonomy, we demonstrate the practical applications of PeerMathDial across three use cases. First, we track how dialogues evolve over time and measure the impact of teacher interventions. Second, we align dialogue actions with student surveys to reveal the connection between students’ traits (e.g., confidence, leadership) and their actual behaviors. Third, by evaluating LLMs on dialogue act prediction, we glimpse at the potential of LLMs for student simulation in educational applications. Our dataset and source code will be released to the community.

[NLP-156] MedHal-Loc: Are “Explainable-by-Architecture” Medical Hallucination Detectors Faithful Localizers? A Localization Benchmark

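Quick read: This paper addresses explainability in hallucination detection for clinical text: a system should not only flag an unreliable response but also pinpoint the offending span. The core question is whether detectors achieve localization faithfulness, that is, whether the error unit a detector ranks highest actually overlaps the true erroneous span. The authors propose the MedHal-Loc benchmark and metric. Its controlled subset contains 300 PubMedQA-derived statements with single span-level errors injected across four localizable types (entity substitution, relation error, mechanism misattribution, invention), which yields gold spans by construction. A complementary natural subset shows that most real hallucinations are diffuse conclusion-flips that resist span localization (a human expert accepted only 1 of 18 candidate spans). Among four fine-grained detection paradigms, NLI-per-clause, consistency-per-sentence, and the dedicated span detector FAVA all localize well above chance, whereas an elaborate KG-triple pipeline, despite competitive detection F1 (0.609), localizes only 3.3 percentage points above chance, a difference that is not statistically significant; the bottleneck is entity-extraction coverage of about 59%. Detection competence does not imply localization faithfulness, so the explainability claimed for KG-based architectures must be validated empirically rather than assumed.

Link: https://arxiv.org/abs/2606.21517
Authors: Minmin Chen, Daojian Lu, Yining Dai, Jvyu Cai, Fengdan Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 5 figures, 6 tables. Submitted to Computers in Biology and Medicine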

Abstract:Detecting hallucinations in clinical text is increasingly framed as an explainability problem: systems should not merely flag an unreliable response but point to the offending span. Architectures built around knowledge-graph (KG) triple decomposition are marketed for exactly this auditability, yet their localization ability is typically assumed rather than measured. We introduce MedHal-Loc, a benchmark and metric for localization faithfulness – whether a detector’s top-ranked error unit actually overlaps the erroneous span. The controlled subset comprises 300 PubMedQA-derived statements with single, span-level errors injected across four localizable types (entity substitution, relation error, mechanism misattribution, invention), yielding gold spans by construction; a complementary natural subset documents that real hallucinations are dominated by diffuse conclusion-flips that resist span localization (a human expert accepted 1/18 candidate spans). Evaluating four fine-grained paradigms, we find that NLI-per-clause, consistency-per-sentence, and the dedicated span detector FAVA all localize well above chance, whereas an elaborate KG-triple pipeline localizes no better than chance (+3.3pp, n.s.), bottlenecked by ~59% entity-extraction coverage – despite competitive detection F1 (0.609). Detection competence does not imply faithful localization; architectural explainability must be validated, not presumed.
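The localization-faithfulness idea in the abstract above (does the detector's top-ranked error unit overlap the gold span?) can be sketched in a few lines. This is a minimal sketch under stated assumptions: units and gold spans are half-open character offsets, and the names `spans_overlap`, `localization_hit`, and `localization_accuracy` are hypothetical, not taken from the benchmark's code.

```python
def spans_overlap(a, b):
    """True if half-open spans a=(start, end) and b=(start, end) share a character."""
    return a[0] < b[1] and b[0] < a[1]


def localization_hit(scored_units, gold_span):
    """scored_units: list of (start, end, error_score) produced by a detector.

    Returns True when the highest-scoring unit overlaps the gold error span.
    """
    if not scored_units:
        return False
    start, end, _ = max(scored_units, key=lambda unit: unit[2])
    return spans_overlap((start, end), gold_span)


def localization_accuracy(examples):
    """examples: list of (scored_units, gold_span) pairs; returns the hit rate."""
    hits = [localization_hit(units, gold) for units, gold in examples]
    return sum(hits) / len(hits)
```

A chance baseline for comparison would pick a unit uniformly at random instead of taking the maximum, which is the reference point behind a statement such as "+3.3pp over chance".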

[NLP-157] Towards Pedagogically Aligned LLM Tutors for Math Mistake Remediation

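Quick read: This paper addresses the lack of effective pedagogical strategy in LLM-based intelligent tutoring systems, in particular the tendency to reveal final answers directly, which violates tutoring principles such as scaffolding. The solution is a two-stage alignment pipeline: supervised fine-tuning on real tutoring dialogues, followed by Direct Preference Optimization (DPO) on synthetic preference pairs. The dataset combines existing tutoring corpora with synthetic data generated along pedagogical dimensions (such as scaffolding and factuality), and the authors study input configurations that include solution correctness and gold answers. Experiments show gains in both factual accuracy and pedagogical quality over base models and existing tutoring models; human evaluation indicates that the best model is competitive with a strong proprietary baseline while offering greater openness, transparency, and reproducibility. The study highlights the effectiveness of preference-based pedagogical alignment and also reveals remaining challenges in reliably evaluating tutoring quality.

Link: https://arxiv.org/abs/2606.21502
Authors: Kseniia Petukhova, Tien Dat Nguyen, Ekaterina Kochmar
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: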

Abstract:Large language models have strong potential for use in intelligent tutoring systems, but they often fail to follow effective pedagogical strategies, such as guiding students without revealing final answers. We study the application of a two-stage alignment pipeline for math mistake remediation, combining supervised fine-tuning on tutoring dialogs with Direct Preference Optimization on synthetic preference pairs. We construct a dataset that integrates existing tutoring corpora with synthetic data generated along pedagogical dimensions, such as scaffolding and factuality, and study different input configurations that incorporate solution correctness and gold answers. Experiments show that this approach improves both factual accuracy and pedagogical quality over base models and existing tutoring models. Human evaluation further indicates that our best model is competitive with a strong proprietary baseline, while providing additional benefits in terms of openness, transparency, and reproducibility. Our results highlight the effectiveness of preference-based pedagogical alignment, while also revealing challenges in reliably evaluating tutoring quality.
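For readers unfamiliar with the second stage mentioned in the abstract above, here is a minimal sketch of the standard per-pair DPO loss, computed from summed log-probabilities of a preferred ("chosen") and dispreferred ("rejected") tutor response. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions for this example; the paper's training code and hyperparameters are not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each argument is the summed token log-probability of one response under
    the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree, the loss is log 2; it falls as the policy raises the chosen response (for instance, a hint that scaffolds) relative to the rejected one (for instance, a reply that reveals the final answer).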

[NLP-158] Economic Transformation and Cultural Change: Evidence from Two Centuries of French Drama

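Quick read: This paper asks how large-scale economic transformations shape cultural production, using French drama from 1700 to 1900 to relate changes in economic structure to the evolution of literary discourse. The approach combines computational linguistics, econometrics, and formal modelling. Latent Dirichlet allocation (LDA) applied to 1,215 theatrical texts shows that aristocratic discourse centred on sovereignty and political authority was gradually displaced by bourgeois and household-economic themes. Bayesian vector autoregressive models with max-share shock identification indicate a temporal shift in the literary response to economic shocks: bourgeois everyday-life themes already reacted to GDP shocks in the eighteenth century, whereas household-economic concerns became responsive only after 1820, amid accelerating industrialisation. A discrete-choice model shows that peer effects among authors and sensitivity to prevailing economic conditions can jointly account for these dynamics, and Monte Carlo simulations reproduce the observed historical trajectory with reasonable fidelity. The result is a quantitative framework, grounded in identifiable social mechanisms, for how economic transformation propagates into cultural production, contributing to the study of cultural evolution and the long-run relationship between institutions and literary discourse.

Link: https://arxiv.org/abs/2606.21485
Authors: T. D. Oliveira, L. A. Attilio, M. J. Davila-Fernandez
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: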

Abstract:How do large-scale economic transformations shape cultural production? We address this question by combining computational linguistics, econometrics, and formal modelling, using French drama as a well-documented empirical laboratory. Applying latent Dirichlet allocation to a corpus of 1,215 theatrical texts published between 1700 and 1900, we show that aristocratic discourse centred on sovereignty and political authority was gradually displaced by bourgeois and household economic themes as French capitalism developed. Bayesian vector autoregressive models with max-share shock identification suggest a temporal shift in the literary response to economic shocks: bourgeois everyday-life themes reacted to GDP shocks in the eighteenth century, whereas household-economic concerns became responsive only after 1820, amid accelerating industrialisation. A discrete-choice model shows that peer effects among authors and sensitivity to prevailing economic conditions can jointly account for these dynamics. Monte Carlo simulations reproduce the observed historical trajectory with reasonable fidelity. These findings offer a quantitative framework for understanding how economic transformations propagate into cultural production through identifiable social mechanisms, contributing to the study of cultural evolution and the long-run relationship between institutions and literary discourse.

[NLP-159] Evaluation of Small Language Models for Arabic Language Processing

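Quick read: This paper addresses the lack of a systematic benchmark for evaluating small language models (SLMs) on Arabic natural language processing (NLP). Existing evaluations of Arabic models often lack a unified test set covering multiple task types, which makes model comparison and optimization hard. The authors propose a benchmark of 240 test items spanning eight domains and ten language skills, covering both comprehension and generation tasks, and evaluate 12 SLMs under a strict zero-shot setting with a standardized Arabic-only prompt template. The key component is a multi-model LLM-as-a-judge framework that combines scores from GPT-4.1 Mini, Claude Haiku 4.5, and DeepSeek-Chat, with aggregated and per-dimension analysis. The study finds that model size alone does not determine performance: models with stronger Arabic alignment and more reliable instruction following tend to do better, while lower-scoring models share failure patterns such as prompt leakage, hallucination, language drift, incomplete generation, and weak task adherence. The benchmark offers a structured reference for developing efficient, reliable, and culturally appropriate Arabic AI systems.

Link: https://arxiv.org/abs/2606.21460
Authors: Jumana Alsubhi, Ahmed Alhusayni, Abdulrahman Gharawi, Israa Hamdine, Alshaymaa Allahim, Lamees Alhumaid, Ahmad Shabana, Rafik Madani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: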

Abstract:This paper evaluates the performance of twelve Small Language Models (SLMs) on Arabic natural language processing tasks. The study introduces a benchmark of 240 Arabic test items distributed across eight domains and ten language skills, covering both comprehension-oriented and generation-oriented tasks. All models were evaluated under a controlled zero-shot setting using a standardized Arabic-only prompt template. Model responses were assessed through a multi-model LLM-as-a-judge framework involving GPT-4.1 Mini, Claude Haiku 4.5, and DeepSeek-Chat, with scores aggregated across judges and analyzed by task, skill, and model family. The results show that Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic. The observed results suggest that model size alone does not explain Arabic SLM performance. Models with stronger Arabic alignment and more reliable instruction-following behavior tended to perform better across tasks. Common failure patterns among lower-performing models include prompt leakage, hallucination, language drift, incomplete generation, and weak task adherence. Overall, the benchmark provides a structured reference for evaluating compact Arabic language models and supports future work on efficient, reliable, and culturally appropriate Arabic AI systems.

[NLP-160] Precision Recall Controllable Radiology Report Generation via Hybrid Natural Language and Clinical Reward Learning MICCAI2026

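Quick read: This paper addresses the imbalance between clinical accuracy and linguistic fluency in automated radiology report generation (RRG): existing methods score well on natural language generation (NLG) metrics but offer little control over clinically important quantities such as precision and recall, so reports can be fluent yet poorly aligned with clinical needs. The solution is a reinforcement learning framework for precision-recall controllable RRG, in which a control parameter adjusts the precision-recall trade-off at inference time so that reports can be generated to match different clinical requirements. To promote clinical correctness, a clinical reward is added to the training objective, improving clinical efficacy (CE), and a group-relative training strategy normalizes rewards within each group to reduce reward variance and stabilize training. On the MIMIC-CXR dataset, the method outperforms state-of-the-art approaches on both NLG and CE metrics while providing reliable control over the CE precision-recall trade-off.

Link: https://arxiv.org/abs/2606.21447
Authors: Ling Chen, Ruinan Jin, Jun Luo, Hanliang Chen, Quirin Strotzer, Rongkai Yan, Yuan Xue, Luciano Prevedello, Dufan Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026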

Abstract:Automated radiology report generation (RRG) has gained increasing attention because it can reduce the heavy workload of clinical report writing. However, most existing methods mainly optimize for natural language generation (NLG) metrics that focus on language fluency, while providing little control over clinically important factors such as precision and recall. As a consequence, generated reports may be fluent but not well aligned with different clinical needs. To address this challenge, we propose a reinforcement learning framework for precision recall controllable RRG, where a control parameter explicitly adjusts the trade-off between clinical precision and recall during inference. This design allows the model to flexibly generate reports according to different clinical requirements. To ensure clinical correctness, we introduce a clinical reward into the training objective, which helps improve clinical efficacy (CE) beyond standard language-based optimization. In addition, we apply a group-relative training strategy that normalizes rewards within each training group, reducing reward variance and improving training stability. Extensive experiments on the MIMIC-CXR dataset show that our method consistently outperforms state-of-the-art approaches in both NLG and CE evaluation metrics, while providing reliable control over the CE precision recall trade-off.
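Two ingredients of the abstract above can be illustrated with a small sketch: a reward that trades precision against recall through a single parameter, and within-group reward normalization. Both functions are assumptions for illustration. The abstract does not say that the control parameter is an F-beta weight, and the mean/standard-deviation normalization shown is the common group-relative form, not necessarily the authors' exact one.

```python
import statistics


def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision.

    Used here as a stand-in for a controllable clinical reward (assumption).
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With precision 0.8 and recall 0.4, `beta=2.0` scores the report lower than `beta=0.5` does, so a recall-oriented setting would push the policy toward mentioning more findings. Normalizing inside each group removes the per-study reward scale before the policy update.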

[NLP-161] CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation

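Quick read: This paper asks whether it is still worth building a specialized translation model for a single language pair (here Japanese-English) now that large multilingual translation models perform strongly. The authors build a family of small specialized models with 0.8B, 1.4B, 3.3B, and 7B parameters, trained on synthetically generated parallel corpora using two-stage supervised fine-tuning followed by Multi-Objective GRPO. While multilingual models remain strong on WMT benchmarks, the compact specialized models outperform them on real-world benchmarks covering business, legal, medical, financial, and patent domains, which suggests that specialized models retain practical value.

Link: https://arxiv.org/abs/2606.21413
Authors: Yuu Jinnai
Affiliations: CyberAgent, Tokyo, Japan
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: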

Abstract:Nowadays, large multilingual translation models demonstrate impressive translation capabilities in the machine translation benchmarks. This raises a practical question to the developers: is it worth developing translation models specialized for a particular language pair if you only need to support that language pair? To give an anecdotal answer to this question, we develop a family of small language models (0.8B, 1.4B, 3.3B, and 7B parameters) specialized for Japanese-English bidirectional translation. We employ a two-stage supervised fine-tuning approach followed by Multi-Objective GRPO (Ichihara et al. 2025) to train models on synthetically generated parallel corpora. We evaluate our models on WMT and real-world translation benchmarks across business, legal, medical, financial, and patent domains. While multilingual models achieve strong performance on WMT benchmarks, our compact models outperform them on real-world benchmarks, suggesting the practical utility of developing specialized translation models even in the era of large multilingual models.

[NLP-162] Finetuning with Scientific Data Increases Hallucinations: A Multi-domain Factuality Evaluation of LLMs

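Quick read: This paper addresses the factual-reliability risk posed by hallucinations when LLMs are used in science, and in particular the lack of systematic evaluation of scientifically fine-tuned LLMs. Prior work is largely restricted to the biomedical domain, treats hallucination as a binary judgment, and has not examined the growing family of science-specific fine-tuned models. The authors introduce SciFactCheck, a benchmark of 2,500 prompts across five scientific domains, together with a modular evaluation framework targeting three types of factuality hallucination: unverifiability, overclaim, and attribution errors. Using a controlled minimal-pair design, they evaluate 18 LLMs by comparing each scientifically fine-tuned model against its general-purpose base, and find that (1) scientifically fine-tuned models show degraded factual reliability across all hallucination types and domains, and (2) fine-tuned models are internally less confident yet linguistically more assertive. A human pilot study further shows that current fact-checking tools agree only modestly with expert judgments, and that the definition of a scientifically check-worthy claim is contested even among human annotators. The findings challenge current domain-specific fine-tuning as a route to factuality and call for better verification infrastructure for scientific content.

Link: https://arxiv.org/abs/2606.21359
Authors: Raia Abu Ahmad, Nikolas Rauscher, Ekaterina Borisova, Fabio Barth, Georg Rehm, Sebastian Möller
Affiliations: German Research Center for Artificial Intelligence; University of Mannheim; Saarland University
Categories: Computation and Language (cs.CL)
Comments: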

Abstract:Large language models (LLMs) are increasingly used to communicate and explain scientific concepts, yet their tendency to hallucinate poses significant risks in this high stakes use-case. Prior hallucination evaluation work remains largely restricted to the biomedical domain, treats hallucination as a binary task, and has not examined the growing family of scientifically fine-tuned LLMs. We address these gaps with SciFactCheck, a benchmark of 2,500 prompts across five scientific domains, paired with a modular evaluation framework targeting three factuality hallucination types: unverifiability, overclaim, and attribution. Using a controlled minimal-pairing design, we evaluate 18 LLMs by comparing each scientifically fine-tuned model against its general-purpose base. Our results indicate that 1. Scientifically fine-tuned models exhibit degraded factual reliability across all hallucination types and scientific domains, and 2. Fine-tuned models are internally less confident yet linguistically more assertive. A human pilot study further reveals that current fact-checking tools show only modest agreement with expert judgments on scientific content, and that defining scientifically check-worthy claims remains contested even among human annotators. Our findings fundamentally challenge current methods of domain-specific fine-tuning for factuality and call for developing improved verification infrastructure for scientific content.

[NLP-163] Factual Retrieval in LLMs Is a Redundant Distributed and Non-Contiguous Process ACL2026

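Quick read: This paper studies the still unclear mechanism by which large language models (LLMs) transform entity representations to retrieve a specific attribute. The central question is how the "attribute-computation path" from an entity representation to the target attribute is formed inside the model. The authors propose an iterative patching protocol that identifies a minimal subset of layers necessary for a given attribute computation. The key findings are that these paths are non-contiguous, often skipping layers, and that the same entity and fact can be served by multiple functionally equivalent paths, indicating high redundancy and a distributed computation. This may explain the mismatch between knowledge localization and knowledge editing, and suggests that knowledge storage and retrieval in LLMs is far from well understood.

Link: https://arxiv.org/abs/2606.21345
Authors: Hail Hochman, Natalie Shapira, Yoav Goldberg
Affiliations: Bar-Ilan University; Northeastern University
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main Conference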

Abstract:Large language models (LLMs) store and recall factual knowledge, yet the precise mechanism of how entity representations are transformed to enable specific attribute retrieval remains underexplored. In this work, we investigate this mechanism through the lens of an “attribute-computation path”, a sequence of computational steps over the entity representation required to elicit a target attribute. We then propose an iterative patching protocol to identify a minimal subset of layers necessary for this computation. Applying our method to LLaMA 3.1 8B and Qwen3 8B, we find that these paths are non-contiguous, often skipping layers, and that models possess multiple, functionally-equivalent paths for the same entity and fact, highlighting a high degree of redundancy in attribute computation. This implies that knowledge computation is highly distributed, potentially explaining the localization-editing mismatch and suggesting that knowledge storage and retrieval in LLMs is far from being well understood.

[NLP-164] Synthetic Audio Generation Framework for Air Traffic Control Speech Recognition INTERSPEECH2026

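Quick read: This paper addresses the drop in automatic speech recognition (ASR) accuracy in specialized domains such as air traffic control (ATC), caused by strong channel noise, prevalent non-native (L2) English accents, and scarce annotated real data. The solution is a synthetic data generation pipeline with acoustic-property simulation that combines several neural generation techniques, namely text-to-speech (TTS), voice conversion, L2-to-L1 accent conversion, and a novel controllable L1-to-L2 accent conversion framework, to produce realistic synthetic speech with diverse accents. Experiments with the Whisper model on the ATCO2 corpus show that fine-tuning on synthetic data alone, or on a mix of real and synthetic data, significantly reduces word error rate (WER) relative to the out-of-the-box and real-data-only baselines, respectively.

Link: https://arxiv.org/abs/2606.21340
Authors: Raphaël Bagat, Zhe Zhang, Junichi Yamagishi, Irina Illina, Emmanuel Vincent
Affiliations: Université de Lorraine, CNRS, Inria, LORIA; National Institute of Informatics
Categories: Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026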

Abstract:Automatic Speech Recognition (ASR) systems, despite achieving remarkable accuracy in general-purpose domains with native speech (L1), struggle in domains like Air Traffic Control (ATC) due to strong channel noise, a presence of non-native (L2) English accents, and data scarcity. We propose a synthetic data generation pipeline with acoustical properties simulations specifically designed to address this lack of real data to improve recognition accuracy in the ATC domain. Our approach leverages a combination of neural generation techniques, including Text-to-Speech, Voice Conversion, L2-to-L1 accent conversion, and a novel controllable L1-to-L2 accent conversion framework built to simulate accented speech. Our experiments with the Whisper model on the ATCO2 corpus demonstrate that fine-tuning with either synthetic data alone, or a mix of real and synthetic data, significantly improves the word error rate over out-of-the-box and real data only baselines respectively.

[NLP-165] LISE: Listenable Interpretable Speaker Embeddings INTERSPEECH2026

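Quick read: This paper addresses the opacity of embeddings in deep-neural-network automatic speaker verification (ASV) systems: speaker embeddings lack a structured, perceptually verifiable account of the vocal characteristics they encode. Existing approaches either rely on annotated speaker attributes, which is costly, or introduce alternative representations whose interpretability has not been validated with listeners. The authors propose Listenable Interpretable Speaker Embeddings (LISE), a label-free framework that decomposes pretrained speaker embeddings into a small set of components, giving a structured representation that supports analysis of what the embeddings encode. LISE preserves ASV performance, with negligible equal error rate (EER) degradation on x-vector and ECAPA-TDNN, and listening experiments confirm that the components are interpretable to human listeners: participants distinguished speakers with 83.9% accuracy.

Link: https://arxiv.org/abs/2606.21305
Authors: Xiaoliang Wu, Chongxin Gan, Ke Liu, Peter Bell, Jennifer Williams
Affiliations: University of Southampton; The Hong Kong Polytechnic University; University of Edinburgh
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026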

Abstract:Deep neural network-based automatic speaker verification (ASV) systems achieve impressive performance but their embedding representations remain opaque, lacking a structured and perceptually verifiable explanation of the vocal characteristics they encode. Existing approaches either require annotation of speaker attributes or introduce alternative representations whose interpretability is unvalidated with listeners. We propose Listenable Interpretable Speaker Embeddings (LISE), a label-free framework that decomposes pretrained speaker embeddings into a small set of components. This decomposition yields a structured representation that supports the analysis of what information has been encoded by speaker embeddings. LISE preserves ASV performance with negligible EER degradation on x-vector and ECAPA-TDNN. Crucially, the interpretability of these components for human listeners is demonstrated through listening experiments, where participants distinguished speakers with 83.9% accuracy.

[NLP-166] ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

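Quick read: This paper addresses the difficulty of fine-grained credit assignment in reinforcement learning for multi-step LLM agents that rely on scalar rewards. Rubric-based rewards improve interpretability, but existing methods score only at the trajectory level and depend on a closed-source judge, so step-level credit assignment is unresolved and the scoring function stays static. The authors propose ARCO (Adaptive Rubric CO-evolution), built around a single model μ with a shared backbone and two heads: a generation head that produces per-step criteria, and a score head that predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step-level labels, while μ and the policy π are jointly updated on on-policy data so that rubric content and scoring function co-evolve at the parameter level. On HotpotQA, 2WikiMultiHopQA, and MuSiQue, ARCO outperforms strong outcome-, rubric-, and process-reward baselines, and analyses show that its rubrics are step-specific, robust to design choices, and useful for diagnosing agent behavior.

Link: https://arxiv.org/abs/2606.21262
Authors: Zihang Tian, Jingsen Zhang, Rui Li, Xiaohe Bo, Yuanzi Li, Xu Chen
Affiliations: Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: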

Abstract:Reinforcement learning for multi-step LLM agents often relies on scalar rewards that indicate success but cannot explain why a trajectory is good or bad. Rubric-based rewards improve interpretability through natural-language criteria, but existing methods score at the trajectory level and freeze the scorer behind a closed-source judge, leaving step-level credit assignment unresolved and the judge itself static. We propose ARCO (Adaptive Rubric CO-evolution), a rubric framework in which a same-scale model \mu shares a backbone with two heads: a generation head that produces per-step criteria, and a score head that predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step-level labels, while \mu and the policy \pi are jointly updated on on-policy data so that the rubric content and the scoring function co-evolve at the parameter level. Across HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open-source backbones, ARCO improves the best EM in every setting over strong outcome-, rubric-, and process-reward baselines, and analyses show that its rubrics are step-specific, robust to design choices, and useful for diagnosing agent behavior. Codes and data are available at this https URL.

[NLP-167] SCOPE: Sequential Conformal Probing for Reliable OOD Rejection in LLM Services

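Quick read: This paper addresses accurate detection and filtering of out-of-distribution (OOD) inputs in large language model (LLM) services, specifically how to reject requests outside the service scope before full generation. Existing detectors typically rely on final outputs or final-layer representations, which leaves unclear where service-boundary signals are encoded inside the model, and they lack theoretical guarantees for held-out inputs. The authors propose SCOPE (Sequential Conformal OOD Probing and Evaluation), which selects a readable hidden layer, builds a conformal gate calibrated on in-distribution (IND) data, and uses a supermartingale e-process to certify persistent service-boundary evidence. Experiments show that SCOPE improves gate-level rejection over standard final-layer detectors and reveal that different OOD boundaries take different geometric forms in hidden space.

Link: https://arxiv.org/abs/2606.21255
Authors: Zhuoyun Li, Boxuan Wang, Changshun Wu, Xiaowei Huang, Yi Dong
Affiliations: University of Liverpool
Categories: Computation and Language (cs.CL)
Comments: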

Abstract:Rejecting inputs outside the defined in-distribution (IND) service scope is critical for large language model (LLM) services, where unsupported requests should be filtered before full generation. Existing out-of-distribution (OOD) detectors often rely on final outputs or final-layer representations, leaving unclear where service-boundary signals are most clearly encoded inside the model; they also lack a theoretical guarantee for held-out inputs. In this paper, we introduce SCOPE (Sequential Conformal OOD Probing and Evaluation), a framework that selects a readable hidden layer, constructs a conformal gate with IND calibration, and uses a supermartingale e-process to certify persistent service-boundary evidence. Experiments across multiple LLM backbones and six carefully designed boundary conditions show that SCOPE improves gate-level rejection over standard final-layer detectors, while revealing how different OOD boundaries take different geometric forms in hidden space.
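The "conformal gate with IND calibration" in the abstract above can be illustrated with the standard split-conformal p-value. This is a generic sketch, not the SCOPE implementation: the nonconformity score (for example, a distance computed from a hidden-layer probe) is assumed to be supplied by the caller, and the names `conformal_p_value` and `reject_as_ood` are hypothetical. The e-process stage is not covered.

```python
def conformal_p_value(calibration_scores, test_score):
    """Split-conformal p-value for one test input.

    calibration_scores: nonconformity scores of held-out in-distribution inputs
    (higher means less typical). If the test input is exchangeable with the
    calibration inputs, the p-value is at most alpha with probability <= alpha.
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)


def reject_as_ood(calibration_scores, test_score, alpha=0.05):
    """Gate decision: reject the request when the p-value falls to alpha or below."""
    return conformal_p_value(calibration_scores, test_score) <= alpha
```

The smallest achievable p-value is 1 / (n + 1), so a gate at `alpha=0.05` needs at least 19 calibration scores before it can reject anything.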

[NLP-168] Does RoPE Prevent or Degrade Retrieval Heads? A Mechanistic Analysis Across Model Families

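Quick read: This paper studies the mechanism behind long-context recall, asking whether retrieval heads still function once rotary position embeddings (RoPE) are applied. The core question is whether RoPE, which rotates queries and keys at frequencies that decay with the base hyperparameter theta, prevents retrieval heads from forming or degrades their function. The analysis combines several lines of evidence: (1) retrieval heads are causally necessary, since masking them collapses recall from 1.00 to 0.00 while masking matched random heads has no effect; (2) higher theta does not reduce the number of retrieval heads, refuting the prevention hypothesis; (3) the norm-utility relation is family-specific and significant in opposite directions, ruling out theta as the driver; (4) a controlled patch shows that zeroing only the lowest-frequency RoPE dimensions of retrieval heads degrades recall in a dose-dependent way (from 1.00 to 0.18 when 32 of 128 dimensions are zeroed), whereas zeroing random dimensions leaves recall nearly intact (0.98), indicating that the causal variable is RoPE frequency rather than the norm-utility relation. Retrieval-head function therefore depends on specific low-frequency rotary components rather than on overall activation magnitude.

Link: https://arxiv.org/abs/2606.21249
Authors: Cengizhan Bayram
Affiliations: Independent Researcher
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 25 pages, 3 figures, 18 tables. Code, data, and a paired-seed reproducibility harness: this https URL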

Abstract:Retrieval heads, attention heads that copy information from earlier context to the current position, have been proposed as the mechanistic substrate for long-context recall. Rotary position embeddings (RoPE) rotate queries and keys by frequencies decaying with a base hyperparameter theta, and a natural hypothesis is that this rotation either prevents retrieval heads from forming or degrades their function. We test both across four open-weight 7-8B models spanning multi-head and grouped-query attention and a 100x range of theta, using paired-seed needle-in-a-haystack tests, layer-clustered permutation, and causal head-masking. (i) Retrieval heads are causally necessary: masking the 87 detected heads in OLMo-2 collapses recall from 1.00 to 0.00, while masking matched random heads has no effect; this replicates in Qwen. (ii) Higher theta does not reduce retrieval-head count (LLaMA-3.1 at theta=500K has 47 heads vs LLaMA-2 at theta=10K with 42), refuting the prevention hypothesis. (iii) The norm-utility relation is family-specific and significant in opposite directions (Qwen d=-0.49, OLMo d=+0.50, both significant; LLaMA null); since OLMo and LLaMA-3.1 share theta=500K yet differ, the effect is not theta-driven. (iv) Building on Chiang and Yogatama (2025), a controlled patch shows that zeroing the lowest-frequency RoPE dimensions of retrieval heads degrades recall dose-dependently (1.00 to 0.18 when 32 of 128 dimensions are zeroed, vs 0.98 for random dimensions); the effect is head-specific and task-specific. The causal variable is RoPE frequency, not norm-utility. The direction holds in all five models patched (OLMo-2, Qwen2.5-7B/14B, Gemma-2, Mistral) across four lineages and two scales. We do not claim cross-model magnitude. Code and a paired-seed harness are released.
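To make "the lowest-frequency RoPE dimensions" in the abstract above concrete, the sketch below lists the standard RoPE rotation frequencies, theta^(-2i/d) for pair index i, and picks the dimensions with the slowest rotation. It assumes the interleaved layout in which pair i occupies dimensions (2i, 2i+1); many implementations instead pair dimension i with i + d/2, so the index mapping would differ there. All function names are hypothetical, and this is not the paper's patching code.

```python
def rope_inv_freq(head_dim, theta):
    """Rotation frequency of each RoPE pair: theta ** (-2i / head_dim)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def lowest_freq_dims(head_dim, theta, n_dims):
    """Indices of the n_dims dimensions with the lowest rotation frequency.

    Assumes the interleaved layout: pair i covers dimensions (2i, 2i + 1).
    """
    freqs = rope_inv_freq(head_dim, theta)
    slowest_pairs = sorted(range(len(freqs)), key=lambda i: freqs[i])[: n_dims // 2]
    dims = []
    for i in sorted(slowest_pairs):
        dims.extend([2 * i, 2 * i + 1])
    return dims


def zero_dims(vector, dims):
    """Return a copy of a query or key vector with the given dimensions zeroed."""
    drop = set(dims)
    return [0.0 if j in drop else x for j, x in enumerate(vector)]
```

For `head_dim=128` and 32 zeroed dimensions this selects the last 16 pairs, whose rotation is slowest and which therefore vary least over long token distances.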

[NLP-169] OpenWER: Improving Cross-Lingual ASR Evaluation and Enabling Token-Based Accuracy Metrics

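【Quick read】: This paper addresses a fairness problem in multilingual automatic speech recognition (ASR) evaluation: the standard Word Error Rate (WER) metric behaves inconsistently on low-resource languages. Existing efforts to improve or replace WER focus mostly on English and neglect low-resource languages, which makes fair cross-lingual comparison difficult. The key to the solution is OpenWER, an open-source implementation that makes WER more robust across languages through language-specific normalisation and compound word detection. It also uses a token-based Levenshtein alignment that preserves complementary metrics and supports metadata embedding, enabling fine-grained accuracy analysis. In an evaluation over 52 languages, OpenWER yields absolute WER reductions of up to 25% compared with common libraries, improving the reliability and coverage of cross-lingual evaluation in ASR research.

Link: https://arxiv.org/abs/2606.21237
Authors: Korbinian Kuhn, Gottfried Zimmermann
Affiliations: Stuttgart Media University; University of Tübingen
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: 5 pages, 2 figures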

Abstract:Advances in deep learning and end-to-end Automatic Speech Recognition (ASR) have enabled robust multilingual models, but evaluation metrics remain limited in assessing accuracy. Efforts to improve or replace the common metric Word Error Rate (WER) often focus on English, leaving evaluations for low-resource languages under-explored and hindering fair cross-lingual comparisons. We present OpenWER, an open-source implementation that improves WER robustness through language-specific normalisation and compound word detection. A token-based Levenshtein alignment preserves complementary metrics and allows metadata embedding for granular accuracy scores. Our analysis of 52 languages shows absolute WER reductions of up to 25% compared to common libraries. OpenWER contributes to fairness in ASR research by increasing the reliability of WER across diverse languages and enabling more comprehensive accuracy evaluations.
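The effect of normalisation on WER can be illustrated with a minimal sketch of token-level Levenshtein scoring. This is not OpenWER's API: `edit_distance`, `wer`, and the toy `merge_compounds` normaliser are hypothetical names introduced here, and the compound rule is hard-coded for a single example.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion of a reference token
                           cur[j - 1] + 1,            # insertion of a hypothesis token
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str, normalize=lambda s: s.lower()) -> float:
    """Word error rate after applying the same normaliser to both strings."""
    ref_tokens = normalize(ref).split()
    hyp_tokens = normalize(hyp).split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def merge_compounds(text: str) -> str:
    """Toy, hypothetical compound rule: treat 'ice cream' and 'icecream' as equal."""
    return text.lower().replace("ice cream", "icecream")


plain = wer("The ice cream melted", "the icecream melted")
merged = wer("The ice cream melted", "the icecream melted", normalize=merge_compounds)
```

Without compound handling the split spelling costs one substitution and one deletion over four reference words. With the toy rule both strings tokenise identically and the error disappears.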

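[NLP-170] When Context Misleads: Surprisal Energy and Attention Entropy as Metrics of Coherence Illusions in LLMs

【Quick read】: This paper investigates whether Dutch language models exhibit "coherence illusions" similar to those of human readers: when a distractor in the context lexically matches what comes next, an incoherent text may still be judged coherent. The core question is how models process discourse coherence that depends on words linking back to earlier context, such as 'again' and 'too', and whether the underlying mechanism resembles human cognition. The approach examines the models' behavior from several angles. First, surprisal shows that models are more surprised by incoherent continuations, but surprisal drops when the prior context contains a distractor matching the continuation, indicating that models are swayed by surface lexical matches. Second, attention entropy identifies attention heads that behave differently under coherent and incoherent conditions, and ablating these heads produces transfer effects across experiments, suggesting a shared discourse-processing mechanism. Third, energy, borrowed from the associative-memory literature, is introduced as a new metric for quantifying discourse coherence. Overall, the study shows that coherence illusions arise in Dutch LLMs and uses interpretable measures such as entropy and energy to expose mechanisms that operate across settings.

Link: https://arxiv.org/abs/2606.21203
Authors: Ece Takmaz, Nitin Kumar, Li Kloostra, Jakub Dotlacil
Affiliations: Utrecht University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: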

Abstract:Psycholinguistics studies show that human readers fall for coherence illusions: an incoherent discourse can seem coherent simply because a distractor matches what comes next. We investigate whether Dutch language models (6 monolingual and 4 multilingual) show the same behavior on texts that link back to earlier context with words such as ‘again’ and ‘too’. First, we find that surprisal at the critical word tracks human acceptability judgments and eye-tracking data. Models are more surprised by incoherent continuations, but a matching distractor in the prior context reduces this surprisal. Second, attention entropy at the critical position identifies heads that behave differently under coherence vs. incoherence. We find that ablating these heads shows transfer effects across experiments, suggesting a shared mechanism. Third, we introduce energy from the associative-memory literature as a metric to quantify discourse coherence. Taken together, our results show that coherence illusions arise in Dutch LLMs, with entropy and energy exposing mechanisms that operate across settings.
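The two quantities named in the abstract have standard information-theoretic forms, sketched below. The log base and the absence of any normalisation are assumptions made here; the paper's exact definitions may differ, and the energy metric is not covered.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of a token the model assigned probability `prob`."""
    return -math.log2(prob)


def attention_entropy(weights: list[float]) -> float:
    """Shannon entropy in bits of one head's attention distribution at a position."""
    return -sum(w * math.log2(w) for w in weights if w > 0)


# A head spreading attention evenly over 4 tokens vs. one locked onto a single token.
diffuse = attention_entropy([0.25, 0.25, 0.25, 0.25])
focused = attention_entropy([1.0, 0.0, 0.0, 0.0])
```

Under this reading, a lower surprisal at the critical word after a matching distractor is the signature of the illusion, and heads whose entropy differs between coherent and incoherent inputs are the candidates for ablation.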

[NLP-171] Beyond Hooking Onto the World: Referential Profiles and the Numerical Structure of LLM Grounding

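【Quick read】: This paper revisits the problem of reference for large language models (LLMs) under the recent vector grounding framework, targeting two shortcomings of the current debate: reference is understood too thinly, and numerical realization is not adequately explained. The proposed account has two parts. First, reference should be understood as a context-sensitive, discourse-level, affectively shaped, and norm-governed "referential profile" rather than a fixed link between an isolated expression and an object; reference is stabilized through public practices of use, correction, distinction, inference, and continuation, not through identical representations inside individuals. Second, vector grounding must include an account of numerical realization: LLMs do not acquire reference through human perception, memory, or embodied understanding, but parameterize, through optimization, the linguistic traces of human world-directed practice. In a finite vector space, referential profiles are distributed and may be superposed, and they are recovered through context-sensitive computation. Weights, activations, attention-mediated hidden states, softmax-trained contrasts, and inner-product alignments are the mathematical sites at which inherited linguistic relations become stable and causally active. Mechanistic interpretability findings such as entity-like features, knowledge neurons, and emotion-related activation directions do not show that models possess human-style reference, but they support a more limited thesis: LLMs may possess derivative, language-mediated, profile-based, and numerically structured forms of reference.

Link: https://arxiv.org/abs/2606.21195
Authors: Joo Yull Rhee
Affiliations: SKKU (Sungkyunkwan University)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 29 pages, no figures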

Abstract:This paper revisits the grounding problem for large language models in light of recent vector-grounding accounts. I accept the shift from classical symbol grounding to vector grounding, but argue that the current debate remains incomplete in two respects. First, reference is often treated too thinly, as if it were a fixed link between an isolated expression and an object. I argue instead that reference is profile-based, context-sensitive, discourse-level, affectively shaped, and norm-governed. Even in the human case, reference is publicly stabilized through patterns of use, correction, distinction, inference, and continuation rather than through identical private representations. Second, vector grounding requires an account of numerical realization. LLMs do not acquire reference through human perception, memory, intention, embodiment, or understanding. Rather, through optimization, they parameterize linguistic traces of human world-directed practice. In a finite vector system, referential profiles must be distributed, may be superposed, and are recovered through context-sensitive computation. Weights, activations, attention-mediated hidden states, softmax-trained contrasts, and inner-product alignments are the mathematical sites at which inherited linguistic relations become stable and causally active. Mechanistic interpretability findings, including entity-like features, knowledge neurons, and emotion-related activation directions, provide indirect support for this view. They do not show that LLMs possess human reference. They support a more limited thesis: LLMs may possess derivative, language-mediated, profile-based, and numerically structured forms of reference.

[NLP-172] MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

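【Quick read】: This paper addresses the limited ability of medical vision-language models (Med-VLMs) to generate descriptions that patients can understand. With the US 21st Century Cures Act mandating immediate patient access to diagnostic imaging results, bridging the "Expert-Lay Gap" in medical language has become a key challenge for shared decision-making and patient education. The core contribution is MedLayXPlain, the first large-scale multimodal benchmark and evaluation framework for this task. It contains 122,789 region-grounded samples across 8 imaging modalities, anchored in a three-level Unified Medical Language System (UMLS) ontology hierarchy for semantic standardization. To produce high-quality lay captions, the authors introduce Hierarchical Ontology-Verified Refinement (HOVER), a three-step pipeline of patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification that enforces semantic equivalence while suppressing hallucination. They also propose MedLayEval, a lightweight 3B-parameter evaluator distilled from a 27B-parameter verifier, which scores expert-lay alignment along five clinically grounded attributes and addresses the poor correlation between standard natural language generation (NLG) metrics and clinical judgment. Experiments show that medical VLMs perform well on expert-level captions but degrade significantly in the lay register, while general-purpose VLMs are more accessible but lack clinical precision, indicating that neither current paradigm adequately serves patient-facing communication.

Link: https://arxiv.org/abs/2606.21194
Authors: Han Jang, Junhyeok Lee, Songsoo Kim, Chae Young Lim, Hyeonjin Goh, Heeseong Eum, Kyu Sung Choi
Affiliations: Seoul National University; Seoul National University Hospital; Seoul National University College of Medicine; Sungkyunkwan University School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 40 pages (10 pages main text, 30 pages appendix), 4 main figures, 33 vision-language models benchmarked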

Abstract:Medical Vision-Language Models (Med-VLMs) achieve strong expert-level performance, yet their ability to generate patient-accessible descriptions remains underexplored. With the 21st Century Cures Act now mandating immediate patient access to diagnostic imaging results, evaluating whether Med-VLMs can bridge this Expert-Lay Gap is both urgent and clinically consequential for patient education and shared decision-making. To this end, we introduce MedLayXPlain, the first large-scale multimodal benchmark and evaluation framework for Medical Lay Language Generation (MLLG). MedLayXPlain-122K provides 122,789 region-grounded samples across 8 imaging modalities from 12 publicly available source datasets, each comprising a medical image with paired expert and lay captions anchored in a three-level Unified Medical Language System (UMLS) ontology hierarchy spanning 7 semantic groups, 43 semantic types, and 2,411 medical concepts. Lay captions are constructed via Hierarchical Ontology-Verified Refinement (HOVER), a three-step pipeline combining patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to enforce semantic equivalence while preventing hallucination. We further introduce MedLayEval, a lightweight 3B evaluator distilled from a 27B verifier that scores expert-lay alignment across five clinically grounded attributes, addressing the poor correlation between standard NLG metrics and clinical judgment. Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.

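[NLP-173] Dementia-Agents: A Multi-Modal Multi-Agent System for Dementia Staging and Phenotyping

【Quick read】: This paper tackles the heterogeneity and incompleteness of multi-modal data in real-world dementia diagnosis. It moves beyond the prevailing Alzheimer's disease (AD)-centric framing of binary detection or three-stage progression modeling toward syndrome-level dementia diagnosis spanning multiple phenotypes, etiologies, and stages. The core solution is Dementia-Agents, a clinically aligned multi-agent framework with a three-step workflow: (1) a data agent translates structured clinical records into semantically faithful textual representations that preserve missing-data signals and routes them to domain-aligned expert agents; (2) five fine-tuned expert agents generate predictions for their respective clinical domains; (3) a coordinator agent performs probabilistic aggregation to produce the final staging and phenotyping decisions. Evaluated on a real-world cohort of 1,066 patients from two cognitive neurology services, the approach outperforms monolithic multi-modal large language models (MLLMs) and prior medical multi-agent systems on syndrome-level dementia staging and phenotyping while preserving domain-level interpretability.

Link: https://arxiv.org/abs/2606.21168
Authors: Yaling Shen, Maja Christensen, Yiwen Jiang, Jenna Dennison, David Darby, Amy Brodtmann, Zongyuan Ge
Affiliations: University of Melbourne; Monash University
Categories: Computation and Language (cs.CL)
Comments: 8 pages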

Abstract:Dementia diagnosis requires integrating multi-modal clinical assessments from diverse informants and clinicians under incomplete and heterogeneous data conditions. Yet most AI-driven approaches remain Alzheimer’s disease (AD)-centric, framing the problem as binary AD detection or three-stage AD progression modeling within well-curated research settings. This pathology-driven paradigm overlooks the broader, syndrome-level nature of dementia, which spans multiple stages, phenotypes, and etiologies. In this paper, we propose Dementia-Agents, a clinically aligned multi-agent framework for real-world dementia staging and phenotyping. The framework follows a three-step workflow: (1) a data agent translates structured clinical records into semantically faithful textual representations that preserve missing-data signals and routes them to domain-aligned experts; (2) five fine-tuned expert agents generate domain-level predictions; and (3) a coordinator agent performs probabilistic aggregation to produce final staging and phenotyping decisions. We develop and evaluate Dementia-Agents on a real-world clinical cohort of 1,066 patients from two cognitive neurology services. Compared with monolithic multi-modal large language models (MLLMs) and prior medical multi-agent systems, our approach achieves consistent improvements in diagnostic performance for real-world syndrome-level dementia staging and phenotyping, while preserving domain-level interpretability.

[NLP-174] Who Checks the Citations? Benchmarking Legal Hallucination Detection

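【Quick read】: This paper addresses citation hallucination, the widespread fabrication of citations when attorneys, judges, and pro se filers use generative AI to draft legal documents. Despite predictions that newer models would hallucinate less or that court sanctions would deter misconduct, the authors find more than 1,000 filings containing fabricated citations, a number that grows year over year. The core contribution is a taxonomy of legal citation hallucinations grounded in actual court filings, together with a dataset of 1,300 brief excerpts containing injected errors for evaluating AI-based detection. Benchmarking five models in agentic and non-agentic settings shows that the latest models do better (GPT-5 reaches 82.8% recall and a 60.5% F1 score in an agentic framework) but still struggle with subtle error categories. Agentic verification is also resource-intensive, with GPT-5 averaging 16.9 steps per excerpt, and restricted access to commercial legal databases limits even the best agents. The study highlights the resulting policy concerns and provides a dataset, tools, and policy recommendations as a foundation for building and auditing reliable legal citation checking tools.

Link: https://arxiv.org/abs/2606.21155
Authors: Patty Liu, Dominik Stammbach, Peter Henderson
Affiliations: Princeton University
Categories: Computation and Language (cs.CL)
Comments: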

Abstract:Attorneys, judges, and pro se filers increasingly use AI to draft legal documents, yet these tools frequently fabricate citations. Despite predictions that newer models would hallucinate less or that court sanctions would deter negligent filers, we found over 1,000 filings containing fabricated citations – with this number growing year-over-year. This study evaluates whether AI-based systems can mitigate these errors by automatically detecting hallucinations. We propose a taxonomy of legal citation hallucinations grounded in actual court filings and introduce a dataset of 1,300 brief excerpts containing injected errors. Benchmarking five models in agentic and non-agentic settings reveals that while the latest iterations perform better – GPT-5 achieves 82.8% recall and a 60.5% F1 score in an agentic framework – all models struggle with subtle error categories. Agentic verification remains resource-intensive, with GPT-5 averaging 16.9 steps per excerpt. Furthermore, restricted information access limits the efficacy of even the best agents. This gap creates policy concerns, as it disadvantages both AI systems and litigants who lack subscriptions to commercial legal databases. Together, our dataset, tools, and policy recommendations provide a foundation for building and auditing reliable legal citation checking tools.
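The gap between 82.8% recall and a 60.5% F1 score says something about precision, which the abstract does not report. The sketch below inverts the F1 formula under the assumption that both figures describe the same positive class without averaging across error categories. The function name is hypothetical and the result is an implied value under that assumption, not a number from the paper.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2.0 * recall - f1
    if denom <= 0:
        raise ValueError("F1 cannot reach this value at the given recall")
    return f1 * recall / denom


# Figures quoted in the abstract for GPT-5 in the agentic setting.
precision = implied_precision(f1=0.605, recall=0.828)
```

Under this assumption precision comes out a little below one half, which would mean roughly one false alarm for every correctly flagged citation.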

[NLP-175] AdaMem: Learning What to Remember for Personalized Long-Horizon LLM Agents

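【Quick read】: This paper addresses memory bloat in long-term memory systems for Large Language Model (LLM) agents, which arises from remembering everything indiscriminately. Conventional methods retain all dialogue information uniformly, but in production, inference cost and finite context budgets make this untenable: irrelevant trivia crowds out useful information and steadily erodes question-answering (QA) accuracy. The core challenge is selective write control, that is, deciding what is worth keeping based on each user's role and needs. The paper proposes AdaMem (Adaptive Memory), which learns a personalized, role-dependent memory strategy from user feedback. It maintains a structured, role-specific Memory Policy and refines it from weekly QA feedback through a lightweight, patch-style self-reflection step, with failure rollback for stability. To evaluate the method, the authors build AdaMem-Bench, a benchmark that simulates weeks of interaction with week-by-week feedback. Across two extraction models and two feedback modes, AdaMem improves QA accuracy by up to 9.0% over the uniform Mem0 baseline while shrinking memory volume by 9%.

Link: https://arxiv.org/abs/2606.21144
Authors: Xingyu Chen, Rui Wang, Zhaopeng Tu, Liefeng Bo
Affiliations: Shanghai Jiao Tong University; Tencent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: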

Abstract:Long-term memory systems for Large Language Model (LLM) agents typically try to remember everything, extracting memories uniformly to retain as many facts as possible. In production, however, inference cost and finite context budgets make this untenable: beyond consolidating raw dialogue into memory, an agent must exert write control, efficiently keeping only the information each user actually cares about. Otherwise, long-horizon personalized interactions suffer memory bloat, where irrelevant trivia crowds out useful information and steadily erodes question-answering (QA) accuracy. We argue that what is worth remembering is role-dependent, and propose AdaMem (Adaptive Memory), a method that learns what to remember for each user from feedback. AdaMem maintains a structured, role-specific Memory Policy and refines it from weekly QA feedback through a lightweight, patch-style self-reflection step with failure rollback. To study this setting, we build AdaMem-Bench, a benchmark that simulates weeks of interaction with week-by-week QA. Across two extraction models and two feedback modes, AdaMem improves QA accuracy by up to +9.0% over the uniform Mem0 baseline while shrinking memory volume by 9%.

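[NLP-176] A Multi-Agent Audit Framework for High-Stakes Reasoning: Evaluation and Interpretability in Clinical Mental Health Screening

【Quick read】: This paper addresses hallucination and low interpretability of large language models (LLMs) on high-stakes reasoning tasks under zero-shot settings, particularly in clinical mental health screening, where single-model reasoning can yield unreliable results. The key to the solution is a Multi-Agent Audit Framework that simulates a collaborative, multi-step verification process, decomposing reasoning into four stages: a Perception Agent, Knowledge Retrieval-Augmented Generation (RAG), Chain-of-Thought (CoT) clinical inference, and a critical Audit verification stage. The framework is implemented as a modular LangChain workflow and evaluated on the DAIC-WOZ dataset with locally deployed open-source models. Compared with single-agent baselines, the multi-agent pipeline reduces the Mean Absolute Error (MAE) of PHQ-8 depression severity prediction from 5.35 to 5.02. By exposing cross-agent validation traces, it mitigates reasoning drift and yields interpretable diagnostic rationales, offering a generalizable paradigm for reliable AI-assisted decision support.

Link: https://arxiv.org/abs/2606.21123
Authors: Jingchen Ye, Yanpei Yu, Luyao Zhang
Affiliations: Duke Kunshan University
Categories: Computation and Language (cs.CL)
Comments: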

Abstract:High-stakes reasoning tasks necessitate transparent and verifiable workflows, yet conventional single-model large language models (LLMs) often struggle with hallucination and low interpretability under zero-shot paradigms. To address this general AI challenge, we propose a Multi-Agent Audit Framework that simulates a collaborative, multi-step verification process. We empirically validate this architecture in the sensitive domain of clinical mental health screening using a modular LangChain workflow. Our framework decomposes the reasoning process into a Perception Agent, Knowledge Retrieval-Augmented Generation (RAG), Chain-of-Thought (CoT) clinical inference, and a critical Audit verification stage. We evaluated this framework on the DAIC-WOZ dataset using locally deployed open-source models. Experimental results demonstrate that our multi-agent pipeline significantly outperforms single-agent baselines, reducing the Mean Absolute Error (MAE) for PHQ-8 depression severity prediction from 5.35 to 5.02. By exposing cross-agent validation traces, the framework mitigates reasoning drift and provides highly interpretable diagnostic rationales, offering a generalizable paradigm for reliable AI-assisted decision support beyond isolated model scaling. We make data and code open access on GitHub for replicability.

[NLP-177] Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making in Large Language Models

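【Quick read】: This paper addresses large language models (LLMs) producing confident but protocol-invalid answers in domains where procedural compliance is critical. The test case is clinical decision-making for sudden sensorineural hearing loss (SSNHL), where correct management depends on protocol-consistent interpretation of symptom timing, Weber/Rinne tuning-fork findings, and otoscopic findings. The key to the solution is Answer Engineering, a deterministic runtime and authoring layer that applies localized, rule-guided interventions to the visible reasoning trajectory during standard autoregressive generation, without retraining, modifying model weights, or performing global search. In the benchmark, step-by-step reasoning alone shifted errors rather than eliminating them: SSNHL compliance fell from 54.5% under unguided generation to 25.1%, while acceptance on the conductive contrast condition rose from 1.6% to 58.9%. Local trajectory editing raised SSNHL compliance to 83.5% and conductive-case adherence to 77.9%, lifting balanced accuracy from 42.0% under reasoning-only generation to 80.7%. The results support a systems-level view in which protocol adherence can be improved through auditable runtime control of reasoning trajectories, while also identifying limitations from rule coverage, trigger reliability, and persistent diagnosis-first generation dynamics.

Link: https://arxiv.org/abs/2606.21121
Authors: Victor Lavrenko, Anastasiia Molodnitskaia
Affiliations: PeaceTech VC; Rambam Health Care Campus
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 31 pages, 6 figures. Code and data: this https URL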

Abstract:Large language models can produce confident but protocol-invalid answers in domains where procedural compliance is critical. This paper presents Answer Engineering, a deterministic runtime and authoring layer that applies localized rule-guided interventions to the visible reasoning trajectory during standard autoregressive generation, without retraining, modifying model weights, or performing global search. The method is evaluated on a controlled clinical benchmark for sudden sensorineural hearing loss (SSNHL), where correct management depends on protocol-consistent interpretation of symptom timing, Weber/Rinne tuning-fork findings, and otoscopic findings. In the benchmark, step-by-step reasoning shifted rather than eliminated errors: compliant outcomes for SSNHL decreased from 54.5% under unguided generation to 25.1%, while acceptance on the conductive contrast condition increased from 1.6% to 58.9%. Local trajectory editing increased SSNHL compliance to 83.5% and conductive-case adherence to 77.9%, raising balanced accuracy from 42.0% under reasoning-only generation to 80.7%. The results support a systems-level view in which protocol adherence can be improved through auditable runtime control of reasoning trajectories, while also identifying limitations caused by rule coverage, trigger reliability, and persistent diagnosis-first generation dynamics.
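The balanced-accuracy figures in the abstract can be checked against the per-condition rates. The sketch below assumes balanced accuracy is the unweighted mean of the SSNHL and conductive-case rates. That definition is an assumption made here, though it reproduces both reported values.

```python
def balanced_accuracy(per_condition_rates: list[float]) -> float:
    """Unweighted mean of per-condition correct rates (in percent)."""
    return sum(per_condition_rates) / len(per_condition_rates)


# Reasoning-only generation: SSNHL 25.1%, conductive contrast 58.9%.
reasoning_only = round(balanced_accuracy([25.1, 58.9]), 1)

# Local trajectory editing: SSNHL 83.5%, conductive contrast 77.9%.
trajectory_editing = round(balanced_accuracy([83.5, 77.9]), 1)
```

The two results match the 42.0% and 80.7% quoted in the abstract, which also shows why the reasoning-only gain on the conductive condition did not help overall: it was offset by the drop on SSNHL cases.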

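[NLP-178] LLM-Based Multi-Reference Evaluation for Efficient and Robust Assessment of Phrase Break Annotations INTERSPEECH2026

【Quick read】: This paper addresses the reliability of evaluating phrase break annotations for speech. Existing methods have clear limitations: single-reference evaluation assumes a unique gold phrasing for each utterance, ignoring the one-to-many nature of prosodic phrasing, while human judgment is flexible but labor-intensive and hard to scale. The paper proposes LLM-based Multi-Reference Evaluation (LMRE), which generates multiple valid phrasings from minimal demonstrations and thereby models the one-to-many nature of prosodic phrasing. On a Korean testbed of 1,356 annotations covering five strategies, LMRE aligns more closely with human judgment than single-reference evaluation in both acceptance behavior and score correlation, showing that it achieves scalability and multi-reference support together and highlighting the potential of LLMs for evaluation in the speech domain.

Link: https://arxiv.org/abs/2606.21098
Authors: Younghan Park, Hoyeon Lee, Hawon Jeong, Jong-Hwan Kim
Affiliations: NAVER Cloud; Yonsei University; KAIST AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at Interspeech 2026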

Abstract:Reliable evaluation of phrase break annotations is crucial, as subtle variations in prosodic boundaries directly affect the clarity and naturalness of speech. However, existing approaches exhibit major limitations: single-reference evaluation assumes a unique gold phrasing for an utterance despite multiple valid phrasings, while human judgment, though flexible, is labor-intensive and unscalable. To address these, we propose LLM-based Multi-Reference Evaluation (LMRE) for phrase break annotations that models the one-to-many nature of prosodic phrasing and generates multiple valid phrasings from minimal demonstrations. On a Korean testbed of 1,356 annotations covering five strategies, LMRE shows stronger alignment with human judgment than single-reference evaluation in both acceptance behavior and score correlation. Our findings demonstrate that LMRE effectively achieves both scalability and multi-reference support, highlighting the potential of LLMs for evaluation in the speech domain.
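The difference between single-reference and multi-reference scoring can be shown with a small sketch. It assumes a phrasing is represented as a set of break positions and that a candidate is scored by its best F1 against any reference. Both the representation and the max-over-references rule are illustrative assumptions, not the scoring defined in the paper, and the LLM step that generates the references is omitted.

```python
def break_f1(pred: set[int], ref: set[int]) -> float:
    """F1 between predicted and reference break positions (word indices)."""
    if not pred and not ref:
        return 1.0
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(pred: set[int], refs: list[set[int]]) -> float:
    """Score against the closest of several valid phrasings."""
    return max(break_f1(pred, ref) for ref in refs)


candidate = {2, 5}
gold_only = break_f1(candidate, {3, 5})                       # one "gold" phrasing
with_alternatives = multi_reference_score(candidate, [{3, 5}, {2, 5}])
```

Here the candidate is penalised under the single gold phrasing but receives full credit once an equally valid alternative is in the reference set, which is the one-to-many behavior the abstract describes.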

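[NLP-179] GRAG: Generic Response-Augmented Generation Framework for Personalized Conversational Systems

【Quick read】: This paper addresses the challenge of deploying capable personalized conversational agents in resource-constrained or privacy-sensitive environments. The core problem is that current training paradigms treat personalization and contextual grounding as a single, coupled learning problem, forcing the language model to handle both what to say (content grounding) and how to say it in a user-specific way (personalization), which creates significant computational and optimization difficulties. This coupling often trades grounding against persona adherence, producing responses that are either weakly grounded in the conversation or insufficiently personalized. The paper proposes the Generic Response-Augmented Generation (GRAG) framework, which decouples the two objectives: generic responses produced offline by high-capacity, general-purpose LLMs serve as a semantic and structural scaffold that guides the fine-tuning of smaller, task-specialized models. During fine-tuning the model can then focus on persona injection while staying anchored to the conversational history. The authors instantiate GRAG in post-fusion and pre-fusion architectural variants and evaluate them on multiple benchmark datasets covering diverse personalization structures. GRAG outperforms state-of-the-art methods that do not use auxiliary scaffolding, with improvements of up to 47% in ROUGE-2 and 36% in BLEU.

Link: https://arxiv.org/abs/2606.21097
Authors: Junfeng Liu, Christopher T. Symons, Ranga Raju Vatsavai
Affiliations: North Carolina State University; Lirio, Inc.
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: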

Abstract:Deploying highly capable personalized conversational agents in resource-constrained or privacy-sensitive environments remains a significant challenge. We identify a fundamental bottleneck in the existing approaches: current training paradigms treat personalization and grounding as a single monolithic learning problem. Under these paradigms, language models are forced to simultaneously address what to say (content grounding) and how to say it in a user-specific way (personalization), which introduces significant computational and optimization challenges. Consequently, contextual grounding is often sacrificed for persona adherence, or vice versa, resulting in responses that are either weakly grounded in the conversational history or insufficiently personalized. In this work, we propose the Generic Response-Augmented Generation (GRAG) framework that decouples these competing objectives by leveraging offline, generic responses from high-capacity, general-purpose LLMs as a semantic and structural scaffold to guide the fine-tuning of smaller, task-specialized models seamlessly in resource-limited environments. By decoupling the content grounding from personalization, GRAG allows the model to focus exclusively on persona injection while remaining firmly anchored to the conversational context. We instantiate the GRAG in two post- and pre-fusion-based architectural variants and evaluate them on multiple benchmark conversational datasets that cover diverse personalization structures. Our results demonstrate that GRAG significantly outperforms state-of-the-art methods that do not use auxiliary scaffolding, yielding up to 47% improvements in ROUGE-2 and 36% in BLEU scores. Ultimately, GRAG offers a generalizable blueprint for building grounding-aware personalized conversational systems in resource-limited environments.

[NLP-180] Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations

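【Quick read】: This paper addresses the safety challenge of multi-turn jailbreaks, in which attackers spread unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation to evade turn-level moderation. It frames multi-turn jailbreak detection as conversation-level classification and proposes an efficient hierarchical detector. The model encodes each turn separately into a compact turn representation and applies a lightweight conversation module that captures dialogue dynamics and selectively attends to fine-grained evidence only when needed, avoiding expensive long-context concatenation. On a challenging benchmark of 14,038 conversations, the method achieves an F1 of 0.9394, outperforming the strongest baseline, Claude Opus 4.7, by 0.07 while halving its false-positive rate. Ablations show that combining cross-attention with self-attention in the conversation module lowers the false-positive rate by 2.26 percentage points relative to the self-attention-only variant.

Link: https://arxiv.org/abs/2606.21082
Authors: Chenhui Hu, Muhammed Salih, Sudipto Guha, Subramanian Srinivasan
Affiliations: Zscaler, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: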

Abstract:Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation. We address multi-turn jailbreak detection as a conversation-level classification problem and introduce an efficient hierarchical detector that avoids expensive long-context concatenation while retaining cross-turn reasoning. The model encodes individual turns to form compact turn representations and applies a lightweight conversation module that captures dialogue dynamics and selectively attends to fine-grained evidence when needed. On a challenging evaluation benchmark of 14,038 conversations, our approach achieves an F1 of 0.9394, outperforming Claude Opus 4.7, the strongest competing baseline, by 0.07 while halving its false-positive rate. Ablation studies confirm that each architectural component contributes meaningfully, with combining cross-attention and self-attention in the conversation module yielding a 2.26 percentage point reduction in false-positive rate over the self-attention-only variant.

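[NLP-181] A Validation-Gated Mechanistic Account of Suicidality Detection in LLMs

【Quick read】: This paper addresses the interpretability of large language models (LLMs) in mental-health applications such as detecting suicidal content, and more narrowly how to make causal claims about a model's internal features more trustworthy. The key is a validation-gated framework: a concept is admitted for analysis only once the model ranks it above a simple lexical baseline, and each subsequent property is tested against a matched control. This discipline produces negative as well as positive results. For example, on DeepSuiMind, Llama-3.1-8B-Instruct cannot separate implicit suicidal intent from ordinary distress, so that task is not analyzed further. The study then focuses on binary suicide detection and identifies a low-rank, mid-network feature that appears semantic rather than keyword-based, is causally implicated in the decision (ablating it degrades the judgment, while ablating a random direction does not), and recurs across three model families and three suicide datasets. A register-matched control (suicide versus depression) suggests the feature tracks suicidality more specifically than general distress. Steering along the feature raises the model's response, but it does so for unrelated questions as well, so the authors treat the feature as necessary but not sufficient. The clearest pattern separates encoding from use: smaller models already represent suicidality, yet only larger ones appear to act on it. The positive evidence comes from English Reddit text, which limits direct clinical applicability.

Link: https://arxiv.org/abs/2606.21078
Authors: Nafiz Ahmed, Sarah Sharif, Dingjing Shi, Mike Banad
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: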

Abstract:Large language models are increasingly proposed for mental-health applications such as detecting suicidal content, raising the question of what they rely on. We study this mechanistically and use it to ask a narrower question: how to make a causal claim about a model’s internal features more trustworthy. Our validation-gated framework, with suicidality detection as a case study, interprets a behavior only after the model is shown to perform it: a concept is admitted only once the model ranks it above a simple lexical baseline, and each subsequent property is tested against a matched control. This discipline yields negative as well as positive results. The gate rules out one task at the outset: on DeepSuiMind (Li et al. 2025), Llama-3.1-8B-Instruct cannot separate implicit suicidal intent from ordinary distress, so we do not analyze it. We turn to binary suicide detection, which it does perform. There we find a mid-network feature that appears semantic rather than keyword-based, is causally implicated in the decision (ablating it degrades the judgment; a random direction does not), is low-rank, and recurs across three model families and three suicide datasets. A register-matched control (suicide versus depression) suggests it tracks suicidality more specifically than general distress. Steering raises the model’s response, but for unrelated questions too, so we treat it as necessary but not sufficient. The clearest pattern separates encoding from use: smaller models already represent suicidality, yet only larger ones appear to act on it. The positive evidence is English Reddit text, which limits the clinical reading.

[NLP-182] OTTER: A Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization

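【Quick read】: This paper exposes a fundamental brittleness in the toxicity-based moderation filters that production LLMs rely on, namely the assumption that harmful intent correlates with toxic surface wording. The core contribution is OTTER (Obfuscated Toxicity-Evading Token Evolution for Rewriting), a black-box red-teaming framework that needs only standard API access. By replacing as few as five tokens, it decouples surface toxicity from adversarial intent and evades existing toxicity detection. Evaluated on 457 AdvBench prompts across four GPT models, OTTER raises average ASR from 7.0% to 84.0%. The paper also provides the first quantitative analysis of the toxicity-bypass relationship and a per-category breakdown, with actionable recommendations for hardening classifiers in production.

Link: https://arxiv.org/abs/2606.21077
Authors: Jerry Wang, Hsin-Ling Hsu, Yi-Cheng Lai, Nai-Chia Chen, Fang Yu
Affiliations: National Chengchi University; University of Illinois Urbana-Champaign
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: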

Abstract:Production LLMs increasingly rely on toxicity-based moderation filters as a primary defense, assuming that harmful intent correlates with toxic surface wording. We show this assumption is fundamentally brittle: surface toxicity and adversarial intent can be decoupled by replacing as few as five tokens. We present OTTER (Obfuscated Toxicity-Evading Token Evolution for Rewriting), a black-box red-teaming framework requiring only standard API access, directly targeting the practical constraints of industry security audits. Evaluated on 457 AdvBench prompts across four GPT models, OTTER raises average ASR from 7.0% to 84.0%. We further provide the first quantitative analysis of the toxicity–bypass relationship and a per-category breakdown, translating our findings into actionable recommendations for classifier hardening in production deployments.

[NLP-183] FiLM-Coordinated Dual-Branch Transformer for Global-Local Dependency Modeling in Language Modeling ATC

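【Quick read】: This paper addresses the tension in standard Transformers that arises from using a single self-attention pathway for both global dependencies and local patterns, which pits long-range structural reasoning against fine-grained local representation learning. The key to the solution is a dual-branch Transformer coordinated by FiLM (Feature-wise Linear Modulation): each layer explicitly contains a global branch and a local branch, and a bidirectional FiLM module provides dynamic cross-branch coordination in place of simple concatenation or static addition. The underlying idea is that the two branches are different dependency views of the same input, so channel-wise calibration suits them better than heavy token-level interaction. On several small-scale language modeling settings under a fixed lightweight configuration, the structure consistently outperforms same-width single-branch baselines and weakened dual-branch variants, and on TinyShakespeare and a 1M-character subset of WikiText-2 it achieves the best results among same-width structural baselines. Multi-seed results support the stability of the gains, and mechanistic analyses show that FiLM learns input-dependent, layer-dependent, and channel-selective modulation patterns rather than static scaling. Parameter-matched widened single-branch baselines also indicate that the design leaves room for improvement in parameter efficiency.

Link: https://arxiv.org/abs/2606.21075
Authors: Zhiqiang Zhou, Xu Ling, Junliang Dai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures, 7 tables. Small-scale language modeling study on FiLM-coordinated dual-branch Transformer architectures, including multi-seed evaluation, cross-dataset validation, ablation studies, efficiency analysis, and parameter-matched fairness baselines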

Abstract:Standard Transformers use a single self-attention pathway to model both global dependencies and local patterns, creating tension between long-range structural reasoning and fine-grained local representation learning. We propose a FiLM-coordinated dual-branch Transformer for language modeling, where each layer explicitly contains a global branch and a local branch, and feature-wise linear modulation (FiLM) is used for dynamic cross-branch coordination instead of simple concatenation or static addition. The key idea is that the two branches represent different dependency views of the same input, making channel-wise calibration more suitable than heavy token-level interaction. We therefore design a bidirectional FiLM module in which each branch generates per-channel scaling and shifting parameters to condition the other. Experiments on multiple small-scale language modeling settings show that the proposed structure consistently outperforms same-width single-branch baselines and weakened dual-branch variants under a fixed lightweight configuration. On TinyShakespeare and a 1M-character subset of WikiText-2, the full dual-branch FiLM model achieves the best results among same-width structural baselines. Multi-seed results support the stability of the gains, while mechanistic analyses show that FiLM learns input-dependent, layer-dependent, and channel-selective modulation patterns rather than static scaling. Parameter-matched widened single-branch baselines also indicate that the current design still leaves room for improvement in parameter efficiency.
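The bidirectional FiLM idea can be sketched for a single token position in plain Python. This assumes the basic FiLM form, output = gamma * x + beta, with gamma and beta produced by one linear layer each. The layer shapes, function names, and the choice to condition both branches on pre-modulation features are assumptions made here, not details confirmed by the abstract.

```python
def linear(x, weight, bias):
    """Dense layer on a single vector: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(target, cond, gamma_layer, beta_layer):
    """Modulate `target` per channel with a scale and shift predicted from `cond`."""
    gamma = linear(cond, *gamma_layer)
    beta = linear(cond, *beta_layer)
    return [g * t + b for g, t, b in zip(gamma, target, beta)]


def bidirectional_film(global_feat, local_feat, global_from_local, local_from_global):
    """Each branch conditions the other; both read the pre-modulation features."""
    new_global = film(global_feat, local_feat, *global_from_local)
    new_local = film(local_feat, global_feat, *local_from_global)
    return new_global, new_local


# Toy 2-channel example: zero weights make gamma and beta equal to their biases.
zeros = [[0.0, 0.0], [0.0, 0.0]]
layers = ((zeros, [2.0, 2.0]), (zeros, [1.0, 1.0]))   # (gamma_layer, beta_layer)
g_out, l_out = bidirectional_film([1.0, 3.0], [5.0, 7.0], layers, layers)
```

With non-zero weights the scale and shift vary with the other branch's features, which is what makes the modulation input-dependent rather than a static per-channel rescaling.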

[NLP-184] Quality and Agreement in Multilabel Emotion Annotation: A Case Study and Evaluation Framework LREC2026

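[Quick Read]: This paper addresses the inherent subjectivity of emotion annotation. Conventional NLP pipelines assume that a "gold" label exists, derive a single label by majority vote, and treat differences between annotators as noise, which ignores the legitimately plural interpretations that arise during annotation. Through a multilabel emotion annotation case study, the authors analyze how annotator behavior and aggregation strategy affect agreement estimates and the performance of downstream emotion classifiers. The key idea is to stop collapsing disagreement into a single label: targets are represented as soft vote-share labels (including an intensity-weighted variant), and models are evaluated with both thresholded metrics (macro-/micro-F1) and a probabilistic alignment measure (Bernoulli cross-entropy soft loss, SoftBCE), supplemented by data-derived disagreement diagnostics. The study finds that disagreement is structured and leaves measurable traces in model behavior: hard labels may score higher on F1, while soft supervision better reflects actual annotator variance and uncertainty. The work offers practical guidance for designing, aggregating, and evaluating multilabel emotion datasets.

Link: https://arxiv.org/abs/2606.21069
Authors: Emily Öhman, Anna Koufakou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published in the Proceedings of the 1st Workshop on Computational Affective Science, CAS 2026, co-located with LREC 2026. This version corresponds to the published workshop paper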

Abstract:Emotion annotation is inherently subjective, yet most NLP pipelines still assume “gold” labels, typically produced by majority voting, and treat annotator variation as noise. In this paper, we present a multilabel emotion annotation case study and use it to examine how annotator behavior and aggregation choices affect both agreement estimates and downstream emotion classifiers. Rather than collapsing disagreement into a single label, we represent targets as soft vote-share labels (including an intensity-weighted variant) and evaluate models using both thresholded metrics (macro-/micro-F1) and probabilistic alignment (Bernoulli cross-entropy SoftBCE), alongside data-derived disagreement diagnostics. Across annotation regimes, we show that disagreement is structured and leaves measurable traces in model behavior: hard labels may maximize F1 metrics, while soft supervision yields predictions that better reflect empirical annotator variance and uncertainty. Our results provide practical guidance for designing, aggregating, and evaluating multilabel emotion datasets when multiple interpretations are plausible.
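
The difference between hard and soft targets can be sketched in a few lines of Python. This is an illustrative example under assumptions: the three-emotion label set, the toy annotations, and the function names are invented here, and the intensity-weighted variant mentioned in the abstract is not shown.

```python
import math

EMOTIONS = ["joy", "anger", "sadness"]   # hypothetical label set

def vote_share(annotations):
    """Fraction of annotators who selected each emotion for one item."""
    n = len(annotations)
    return [sum(label in ann for ann in annotations) / n for label in EMOTIONS]

def majority_vote(shares):
    """Hard labels: 1 if more than half of the annotators chose the emotion."""
    return [1.0 if s > 0.5 else 0.0 for s in shares]

def soft_bce(targets, probs, eps=1e-7):
    """Mean Bernoulli cross-entropy between soft targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)

# Four annotators label one text; two of them also see sadness.
annotations = [{"joy"}, {"joy", "sadness"}, {"sadness"}, {"joy"}]
soft = vote_share(annotations)    # [0.75, 0.0, 0.5]
hard = majority_vote(soft)        # [1.0, 0.0, 0.0]

calibrated = [0.75, 0.02, 0.5]        # toy prediction that mirrors the vote shares
overconfident = [0.99, 0.01, 0.01]    # toy prediction that mirrors the hard labels
```

Here the 50/50 split on sadness disappears under majority voting, while `soft_bce(soft, calibrated)` comes out lower than `soft_bce(soft, overconfident)`, which is the kind of alignment with annotator variance that a thresholded F1 score does not capture.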

[NLP-185] Demographic Metadata as Construct-Irrelevant Noise in DistilBERT-Based Automated Essay Scoring

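[Quick Read]: This paper examines how combining essay text with student demographic metadata affects the accuracy and fairness of automated essay scoring (AES). Human grading is often influenced by students' background characteristics, yet the effect of different metadata-fusion strategies on the predictive performance and score parity of trained AES models remains underexplored. The paper studies one strategy, naive metadata concatenation, in which tokenized text and demographic metadata are concatenated and fed to a DistilBERT-based model. This early fusion significantly lowers predictive accuracy (QWK falls from 0.727 to 0.656), raises validation loss (1.29 vs. 1.25), and worsens scoring bias, with score-parity instances dropping from 15 to 12 out of 19 tests. The results suggest that naive concatenation is a poor way to integrate such metadata and that fusion strategies need to be reconsidered to keep AES systems reliable and fair.

Link: https://arxiv.org/abs/2606.21066
Authors: Teik Peng Ch’ng, Hui Na Chua
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: (none)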

Abstract:Automated Essay Scoring (AES) systems are increasingly used to support teachers in managing grading workloads and to provide a supplementary rater in large-scale assessments. While human grading is frequently influenced by students’ demographic characteristics, the efficacy of different strategies for integrating demographic metadata with textual input used to train AES models remains underexplored. This study investigates the impact of a specific multimodal fusion strategy - naive metadata concatenation - on the predictive accuracy, training convergence, and score parity of a DistilBERT-based AES model. A comparative analysis was conducted using the ASAP 2.0 dataset to evaluate a baseline model against an experimental model trained with input that concatenates tokenised text and demographic metadata using a naive multimodal fusion strategy. Evaluated via 10-fold cross-validation, the findings reveal that the early fusion of demographic metadata and the input significantly degrades the model’s overall predictive accuracy. The baseline model achieved a Quadratic Weighted Kappa (QWK) of 0.727, which dropped to 0.656 upon integrating metadata. Furthermore, the experimental model exhibited higher validation loss (1.29) compared to the baseline model (1.25). The experimental model also displayed exacerbated scoring bias, reducing score parity instances from 15 to 12 out of 19 tests.
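
Since the headline comparison is reported in Quadratic Weighted Kappa, a small reference computation may help. The sketch below follows the standard QWK definition (observed versus chance-expected disagreement, weighted by squared score distance). The 1 to 6 score range and the toy score lists are assumptions for illustration and are not taken from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two aligned lists of integer scores."""
    n = max_score - min_score + 1
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2           # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total       # agreement expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Note: undefined (division by zero) if both raters give one identical score throughout.
    return 1.0 - weighted_observed / weighted_expected

human = [1, 2, 3, 4, 5, 6]   # toy human scores
model = [1, 2, 3, 4, 5, 5]   # toy model scores, one essay off by one point
kappa = quadratic_weighted_kappa(human, model, 1, 6)
```

Because the penalty grows with the square of the score distance, a model that is off by one point is punished far less than one that is off by three, which is why QWK is a common choice for ordinal essay scores.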

[NLP-186] Event Ontology Expansion via LLM-Based Conceptualization

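[Quick Read]: This paper targets event ontology expansion, where contextualized trigger representations mix concept-level semantics with surface contextual variation, which makes clustering unstable and hierarchy expansion unreliable. The proposed framework, ConceptE, prompts a large language model (LLM) with the sentence and its event trigger to produce a concise concept name and a natural-language description, yielding robust, abstract concept-level semantics. These semantics are then jointly encoded with trigger information to build concept-enhanced representations aligned with ontology-level reasoning. The design improves the coherence of event clustering, the reliability of hierarchy expansion, and the ontology consistency of type naming. On ACE, ERE, and MAVEN, ConceptE improves BCubed-F1 by up to 12.37% for event clustering and Taxo_F1 by up to 6.48% for hierarchy expansion over state-of-the-art methods.

Link: https://arxiv.org/abs/2606.21048
Authors: Weicheng Ren, Zixuan Li, Long Bai, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: (none)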

Abstract:Event ontology expansion aims to discover emerging event types from data and extend them to appropriate positions in the existing event ontology… Existing methods typically cluster contextualized trigger representations and attach induced clusters to the ontology based on instance-level similarity. However, ontology expansion requires concept-level semantics that characterize event types, whereas contextualized trigger representations often conflate these semantics with surface contextual variation, leading to unstable clustering and unreliable hierarchy expansion. To address this issue, we propose ConceptE, a conceptualization-enhanced framework for event ontology expansion. ConceptE first derives concept-level semantics by prompting an LLM with the sentence and event trigger, producing a concise concept name and a natural-language description. It then jointly encodes these semantics with trigger information to build concept-enhanced representations aligned with ontology-level reasoning. This representation design supports more coherent event clustering, more reliable hierarchy expansion, and ontology-consistent type naming. Experiments on ACE, ERE, and MAVEN demonstrate that ConceptE consistently outperforms state-of-the-art approaches across all subtasks of event ontology expansion. In particular, it achieves improvements of up to 12.37% in BCubed-F1 for event clustering and 6.48% in Taxo_F1 for hierarchy expansion, demonstrating the effectiveness of the proposed ConceptE method.
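
The clustering result is reported in BCubed-F1, so a compact version of that metric is sketched below. It follows the usual BCubed definition (per-mention precision and recall, averaged, then combined). The mention labels and cluster ids are toy values, and the paper's exact evaluation script may differ in details.

```python
from collections import Counter

def bcubed_f1(predicted, gold):
    """predicted: cluster id per event mention; gold: gold event type per mention."""
    cluster_sizes = Counter(predicted)
    type_sizes = Counter(gold)
    overlap = Counter(zip(predicted, gold))   # mentions sharing both cluster and type
    n = len(gold)

    # Precision: share of a mention's cluster that has the same gold type.
    precision = sum(overlap[(c, t)] / cluster_sizes[c] for c, t in zip(predicted, gold)) / n
    # Recall: share of a mention's gold type that landed in the same cluster.
    recall = sum(overlap[(c, t)] / type_sizes[t] for c, t in zip(predicted, gold)) / n
    return 2 * precision * recall / (precision + recall)

gold = ["Attack", "Attack", "Attack", "Meet", "Meet"]   # toy gold event types
predicted = [0, 0, 1, 1, 1]                             # one Attack mention misplaced
score = bcubed_f1(predicted, gold)
```

A single misplaced mention lowers both precision (its cluster becomes mixed) and recall (its type becomes split), which is why unstable clusters are penalized twice under this metric.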

[NLP-187] Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers

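[Quick Read]: This paper asks whether the human-centered assumptions behind current cyber deception research still hold for autonomous attackers driven by generative AI. As AI attackers capable of autonomous decision-making emerge rapidly, deception defenses built on human behavior patterns may lose effectiveness. The authors build an automated evaluation framework on the Honeyquest instrument and test 21 large language model (LLM) attackers from 10 providers, covering diverse architectures and specializations, open- and closed-weight models, and parameter scales from 8 billion to over 1 trillion. Comparing the models with a 47-participant human baseline on the same 174 reconnaissance queries yields three findings: (1) every LLM falls for deceptive traps at a significantly higher rate than humans; (2) the defensive attention-diversion effect seen in humans is statistically absent in the LLM cohort; and (3) there is a critical recognition-action gap, in which LLMs articulate trap recognition in their reasoning yet exploit the deceptive elements anyway 73.4% of the time, and trap recognition in reasoning does not predict fell-for-trap behavior (Spearman r = +0.08, p = 0.73). The results indicate that human-centered deception theory does not reliably transfer to AI attackers and point to the need for AI-native active defense frameworks.

Link: https://arxiv.org/abs/2606.21037
Authors: Kerri Prinos, Lilianne Brush, Cameron Denton
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 20 pages, 4 figures, 2 tables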

Abstract:The empirical foundation of cyber deception relies on human-centered hypotheses, but the rapid emergence of autonomous, AI-enabled attackers challenges whether this foundation transfers to AI agents. To address this, we introduce an automated evaluation framework adapted from the Honeyquest instrument to assess LLM attacker judgment at scale. Our 21-LLM cohort spanned 10 providers, diverse architectures and specializations, open- and closed-weight models, and parameter scales from 8B to over 1T. We evaluated the performance of this LLM cohort (yielding 10,962 responses) against the 47-participant human baseline across an identical set of 174 reconnaissance queries. Our empirical evaluation reveals three key findings that establish LLMs as a distinct attacker class: (1) every model in our cohort falls for deceptive traps at a significantly higher rate than human attackers; (2) the defensive attention-diversion effect observed in humans is statistically absent in our LLM cohort; and (3) a critical recognition-action gap, where LLMs successfully articulate trap recognition in their reasoning but exploit the deceptive elements anyway 73.4% of the time. Across the 21 models, trap recognition in reasoning text did not predict fell-for-trap behavior (Spearman r = +0.08, p = 0.73). Ultimately, these findings demonstrate that human-centered deception hypotheses do not reliably transfer to AI attackers, highlighting the critical need for new research into AI-native active defense frameworks.

[NLP-188] The Metanym Game: A Self-Contained Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence

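[Quick Read]: This paper tackles what it calls a wicked problem in evaluating large language models (LLMs): how to assess the factual accuracy of generated content reliably without golden keys or oracle models. Conventional benchmarks are often undermined by test data leaking into training (contamination). The paper proposes the metanym game, a competitive word game in which models generate statements themselves and judge one another, giving an analogy generation and verification framework that has no fixed test set and therefore resists contamination. The key component is what the authors describe as the first known spectral solution: a singular value decomposition (SVD) of the evaluators' ratings matrix quantifies each model's competence both as a generator and as a judge of true statements, so the two abilities can be assessed separately, alongside a consistency-based measure of competence on the subjective criteria. The factual rating correlates with GPQA Diamond at Pearson r = 0.92, and judging turns out to be the scarcer skill: the strongest generators are middling judges, while the sharpest judge is a mid-pack generator. To scale, a council of peers made up of the strongest models performs the official benchmarking, and its seats are contestable through a model's own rating, which keeps the benchmark self-contained, self-consistent, and stable over time.

Link: https://arxiv.org/abs/2606.21008
Authors: David Nordfors
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 78 pages (main text + four appendices: full generation/evaluation prompts, the anchor submission, and a complete worked council-evaluation example), 1 figure, 13 tables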

Abstract:The metanym game is a competitive word game for LLMs that measures structural intelligence against established cognitive-science constructs. No content is given in advance; the contestants create all of it – a new kind of analogy test, analogical production falsifiable sentence by sentence, with no fixed test set to leak into training (contamination-resistant by construction). In the council-of-peers benchmark, the contestants also rate each other’s creations. We introduce the first spectral solution, to our knowledge, to the wicked problem of benchmarking LLMs’ factual accuracy without golden keys or oracle models: one singular value decomposition of the evaluators’ ratings matrix yields their competence as both generators and judges of true statements at once. Competence on the subjective criteria comes from each judge’s rating consistency as the yardstick shifts. The factual rating correlates with GPQA Diamond at Pearson r = 0.92. Scored separately, making and judging dissociate – judging is the scarcer skill: the strongest generators are middling judges, the sharpest judge a mid-pack generator. To scale, the strongest players form a council that does the official benchmarking; its seats are contestable – a stronger model earns one on the benchmark’s own rating. The benchmark is entirely self-contained and self-consistent, a stable gauge over time.
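
How a single decomposition can produce scores for both roles is easier to see on a toy matrix. The sketch below approximates the leading singular pair of a ratings matrix with power iteration in plain Python. It is only an illustration of the general spectral idea: the matrix values are invented, the rows-as-judges layout is an assumption, and the paper's actual construction and its interpretation of the vectors are not specified in the abstract.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def leading_singular_pair(ratings, iterations=100):
    """Power iteration for the top singular vectors of ratings[judge][generator]."""
    n_judges, n_generators = len(ratings), len(ratings[0])
    right = normalize([1.0] * n_generators)
    for _ in range(iterations):
        # Left vector: one weight per judge (row).
        left = normalize([sum(ratings[i][j] * right[j] for j in range(n_generators))
                          for i in range(n_judges)])
        # Right vector: one score per generator (column).
        right = normalize([sum(ratings[i][j] * left[i] for i in range(n_judges))
                           for j in range(n_generators)])
    return left, right

# Toy data: three models rate statements produced by each of the three models.
ratings = [
    [0.9, 0.6, 0.3],
    [0.8, 0.5, 0.2],
    [0.7, 0.6, 0.5],
]
judge_weights, generator_scores = leading_singular_pair(ratings)
```

In this toy case the first column receives the highest ratings from every judge, so `generator_scores` ranks model 0 first. The same factorization also returns `judge_weights`, which is what makes it possible to score the two roles separately from one matrix.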

[NLP-189] Building Agent Harnesses for Scientific Curation from Multimodal Sources

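[Quick Read]: This paper addresses structured curation from multimodal literature (long text, dense tables, and figures) in scientific discovery workflows, where current agents struggle to integrate evidence across modalities and to reason over multiple evidence fragments. The proposed solution is Beaver, an agent harness that combines a frontier agent with multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. The harness turns curation into a staged, auditable workflow and supports an iterative evaluate-diagnose-revise loop in which persistent run artifacts expose stage-localized failures and guide harness updates. Beaver reaches 81.0 on the Gold-Referenced Attribute Score (GRAS), over 23 absolute points above frontier agents. Ablations show that task scaffolding, multimodal evidence tooling, and provenance traces each contribute meaningfully, with the largest gains on high-value attributes that require cross-modal reasoning and normalization. The results indicate that harness design is a central determinant of agent performance for curation from papers with multimodal evidence.

Link: https://arxiv.org/abs/2606.21005
Authors: Sheng Zhang, Qin Liu, Renqian Luo, Shufang Xie, Reuben Tan, Sean Hayes, Gregory Bryman, Wendong Ge, Roxy Zhang, Oluwaseun Egbelowo, Kelly Yee, Hoifung Poon
Affiliations: Microsoft Research; University of California, Davis; Merck & Co., Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: (none)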

Abstract:Scientific discovery workflows often depend on structured curation from the literature. This is difficult for current agents because the key evidence is scattered across long text, dense tables, and figures, and the final records often require reasoning across multiple evidence fragments rather than copying a single span. We study scientific curation from multimodal sources and introduce Beaver, an agent harness that extracts structured information from scientific papers while preserving provenance to the supporting evidence. Beaver combines a frontier agent with multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. These components turn curation into a staged, auditable workflow and enable an iterative evaluate–diagnose–revise loop, where persistent run artifacts expose stage-localized failures and guide harness updates. Experiments show that Beaver reaches 81.0 on Gold-Referenced Attribute Score (GRAS), an attribute-level measure of agreement with gold curated records, outperforming frontier agents by over 23 absolute points. Ablations show that task scaffolding, multimodal evidence tooling, and provenance traces each contribute meaningfully to performance, while attribute-level analysis shows the largest gains on high-value attributes that require cross-modal reasoning and normalization. These results show that, for scientific curation from papers with multimodal evidence, harness design is a central determinant of agent performance.

[NLP-190] Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet

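[Quick Read]: This paper addresses uneven cross-lingual performance in multilingual language models, which can be traced back to the tokenization stage. Widely used subword tokenizers favor high-resource languages, and tokenizer-free methods still produce longer sequences for scripts with a high bytes-per-character ratio. The paper proposes the International Phonetic Alphabet (IPA) as a language-agnostic input representation for multilingual tokenizers. IPA offers a compact symbol inventory, greater cross-lingual character overlap, and a more balanced byte-per-character distribution, which allows a unified representation across languages and scripts. Using matched pairs of text and IPA subword tokenizers trained on 24 languages and 14 scripts, the experiments show that IPA tokenizers improve tokenization quality, especially for non-Latin scripts, and generalize better to unseen languages and scripts.

Link: https://arxiv.org/abs/2606.20993
Authors: Milan Miletić, Julie Kallini, Ekaterina Shutova
Affiliations: University of Amsterdam; Stanford University
Categories: Computation and Language (cs.CL)
Comments: (none)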

Abstract:Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely-used subword tokenization approaches favor high-resource languages, and tokenizer-free methods still yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose to use the International Phonetic Alphabet (IPA) as a language-agnostic input representation for multilingual tokenizers. IPA provides a compact symbol inventory, greater cross-lingual character overlap, and a more balanced byte-per-character distribution across languages. We train matched pairs of text vs. IPA subword tokenizers across 24 languages and 14 scripts and demonstrate that IPA tokenizers consistently improve tokenization quality, especially for non-Latin scripts, and generalize more effectively to unseen languages and scripts.
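
The bytes-per-character disparity mentioned in the abstract is easy to check directly. The snippet below measures UTF-8 bytes per Unicode code point for the word "language" in three scripts, plus a hand-written IPA transcription of the English word. The sample words and the transcription are chosen for this illustration and do not come from the paper's data.

```python
def bytes_per_character(text):
    """UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "English": "language",
    "Russian": "язык",
    "Hindi": "भाषा",
    "IPA (English)": "ˈlæŋɡwɪdʒ",   # hand-written transcription, for illustration only
}

ratios = {name: bytes_per_character(word) for name, word in samples.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} bytes per character")
```

Latin text costs 1 byte per character, Cyrillic 2, and Devanagari 3, so byte-level models see sequences up to three times longer for the same number of characters. The IPA string mixes 1-byte and 2-byte symbols and lands between 1 and 2.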

[NLP-191] Right Knowledge Wrong Answer: Test-Time Steering for Temporal Fact Conflicts in Open-Weight Language Models

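[Quick Read]: This paper addresses Parametric Temporal Conflict (PTC) in large language models: a model stores both an outdated fact and the newer fact that supersedes it, yet standard prompting often still elicits the outdated answer. The proposed solution, Temporal Attractor Steering (TAS), is a three-stage test-time intervention that detects likely temporal conflicts, identifies a conflict-critical layer, and steers hidden states toward newer-fact representations, with no fine-tuning or external retrieval. The authors build a verified benchmark of 8,746 records across five Wikidata relations and evaluate four open-weight models from three families (Qwen-2.5-1.5B/7B, Mistral-7B-v0.3, and Llama-3.1-8B). Single-layer activation patching reaches answer-flip rates of 0.72 to 0.85. End-to-end TAS outperforms a matched ITI baseline on three of the four models, resolving 29% to 57% of PTC cases while keeping 85% to 99% accuracy on non-conflict queries, which shows that outdated parametric knowledge can be selectively overridden at inference time.

Link: https://arxiv.org/abs/2606.20959
Authors: Elias Hossain, Sourav Saha, Umesh Chandra Biswas, Sanjeda Sara Jennifer
Affiliations: University of Central Florida; Mississippi State University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: (none)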

Abstract:Large language models can store both outdated facts and newer superseding facts in their parameters, but standard prompting may still elicit the outdated answer. We formalize this problem as Parametric Temporal Conflict (PTC) and introduce Temporal Attractor Steering (TAS), a three-stage test-time intervention that detects likely conflicts, identifies a conflict-critical layer, and steers hidden states toward newer-fact representations without retraining or external retrieval. We construct an 8,746-record verified benchmark across five Wikidata relations and evaluate four open-weight language models from three families: Qwen-2.5-1.5B/7B, Mistral-7B-v0.3, and Llama-3.1-8B. Single-layer activation patching achieves answer-flip rates of 0.72-0.85 across all models. End-to-end TAS resolves 29-57% of PTC cases while preserving 85-99% accuracy on non-conflict queries, outperforming a matched ITI baseline on three of four models. These results show that outdated parametric knowledge can be selectively overridden at inference time.
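
The abstract does not give the steering rule itself, so the following is a generic difference-of-means activation steering sketch rather than TAS as published. All names (`steering_direction`, `steer`, `alpha`) and the two-dimensional toy hidden states are assumptions made for illustration. The sketch only shows the overall shape of "detect a conflict, then shift the hidden state at one layer toward newer-fact representations".

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_direction(newer_states, outdated_states):
    """Direction from the outdated-fact mean to the newer-fact mean at one layer."""
    new_mean = mean_vector(newer_states)
    old_mean = mean_vector(outdated_states)
    return [a - b for a, b in zip(new_mean, old_mean)]

def steer(hidden, direction, alpha, conflict_detected):
    """Shift the hidden state only when a temporal conflict was flagged."""
    if not conflict_detected:
        return list(hidden)          # leave non-conflict queries untouched
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy hidden states collected at a hypothetical conflict-critical layer.
newer_states = [[1.0, 0.0], [3.0, 2.0]]
outdated_states = [[0.0, 1.0], [2.0, 1.0]]
direction = steering_direction(newer_states, outdated_states)   # [1.0, 0.0]

steered = steer([0.5, 0.5], direction, alpha=2.0, conflict_detected=True)
untouched = steer([0.5, 0.5], direction, alpha=2.0, conflict_detected=False)
```

Gating the shift on the detector is what lets an intervention of this kind preserve accuracy on non-conflict queries, which is the trade-off the reported 85% to 99% figure measures.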

[NLP-192] Learning What Not to Forget: Long-Horizon Agent Memory from a Few Kilobytes of Learning

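[Quick Read]: This paper addresses memory eviction in long-running language-model systems, whose interaction history outgrows the context window and must be evicted continually. When an eviction policy drops information the current task depends on, such as an access token issued at login or a path needed by the next call, the action fails. The proposed solution is LRE (Learned Relevance Eviction), a relevance scorer of only a few kilobytes that runs on CPU alone and needs no language model. LRE learns which units of history are load-bearing and keeps them by verbatim extraction. Under a matched budget, no baseline dominates LRE on the accuracy-cost plane. On agent tasks, LRE matches the overall accuracy of keeping the entire history while cutting peak context size by up to 52% with zero compressor calls, and on the simplest tasks it exceeds the no-eviction baseline by 27%. In a controlled study, LRE completes tasks on which other policies loop, finishes one such task in 37% fewer calls, and solves 14 tasks that no other policy solves. On conversational memory, LRE outranks dense and token-pruning encoders at zero neural cost, and in downstream evaluation on LoCoMo it gives the best budgeted answer quality while reading 68% fewer tokens. Supervision can also be annotation-free: training only on the system's own behavior recovers 95% of the supervised scorer's effectiveness. The authors argue that memory eviction in LLM agents is a fidelity problem that calls for a deployable proactive policy, one that works when the future query is unavailable and exact state is decisive, and that cheap learned relevance can be sufficient.

Link: https://arxiv.org/abs/2606.20954
Authors: Nusrat Jahan Lia, Aritra Mazumder
Affiliations: University of Dhaka; University of Utah
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)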

Abstract:Long-running language-model systems accumulate interaction history that outgrows the context window, so they must continually evict. When an eviction policy drops a load-bearing detail, for example an access token issued at login or a path the next call needs, the action fails. We present LRE (Learned Relevance Eviction), a few kilobytes, CPU-only, language-model-free scorer that learns which units of history are load-bearing and keeps them by verbatim extraction. Under a matched-budget comparison, in our experiment, no baseline dominates LRE on the accuracy-cost plane. On agents, LRE matches the accuracy of keeping the entire history overall. On the simplest tasks, it exceeds that no-eviction baseline by 27%, while requiring zero compressor calls and reducing peak context size by up to 52%. A controlled study trace shows LRE completes tasks where the others loop, finishing one such task in 37% fewer calls than keeping everything and solving 14 tasks where no other run policy does. On conversational memory, LRE outranks dense and token-pruning encoders at zero neural cost. In downstream evaluation, LRE gives the best budgeted answer quality on LoCoMo reading 68% fewer tokens. Its supervision can also be annotation-free: training only on the system’s own behavior recovers 95% of the supervised scorer’s effectiveness. We argue that, because memory eviction in LLM agents is a fidelity problem, it requires a deployable proactive policy where the future query is unavailable and exact state is decisive, and that cheap learned relevance can be sufficient.
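
The eviction loop described here (score each unit of history, keep the load-bearing ones verbatim, stay within a budget) can be sketched as follows. The scorer below is a hand-set stand-in: LRE learns its relevance function, whereas the features, weights, and word-count budget in this example are hypothetical and exist only to make the control flow runnable.

```python
import re

# Hypothetical hand-set weights; the paper's scorer is learned, not hard-coded.
WEIGHTS = {"has_token": 3.0, "has_path": 2.0, "is_recent": 1.0}

def features(unit, index, total):
    return {
        "has_token": bool(re.search(r"\btoken\b", unit, re.IGNORECASE)),
        "has_path": bool(re.search(r"/\w+", unit)),
        "is_recent": index >= total - 2,
    }

def score(unit, index, total):
    feats = features(unit, index, total)
    return sum(WEIGHTS[name] for name, active in feats.items() if active)

def evict(history, budget_words):
    """Keep the highest-scoring units verbatim until the word budget is used up."""
    ranked = sorted(range(len(history)),
                    key=lambda i: score(history[i], i, len(history)), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = len(history[i].split())
        if used + cost <= budget_words:
            kept.add(i)
            used += cost
    return [history[i] for i in sorted(kept)]   # original order is preserved

history = [
    "login ok, access token abc123 issued",
    "user asked about the weather and small talk followed",
    "listed files under /data/reports",
    "model replied with a long greeting",
    "next step: download the report",
]
context = evict(history, budget_words=15)
```

With a 15-word budget the small talk and the greeting are dropped, while the login token and the file path survive unchanged. Verbatim extraction matters here because a summarized token or path would no longer be usable by the next call.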

[NLP-193] Scaling Diverse Language Generation for 3D Visual Grounding

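[Quick Read]: This paper addresses weak generalization in 3D visual grounding (3DVG): existing models struggle to move beyond simple linguistic patterns and to localize entities in 3D scenes from complex, varied natural-language descriptions. The core obstacle is the lack of large-scale, diverse descriptions with rich constraint types, which limits training to a narrow range of expressions and hurts performance when similar objects must be told apart. The paper proposes ViGiL3D++, a scalable, scene-agnostic generation method that builds diverse grounding queries by combining constraint sampling in scene graphs with the language generation ability of large language models (LLMs). The key is pairing structured scene knowledge (constraints such as attributes and relations) with the flexibility of generative language models to produce diverse queries. Experiments show that ViGiL3D++ improves model performance on several 3DVG benchmarks and exposes remaining limitations of current vision-language models (VLMs).

Link: https://arxiv.org/abs/2606.20946
Authors: Austin T. Wang, Dongchen Yang, Angel X. Chang
Affiliations: Simon Fraser University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 39 pages, 14 figures, 16 tables. Project Page: this https URL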

Abstract:Developing robust models for 3D visual grounding (3DVG), the localization of entities in a 3D scene described in natural language, is important for enabling agents to correspond spatial language with objects in the physical world. However, the lack of diverse descriptions at scale prevents models from generalizing beyond simple linguistic patterns. Recent such attempts lack diversity in the constraint types and language used to ground objects. Captioning methods cannot precisely contrast objects, which is important for visual grounding. We therefore propose ViGiL3D++, a scalable, scene-agnostic method that generates diverse visual grounding queries by combining constraint sampling in scene graphs with the language generation of LLMs. We show that it has greater diversity over existing scaled datasets and improves model performance over several 3DVG benchmarks but also illuminates outstanding limitations of VLMs.

[NLP-194] Comparing Transformers and Hybrid Models at the Token Level

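[Quick Read]: This paper investigates what drives the performance gains of hybrid language models: which data properties or capabilities make hybrids outperform pure-attention models such as Transformers, and how far those gains reflect the theoretical advantages that motivate the design. The approach compares the loss of a matched Transformer and hybrid on the same target tokens under the same prefixes, stratified by natural token tags, copy features, delimiter structure, and controlled synthetic probes. The hybrid's loss reduction is largest on open-class content words and smaller on closed-class function words. Across prose, code, and markup, its advantage is larger on opening delimiters than on the corresponding closing delimiters, and it nearly vanishes on repeated n-grams. Synthetic probes show the same split: the hybrid is favored on pronoun-memory and entity-tracking tasks, while the Transformer is favored on bracket-matching tasks that require choosing closing delimiters. This suggests that recurrent layers improve predictions that depend on a document's semantic state, whereas attention helps on tokens predictable by n-gram copying or syntactic bracket matching. The paper closes with proof-of-concept filtered evaluations showing how token-level decompositions can sharpen pretraining diagnostics for hybrid architectures.

Link: https://arxiv.org/abs/2606.20936
Authors: Yanhong Li, William Merrill
Affiliations: Allen Institute for AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)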

Abstract:Hybrid language models that mix attention and recurrent layers have shown promise: theoretically, recurrent layers ameliorate the limitations of pure transformers on state tracking, and empirically, hybrids can outperform pure transformers in loss and downstream evaluations \citep{waleffe2024empirical,merrill2026olmohybrid}. Yet it remains unclear which data or capabilities drive these gains, and to what degree they reflect the theoretical advantages motivating hybrid models. We address this question using the open weights from Olmo 3 \citep{olmo2025olmo3} and Olmo Hybrid \citep{merrill2026olmohybrid}: we compare the loss of a matched transformer and hybrid at the same target tokens under the same prefixes, stratifying the results by natural token tags, copy features, delimiter structure, and controlled synthetic probes. The hybrid has lower loss on most tag families, but the gains are not uniform: they are largest for open-class content words and smaller for many closed-class function words. Across prose, code, and markup, the hybrid’s loss advantage is larger on opening delimiters than on the corresponding closing delimiters, and nearly vanishes on repeated n-grams. Synthetic probes show the same split: the hybrid is favored on pronoun-memory and entity-tracking tasks, whereas the transformer is favored on bracket-matching tasks that require choosing closing delimiters. These patterns suggest that the recurrent layers in hybrids improve predictions that leverage the semantic state of a document, whereas attention helps on tokens predictable by n-gram copying or syntactic bracket matching. We conclude with proof-of-concept filtered evaluations showing how token-level decompositions can sharpen pretraining diagnostics for hybrid architectures.
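
The token-level comparison boils down to pairing the two models' losses on the same target token and averaging the gap within each tag. A minimal sketch follows. The tag names and loss values are invented toy numbers, not results from the paper, and the function name is hypothetical.

```python
from collections import defaultdict

def mean_loss_gap_by_tag(records):
    """records: (tag, transformer_loss, hybrid_loss) for the same target token and prefix."""
    sums = defaultdict(lambda: [0.0, 0])
    for tag, transformer_loss, hybrid_loss in records:
        sums[tag][0] += transformer_loss - hybrid_loss   # positive gap favors the hybrid
        sums[tag][1] += 1
    return {tag: total / count for tag, (total, count) in sums.items()}

# Toy per-token losses, for illustration only.
records = [
    ("NOUN", 3.5, 3.0),
    ("NOUN", 4.0, 3.5),
    ("DET", 1.0, 1.0),
    ("CLOSE_BRACKET", 0.5, 0.75),
]
gaps = mean_loss_gap_by_tag(records)
```

Pairing on identical prefixes and targets removes differences in what each model was asked to predict, so any remaining gap within a tag can be attributed to the architecture rather than to the data.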

[NLP-195] Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

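[Quick Read]: This paper addresses the unreliability of large language model (LLM) outputs in the legal domain caused by hallucination, and in particular how to detect whether an LLM's output is correct in high-stakes settings. The key idea is to use internal artifacts produced by the LLM (for example attention distributions, hidden states, or activation patterns) as features for downstream classifiers that flag incorrect predictions. Evaluated on two representative legal classification tasks, bail decision prediction and statute violation prediction, these internal features prove to be reliable indicators of prediction correctness, offering a way to make LLM-based legal classification systems more trustworthy.

Link: https://arxiv.org/abs/2606.20929
Authors: Sudipta Santra, Debtanu Datta, Saptarshi Ghosh
Affiliations: Indian Institute of Technology Kharagpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the International Workshop on Automated Semantic Analysis of Information in Law (ASAIL) 2026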

Abstract:Large Language Models (LLMs) are increasingly being adopted in the legal domain. However, despite their strong performance, LLMs are prone to generating incorrect or hallucinated outputs, raising serious concerns about their reliability in high-stakes domains such as law. Detecting the correctness of responses of LLM-based systems is therefore a critical challenge. In this work, we explore the potential of leveraging internal artifacts of LLM to detect the correctness of their predictions in legal-domain classification tasks. We develop approaches that utilize features derived from these internal artifacts to build downstream classifiers capable of identifying incorrect LLM outputs. We evaluate our approach on two representative legal classification tasks: bail decision prediction and statute violation prediction. Our experimental results demonstrate that LLMs’ internal artifacts are reliable indicators for detecting incorrect predictions in legal classification tasks, and can be applied to enhance the reliability of LLM-based classification systems.

[NLP-196] Latent Personal Memory: Represent personal memory as dynamic soft prompts

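[Quick Read]: This paper addresses how to encode long-term user behavior patterns for personalizing large language models (LLMs) efficiently and scalably while staying compatible with a frozen base model. Existing approaches suffer from high compute cost, large parameter counts, or inefficient use of context. The proposed framework, Latent Personal Memory (LPM), abstracts a user's history into a compact, persistent, interpretable matrix of N latent slots. A shared cross-attention projection network maps the slots into dynamic, input-conditioned soft prompts that are prepended to the input sequence of the frozen LLM. On PersonaMem v1, LPM outperforms LoRA and Prompt Tuning by up to 8.8% and 54.4% in overall accuracy, respectively, while cutting KV-cache usage by more than 64x; on LoCoMo it matches LoRA with 120x fewer trainable parameters. LPM's efficiency grows with context length, and it outperforms full-context processing at a context length of 128K.

Link: https://arxiv.org/abs/2606.20911
Authors: Debrup Das, Avinash Amballa, Yashas Malur Saidutta, Vijay Srinivasan, Vivek Kulkarni, Srinivas Chappidi
Affiliations: University of Massachusetts Amherst; Samsung Research America
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures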

Abstract:Personalizing large language models (LLMs) requires encoding long-term, user-specific behavioral patterns in a way that is computationally efficient, scalable, and compatible with a frozen base model. We present Latent Personal Memory (LPM), a scalable framework that represents user-specific history as a compact, persistent matrix of N interpretable latent slots. A shared cross-attention projection network maps these slots into dynamic, input-conditioned soft prompts that are prepended to the input of a frozen LLM. We evaluate LPM on the PersonaMem v1 and LoCoMo benchmarks across Qwen3-1.7B, 4B, and 8B backbones. Results demonstrate that LPM outperforms LoRA and Prompt Tuning by up to 8.8% and 54.4% in overall accuracy respectively on PersonaMem v1, while reducing KV-cache usage by over 64x. On LoCoMo, LPM matches LoRA accuracy with 120x fewer trainable parameters. We also show that the efficiency of LPM grows with context length and outperforms full-context at 128K context length.

[NLP-197] Storyline Trees: Hierarchical Representations for Long-Form Narratives

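[Quick Read]: This paper addresses the difficulty long-context models have with long-form narratives, whose structure is implicit: events, characters, and plotlines interweave across hundreds of pages without the explicit navigation cues of structured documents. The paper builds storyline trees, hierarchical representations that organize a narrative from global themes and major plotlines down to fine-grained events. Chapters are first segmented into contiguous narrative units (scenes), which serve as the basic units of the tree; complementary top-down and bottom-up procedures then derive, refine, cluster, and summarize storylines at multiple levels of abstraction. The representation supports adaptive retrieval, in which a model iteratively inspects high-level narrative structure and retrieves scene-level evidence on demand. On three long-context narrative QA benchmarks, adaptive retrieval outperforms strong baselines, including post-trained long-context models and agentic chunk-based methods. Ablations confirm that scenes are more effective basic units than chapters or generic segmentation, and that the gains persist under matched retrieval budgets.

Link: https://arxiv.org/abs/2606.20900
Authors: Litu Ou, Mirella Lapata
Affiliations: University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: (none)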

Abstract:Long-form narratives are challenging for long-context models because their structure is implicit: events, characters, and plotlines interact across hundreds of pages without the explicit cues that guide navigation in structured documents. We address this by constructing storyline trees, hierarchical representations that organize narratives from global themes and major plotlines to fine-grained events. We first segment chapters into contiguous narrative segments, or scenes, and use them as the basic units for tree construction. We then infer storyline trees through complementary top-down and bottom-up procedures that derive, refine, cluster, and summarize storylines at multiple levels of abstraction. We showcase the utility of this representation for question answering: storyline trees enable adaptive retrieval, allowing models to iteratively inspect high-level narrative structure and retrieve scene-level evidence on demand. Experiments on three long-context narrative QA benchmarks show that adaptive retrieval outperforms strong baselines, including post-trained long-context models and agentic chunk-based methods. Ablations confirm that scenes are more effective basic units than chapters or generic segmentation, and that gains persist under matched retrieval budgets.

[NLP-198] PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality

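[Quick Read]: This paper addresses the gap in quality between reviews generated by large language models (LLMs) and human-written academic reviews. As submission volumes grow, traditional peer review struggles to keep up, and although LLMs are increasingly used for assistance, their reviews still differ from human ones in focus, depth, and accuracy. The proposed PeerCheck framework analyzes the differences between LLM and human reviews (RQ1) and explores ways to improve LLM-generated review quality (RQ2). The main techniques are Chain-of-Thought (CoT) prompting, used to strengthen reasoning, and retrieval-augmented generation (RAG), used to enrich the available information. CoT significantly improves review quality, but the study also uncovers an unexpected "RAG paradox": RAG affects different LLMs differently and in some cases lowers review quality. The study provides empirical evidence and directions for building review systems that are better aligned with human reviewing.

Link: https://arxiv.org/abs/2606.20897
Authors: Zeyuan Chen, Ziqing Yang, Yihan Ma, Michael Backes, Yang Zhang
Affiliations: CISPA Helmholtz Center for Information Security
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)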

Abstract:As academic submissions grow, the traditional peer review process struggles to keep up, raising concerns about quality and fairness. A trend of using large language models (LLMs) for assistance has emerged. In this work, we take a critical step toward improving the quality of LLM-generated reviews. We propose the PeerCheck framework, which investigates LLM-human review differences (RQ1) and explores methods to improve LLM-generated review quality (RQ2). We first compare human-written reviews with reviews generated by various LLMs and find that LLMs and humans focus on different terms, e.g., LLMs prioritize theory while humans emphasize methodology and experiments. We further adopt prompt engineering, such as Chain-of-Thought (CoT), and utilize retrieval-augmented generation (RAG) to enhance the LLM-generated reviews towards human-level quality. We find CoT significantly improves the quality of LLM reviews, while we discover an unexpected “RAG paradox,” i.e., experiments with RAG produce different results for various LLMs and, in some cases, even reduce review quality. Our comprehensive analysis of LLM-generated academic reviews illustrates both possibilities and limitations, contributing to a more effective, human-aligned review system. Our dataset is available on this https URL.

[NLP-199] SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding KDD2026

【速读】: 该论文旨在解决科学主张验证中因多模态证据(如表格与图表)复杂结构与主张内在语义多样性导致的准确性与可解释性不足问题。现有方法依赖视觉-语言模型进行直接二元判断,难以捕捉主张中数值结果、比较关系、范围限定及解释性上下文等细粒度语义要素,且无法有效对齐证据的结构性信息。其解决方案的关键在于提出SciLens——一种基于证据条件的原子蕴含框架,通过将科学主张分解为“核心实证原子”与“背景原子”,并分别在表格或图像中以行、列、单元格、算术关系、图例、坐标轴、视觉编码、趋势等模态特异性证据痕迹(evidence witness)进行精确锚定,再依据原子级蕴含规则综合判定最终标签。该方法实现了统一的结构化验证流程:仅当所有核心实证原子均被当前证据所蕴含时,主张才被视为支持。在SciClaimEval开发集上,SciLens达到79.2%的宏平均F1和63.1%的配对准确率,证明了结构化代理式验证显著提升了对证据的敏感性与决策过程的可解释性。

Link: https://arxiv.org/abs/2606.20873
Authors: Yueming Wang, Tianshi Zheng, Jiaxin Bai, Yangqiu Song, Ginny Wong, Simon See
Affiliations: The Hong Kong University of Science and Technology; Hong Kong Baptist University; NVIDIA AI Technology Center
Categories: Computation and Language (cs.CL)
Comments: KDD 2026 SciSoc Agents LLMs (Oral)

Abstract:Scientific discovery increasingly relies on automated systems that generate hypotheses, inspect multimodal evidence, and validate claims at scale. Yet scientific claim verification is not well served by asking a vision-language model for a direct binary judgment: claims often combine numerical results, comparisons, scope qualifiers, and explanatory context, while evidence is encoded in tables and figures with distinct grounding structures. We present SciLens, an evidence-conditioned atomic entailment framework for multimodal scientific claim verification. SciLens decomposes each claim into central empirical atoms and background atoms, grounds the central atoms to modality-specific evidence witnesses, and predicts the final label with an atom-level entailment rule. For tables, atoms are grounded to rows, columns, cells, arithmetic relations, and table scope; for figures, they are grounded through panels, axes, legends, visual encodings, categories, trends, ranks, and qualifier checks. This yields a unified validation procedure in which a claim is supported only if every central empirical atom is entailed by the current evidence. On the SciClaimEval development set, SciLens achieves 79.2% macro-F1 and 63.1% pair accuracy, showing that structured agentic validation improves both evidence sensitivity and interpretability.
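
The atom-level entailment rule described in this abstract can be sketched as below. This is a minimal illustration rather than the authors' implementation: the `Atom` class, its `entailed` flag (standing in for the output of an upstream grounding step), and the label strings are all hypothetical names introduced here, and treating a claim with no central atoms as unsupported is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    text: str
    central: bool   # True for a central empirical atom, False for a background atom
    entailed: bool  # hypothetical verdict produced by an upstream grounding step


def claim_label(atoms: list[Atom]) -> str:
    """Label a claim as supported only if every central atom is entailed."""
    central = [a for a in atoms if a.central]
    # Assumption: a claim without any central atom is not treated as supported.
    if not central:
        return "not supported"
    # Background atoms are deliberately ignored by the rule.
    return "supported" if all(a.entailed for a in central) else "not supported"
```

Under this sketch, a single central atom that fails to be entailed flips the label, while a failed background atom does not.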

[NLP-200] Beyond One Language One Script: Quantifying Orthographic Bias in Multilingual VLMs with PuMVR ACL2026

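[Quick read]: This paper addresses script bias in current Vision-Language Models (VLMs) in multilingual settings: models generally assume that each language maps to a single writing system, overlooking the practical challenges faced by the many speakers of multi-script languages such as Punjabi, Serbian, and Hindi-Urdu. The core problem is that existing models perform markedly differently across scripts, which leads to inequitable service for users of multi-script languages. The key to the solution is PuMVR (Punjabi Multimodal Visual Reasoning), the first benchmark targeting multi-script languages, comprising 375 culturally grounded image-reasoning tasks across Punjabi's three active scripts (Gurmukhi, Shahmukhi, Roman). Evaluating 10 mainstream VLMs, the study exposes a substantial Script Gap: accuracy differences across scripts reach 16%, and Script Consistency Rates (SCR) fall as low as 24.8%. Visual input raises overall performance but does not remove the relative bias, indicating that the models' reasoning paths transfer poorly across scripts. The paper therefore proposes SCR as a core evaluation metric to promote script-agnostic, equitable evaluation and to provide a methodological basis for AI that genuinely supports multiple languages and scripts.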

Link: https://arxiv.org/abs/2606.20770
Authors: Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 4 figures. Accepted to the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP) @ ACL 2026

Abstract:Current Vision-Language Models (VLMs) are celebrated for their multilingual capabilities, yet they operate under a flawed assumption: that one language corresponds to a single writing system. This overlooks billions of users of multi-script languages such as Punjabi, Serbian, Hindi-Urdu, and Kurdish, among many others, for whom a model’s capability may be fractured by orthographic bias. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), the first benchmark designed to quantify script-dependent bias through 375 culturally grounded image-reasoning tasks across Punjabi’s three active scripts (Gurmukhi, Shahmukhi, Roman). Evaluating 10 state-of-the-art VLMs, we expose a substantial Script Gap: models frequently solve visual puzzles in one script while failing identical tasks in another, with accuracy deltas reaching 16% and Script Consistency Rates (SCR) as low as 24.8%. Crucially, visual input boosts absolute performance but does not close this gap: the relative bias persists. Our analysis suggests reasoning patterns show limited cross-script transferability, and Chain-of-Thought pathways diverge based on script alone. We propose SCR as a core metric for script-agnostic evaluation, challenging current multilingual assessment paradigms and providing a framework for equitable AI.
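
The abstract does not spell out how the Script Consistency Rate is computed, so the sketch below rests on an assumption: SCR is taken to be the fraction of tasks for which a model gives the same answer in every script. The function name, the input layout, and this definition are hypothetical and may differ from the paper's.

```python
def script_consistency_rate(answers: dict[str, dict[str, str]]) -> float:
    """Fraction of tasks answered identically across all scripts.

    `answers` maps a task id to {script name -> model answer}.
    Assumed definition; the paper's exact formula is not given in the abstract.
    """
    if not answers:
        return 0.0
    consistent = sum(
        1 for per_script in answers.values()
        if len(set(per_script.values())) == 1  # one distinct answer across scripts
    )
    return consistent / len(answers)
```

A variant that additionally requires the shared answer to be correct would be equally plausible; the choice changes what a low SCR means.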

[NLP-201] FirstPass: Grounding AI Scientific Judgment in Multi-Round Editorial Outcomes ICML2026

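[Quick read]: This paper addresses three limitations of current generative AI systems for peer review: training data is restricted to Computer Science and Machine Learning venues and lacks cross-disciplinary coverage; the multi-round dialogue that is central to scientific validation is ignored; and evaluation emphasizes stylistic mimicry rather than real editorial judgment. In response, the study presents FirstPass, a high-quality dataset of 3,668 complete multi-round peer-review dialogues together with a fine-tuned model. It draws on the mandatory transparent peer review that Nature Communications introduced in November 2022 to ensure the dialogues are complete, and verifies 100% content integrity by automated audit. The authors fine-tune Qwen2.5-7B-Instruct with Low-Rank Adaptation (LoRA) on three core tasks: review generation, reviewer updating, and revision-cycle prediction. The key finding is that response-only loss masking is a prerequisite for performance rather than an optimization: without it, accuracy is only 62.0%, below the majority-class baseline; with it, FirstPass reaches 80.5% accuracy and 78.2% F1-macro on predicting editorial outcomes (Standard vs. Extended revision cycles), outperforming zero-shot Gemini-3.1-flash-lite-preview by 10.4 percentage points, with all comparisons statistically significant (McNemar p < 0.001). On generation, FirstPass produces reviews averaging 1,187 words, closer to human references (2,155 words) than any baseline, and achieves ROUGE-L 0.154, a significant gain over zero-shot Qwen and DeepSeek (p < 0.001). Deployed in the pre-submission loop as an anticipatory scientific collaborator, FirstPass simulates expert critique and predicts revision cycles, giving authors the kind of judgment a trusted colleague would provide, with consistent cross-domain performance across biology, chemistry, neuroscience, physics, and earth science.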

Link: https://arxiv.org/abs/2606.20769
Authors: Prabhjot Singh, Somnath Luitel, Manmeet Singh, Josh Durkee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the AI for Science Workshop at the 43rd International Conference on Machine Learning (ICML 2026). 9 pages, 2 figures, 6 tables

Abstract:AI systems for peer review fail on three fronts: they train on Computer Science and Machine Learning venues alone, ignore the iterative dialogue that validates science, and evaluate on stylistic mimicry rather than real editorial judgment. We introduce FirstPass, a dataset and fine-tuned model that addresses all three. Curating 3,668 complete multi-round peer-review dialogues from Nature Communications across five scientific domains (biology, chemistry, neuroscience, physics, and earth science), we exploit mandatory transparent peer review (instituted November 2022) and verify 100% content integrity by automated audit. We fine-tune Qwen2.5-7B-Instruct via Low-Rank Adaptation (LoRA) on three tasks: review generation, reviewer updating, and revision-cycle prediction. Our key finding is that response-only loss masking is a prerequisite, not an optimization: without it, accuracy is 62.0%, below the majority baseline; with it, FirstPass achieves 80.5% accuracy and F1-macro 78.2% on predicting editorial outcomes (Standard vs. Extended revision cycles), outperforming Gemini-3.1-flash-lite-preview zero-shot by 10.4 percentage points and all baselines with statistical significance (McNemar p < 0.001). On generation, FirstPass produces reviews averaging 1,187 words, substantially closer to human references (2,155 words) than any baseline, achieving ROUGE-L 0.154 with significant gains over Qwen and DeepSeek zero-shot (p < 0.001). Deployed in the pre-submission loop as an anticipatory scientific co-author, FirstPass simulates expert critique and predicts revision cycle outcomes before submission, giving authors the judgment a trusted colleague would provide, with consistent cross-domain performance across five disciplines.
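
Response-only loss masking, which this abstract singles out as a prerequisite, is commonly implemented by hiding prompt tokens from the loss. The sketch below illustrates that general idea and is not taken from the paper: `mask_prompt_labels` and `prompt_len` are hypothetical names, and the token ids are made up. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Build training labels in which only response tokens contribute to the loss.

    The first `prompt_len` positions (the prompt) are replaced by IGNORE_INDEX;
    the remaining positions (the response) keep their token ids.
    """
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must lie within the sequence")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```

With labels built this way, the gradient comes only from the response tokens, so the model is not rewarded for reproducing the prompt text.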

[NLP-202] From Sentiment to Actionable Insights: A Data-Driven Public Sentiment Analysis of Advanced Air Mobility

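[Quick read]: This paper addresses public acceptance of Advanced Air Mobility (AAM) systems during commercial deployment; the core challenge is to accurately identify and respond to the societal barriers the public may raise. The key is a high-accuracy sentiment analysis framework: seven text sentiment models spanning different paradigms (lexicon-based, machine learning, deep learning, and Transformer-based) are compared, ModernBERT is found to perform best in the AAM-specific setting, and it is then used to label more than 300,000 user-generated texts from Reddit and Quora. Latent Dirichlet Allocation (LDA) is applied within each sentiment class to uncover latent topics, revealing 20 topics and how they evolve over time. A further cross-sentiment topic clustering identifies six major areas of public concern: workforce and skill development (25.29%), regulation and compliance (24.64%), technical performance of drones (20.99%), military and geopolitics (14.58%), safety and operational risks (8.51%), and noise and disturbance (5.98%). By locating the sources of societal resistance through a data-driven approach, the study proposes actionable strategies and offers a basis for improving public acceptance and advancing AAM policy and commercial deployment.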

Link: https://arxiv.org/abs/2606.20751
Authors: Esrat Farhana Dulia, Amina Dhaher, Raiful Hasan, Syed Arbab Mohd Shihab
Affiliations: University of Kent
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Advanced Air Mobility (AAM) is an emerging low-altitude air transportation system whose successful deployment depends not only on technological advancement but also on public acceptance. This acceptance will drive government support, regulations, noise standards, and willingness to fly, and in turn the overall commercial viability of AAM. Understanding public sentiment toward AAM is therefore essential for identifying its societal barriers and informing its adoption strategies. This study analyzes 306,009 human-generated texts collected from Reddit and Quora to examine public discourse on AAM using AI-based models. Because multiple sentiment analysis models exist, identifying the most accurate model is critical for reliable AAM sentiment prediction and trustworthy public opinion analysis. Accordingly, seven models spanning lexicon-based, machine learning, deep learning, and transformer-based approaches are evaluated for AAM-specific sentiment classification. ModernBERT achieves the best classification performance and is used to label the full dataset. Using the resulting sentiment labels, Latent Dirichlet Allocation (LDA) is applied within each sentiment class to uncover latent topics in public opinion. The analysis identifies 20 distinct topics and traces their temporal evolution from 2008 to 2025. A cross-sentiment topic analysis further reveals six major clusters of public concern: workforce and skill development (25.29% of the dataset), regulation and compliance (24.64%), technical performance of drones (20.99%), military, geopolitics, and defense (14.58%), safety and operational risks (8.51%), and noise and disturbance (5.98%). Based on these findings, this study provides actionable strategies to address these concerns, thereby improving public acceptance and supporting AAM deployment.

[NLP-203] VeriBound: PAC-Bayesian Generalization Bounds for Process Reward Models Trained with Formal Verification Tools

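[Quick read]: This paper addresses the training-data bottleneck for Process Reward Models (PRMs) used in LLM reasoning: human annotation is costly and Monte Carlo roll-out estimates are noisy. The recent FOVER approach uses formal verification tools (such as Z3 and Isabelle) to annotate step-level error labels automatically and observes cross-task generalization, but this phenomenon lacks a theoretical explanation, and no formal bounds exist on generalization error, sample complexity, convergence rate, or downstream Best-of-K performance. The paper proposes VeriBound, a theoretical framework that provides the first PAC-Bayesian generalization bounds for PRMs trained with formal verification tools. Its main contributions are: (i) a PAC-Bayesian generalization bound, depending on formal verification accuracy and the divergence between training and test task distributions, that relates the verification error on training data to the expected error on unseen tasks; (ii) a sample complexity result showing that $O(d \log(d/\delta) / \epsilon^2)$ formal-verification-annotated examples suffice to achieve generalization error $\epsilon$ with probability $1-\delta$, where $d$ is the complexity of the PRM hypothesis class; (iii) a proof that PRM training with formal verification labels converges at a linear rate under $L$-smoothness and bounded variance conditions; and (iv) an error propagation bound that quantifies how step-level verification error degrades Best-of-K performance. Together these results provide a systematic theoretical foundation for the effectiveness and scalability of formal-verification-driven PRM training.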

Link: https://arxiv.org/abs/2606.20740
Authors: Amirul Rahman, Mohammed Sabih Alsharari
Affiliations: University of Malaya
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Process Reward Models (PRMs) provide step-level verification for Large Language Model (LLM) reasoning, yet their training data acquisition remains a bottleneck: human annotation is costly and Monte Carlo roll-out estimates are noisy. A recent approach, FOVER, trains PRMs on step-level error labels automatically annotated by formal verification tools such as Z3 and Isabelle, and empirically observes cross-task generalization from symbolic tasks to diverse reasoning benchmarks. However, this generalization phenomenon lacks any theoretical explanation, and no formal bounds exist on the generalization error, sample complexity, convergence rate, or downstream Best-of-K performance of such PRMs. We propose VeriBound, a theoretical framework that provides PAC-Bayesian generalization bounds for PRMs trained with formal verification tools. We establish four main results: (i) a PAC-Bayesian generalization bound that relates the empirical verification error on formal-verification-annotated training data to the expected error on unseen reasoning tasks, with the bound depending on the formal verification accuracy and the divergence between training and test task distributions; (ii) a sample complexity result showing that $O(d \log(d/\delta) / \epsilon^2)$ formal-verification-annotated examples suffice to achieve generalization error $\epsilon$ with probability $1-\delta$, where $d$ is the complexity of the PRM hypothesis class; (iii) a convergence analysis proving that PRM training with formal verification labels converges at a linear rate under $L$-smoothness and bounded variance conditions; and (iv) an error propagation bound that relates step-level verification error to Best-of-K performance degradation.
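
To get a feel for how the sample complexity expression in result (ii) scales, the sketch below evaluates d·log(d/δ)/ε² numerically. The big-O notation hides a constant that the abstract does not give, so the multiplier `c` is a hypothetical placeholder and the returned numbers are only meaningful relative to one another.

```python
import math


def sample_bound(d: float, eps: float, delta: float, c: float = 1.0) -> float:
    """Evaluate c * d * log(d / delta) / eps**2.

    `c` stands in for the unknown constant hidden by the O(.) notation (assumption).
    """
    if d <= 0 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("need d > 0, eps > 0 and 0 < delta < 1")
    return c * d * math.log(d / delta) / eps ** 2
```

Because the expression depends on ε only through 1/ε², halving the target error quadruples the value, while tightening δ increases it only logarithmically.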

[NLP-204] VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers

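[Quick read]: This paper addresses the performance degradation that vision tools (such as open-vocabulary detectors, segmentation models, and post-processing operators) suffer on complex visual tasks with dense objects, occlusion, small targets, and domain shift when their composition and parameters are fixed. Existing visual-programming agents typically generate a static solution pipeline and cannot adapt to changing scenes. The core solution is VTOS (Vision Tools Orchestration Search), an adaptive tool-orchestration framework that jointly searches executable solution programs and diagnostic observer programs. Solution programs compose vision tools such as Grounding DINO, SAM, non-maximum suppression (NMS), and slice-and-detect; observer programs diagnose candidate solutions, identify failure modes, and generate actionable feedback. These observations accumulate in a shared VisionThoughts knowledge base that guides subsequent search. Two case studies, dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD, evaluate the method on threshold calibration, NMS, slicing, mask refinement, and domain generalization. The results indicate that co-searching solutions and observers is an effective way to make vision tools more robust in complex scenes.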

Link: https://arxiv.org/abs/2606.20728
Authors: Jinchao Ge, Lingqiao Liu, Shuwen Zhao, Lei Wang
Affiliations: University of Wollongong; Adelaide University; Tianjin University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 18 pages, 5 figures, 9 tables

Abstract:Vision foundation tools such as open-vocabulary detectors, segmentation models, and post-processing operators are powerful building blocks for computer vision, but their effectiveness depends heavily on how they are orchestrated: which tools are used, in what order, with what parameters, and under what visual conditions. Existing visual-programming agents typically generate a fixed solution pipeline, making them brittle under dense objects, occlusion, small targets, and domain shift. We introduce VTOS (Vision Tools Orchestration Search), a framework for adaptive visual tool orchestration through joint solution–observer search. VTOS co-searches executable solution programs that compose vision tools such as Grounding DINO, SAM, NMS, and slice-and-detect, together with observer programs that diagnose candidate solutions, identify failure modes, and generate actionable feedback. These observations are accumulated in a shared VisionThoughts knowledge base to guide subsequent search. We evaluate VTOS through two case studies: dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD, which stress different orchestration challenges including threshold calibration, NMS, slicing, mask refinement, and domain generalization. Across both tasks, VTOS outperforms static tool pipelines and agentic visual-programming baselines, showing that co-searching solutions and observers is an effective strategy for adapting vision tools to challenging computer vision tasks.
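
NMS is one of the post-processing operators whose threshold the abstract lists among the orchestration choices. For readers unfamiliar with it, here is a minimal pure-Python sketch of standard greedy non-maximum suppression; it illustrates the generic operator, not VTOS itself, and the box format `(x1, y1, x2, y2)` and the default threshold are assumptions.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed box layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: list[Box], scores: list[float], iou_thresh: float = 0.5) -> list[int]:
    """Greedy NMS: keep boxes in descending score order unless they overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: list[int] = []
    for i in order:
        # Keep box i only if its IoU with every already-kept box is within the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

In dense scenes a low `iou_thresh` merges neighbouring objects and undercounts, while a high one leaves duplicates, which is why this parameter is a natural target for the kind of calibration the abstract mentions.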

[NLP-205] Multimodal Image Colorization: Quantifying the Impact of Text-Conditioned Guidance on Grayscale-to-Color Translation

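[Quick read]: This paper addresses automatic colorization of grayscale images, a core challenge in computer vision: many plausible colorizations can correspond to the same grayscale input, so results can lack consistency and accuracy. The key to the solution is text conditioning: semantic information from a CLIP text encoder guides generation and improves both pixel-level and perceptual quality. Experiments compare two architectures, a U-Net and Stable Diffusion 1.5, each with and without text conditioning. Text conditioning improves peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and colorfulness while reducing learned perceptual image patch similarity (LPIPS), indicating consistent, measurable quality gains across both architecture scales.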

Link: https://arxiv.org/abs/2606.20722
Authors: Colten Reissmann, Hugo Garrido-Lestache Belinchon
Affiliations: Unknown
Categories: Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Grayscale images are commonly found in historical photography restoration, medical imaging, and artistic media. However, automatically applying color to these images remains a significant challenge in computer vision because many plausible colorizations can correspond to the same grayscale input. In this work, we quantify the effect of text conditioning on pixel-level and perceptual metrics for grayscale-to-color image models. Specifically, we compare two architectures, a U-Net and Stable Diffusion 1.5, each tested with and without CLIP text conditioning while holding all other variables constant. Our results show that text conditioning improves PSNR by 5.6%, SSIM by 1.2%, and colorfulness by 36.6%, while reducing LPIPS by 7.6% in the U-Net tier. In the Stable Diffusion tier, text conditioning improves PSNR by 5.8%, SSIM by 1.5%, and colorfulness by 0.6%, while reducing LPIPS by 11.3%. These results indicate that text conditioning provides consistent, measurable improvements to colorization quality across both architecture scales.
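
PSNR, the first pixel-level metric reported above, has a standard closed form: 10·log10(MAX² / MSE). The sketch below computes it on flat lists of pixel values. It is a generic illustration, not the authors' evaluation code; the function name and the 8-bit default `max_value` of 255 are assumptions.

```python
import math


def psnr(reference: list[float], estimate: list[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Since PSNR is logarithmic, a relative gain such as the reported 5.6% refers to the dB value itself and corresponds to a larger proportional drop in mean squared error.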

[NLP-206] MindAlign: Decoding Inner Speech from fMRI Signals via Multimodal Embedding Alignment under Limited Data

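[Quick read]: This paper addresses the problem of decoding inner speech from non-invasive brain signals such as fMRI. The core challenges are the absence of overt linguistic output, limited training data, and large inter-subject neural variability. Existing brain-to-text methods usually rely on task-specific decoder fine-tuning, which limits scalability and makes adaptation to new subjects difficult. The paper proposes MindAlign, a decoupled two-stage brain-to-language framework that generates open-ended text without modifying the pretrained language model. The first stage learns a subject-specific neural-semantic alignment that maps fMRI activity into a shared multimodal semantic space and extracts a latent semantic sketch of the internally generated sentence. The second stage combines this sketch with visual context to prompt a frozen multimodal language model for free-form generation. Experiments on a silent image description task show that the method consistently outperforms fMRI-only and random baselines, and that the semantic-to-language projection generalizes across subjects when paired with subject-specific neural alignment. The results indicate that neural signals carry semantic content beyond image-driven priors, supporting a scalable, modular approach to brain-to-text decoding.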

Link: https://arxiv.org/abs/2606.20696
Authors: Muxuan Liu, Ichiro Kobayashi, Satoshi Nishida
Affiliations: Ochanomizu University; National Institute of Information and Communications Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Preprint. Under review

Abstract:Decoding inner speech from non-invasive brain signals remains a fundamental challenge due to the absence of overt linguistic output, limited training data, and large inter-subject variability. Existing brain-to-text approaches often rely on task-specific decoder fine-tuning, which restricts scalability and complicates adaptation to new participants. We propose MindAlign, a decoupled two-stage brain-to-language framework that enables open-ended text generation from fMRI signals without modifying the underlying language model. The first stage learns a subject-specific neural-semantic alignment that maps fMRI activity into a shared multimodal semantic space, extracting a latent semantic sketch of the internally generated sentence. The second stage integrates this sketch with visual context to prompt a frozen multimodal language model for free-form generation. Experiments on fMRI data collected during silent image description demonstrate that the proposed approach consistently outperforms fMRI-only and random baselines. We further show that the learned semantic-to-language projection can generalize across subjects, enabling effective decoding when paired with subject-specific neural alignment. These results indicate that neural signals modulate semantic content beyond image-driven priors, supporting a scalable and modular direction for brain-to-text decoding.

[NLP-207] Specific Domain Ontology Construction Using Large Language Models KR

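[Quick read]: This paper addresses the difficulty of organizing and sharing information in specific domains that lack a reference ontology, focusing on the complex and specialized domain of the Brazilian maritime territory (the "Blue Amazon"). Because building ontologies by hand is slow and laborious, many domains lack high-quality structured knowledge. The paper proposes using Large Language Models (LLMs) in the role of domain experts to build a conceptual hierarchy automatically from an initial concept, thereby speeding up ontology creation. The key to the solution is to have models such as GPT-3.5 and GPT-4 expand a given initial concept semantically into a logically consistent concept hierarchy that gives a first structured representation of domain knowledge. The experiments show that the generated ontologies are coherent overall, but all of them need further manual refinement before they can be used directly, which highlights both the potential and the limits of current LLMs as an aid to ontology construction.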

Link: https://arxiv.org/abs/2606.20691
Authors: Vivian Magri Alcaldi Soares, Renata Wassermann
Affiliations: University of São Paulo (USP); Center for Artificial Intelligence (C4AI)
Categories: Computation and Language (cs.CL)
Comments: Presented at NeLaMKRR@KR, 2025 (arXiv:2511.09575)

Abstract:Ontologies are useful structures to organize and maintain information that can be understood both by humans and systems. However, since their manual crafting is a laborious task, many specific domains lack reference ontologies. The outstanding ability to understand natural language demonstrated by Large Language Models (LLMs) has motivated their application in a variety of fields, including ontology development. This work presents experiments with a technique that uses LLMs in the role of domain experts to build conceptual hierarchies for a given initial concept. Twenty ontologies automatically constructed for the domain of the Brazilian maritime territory (a.k.a. the Blue Amazon) using GPT-3.5 and GPT-4 were then evaluated by human experts. The models were able to construct overall coherent conceptualizations of the domain, but none of the outputs was completely satisfactory as a representation of the context without refinement.

[NLP-208] From Question Answering to Task Completion: A Survey on Agent System and Harness Design

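[Quick read]: This paper asks where the performance bottleneck of Large Language Model (LLM) based agent systems lies: in the foundation model, in the execution harness, or in the coupling between them. Its core contribution is a model-harness lens that treats an LLM-based agent as a foundation model coupled with an execution harness, and decomposes the harness into six interrelated runtime responsibilities: observation, context, control, action, state, and verification. Using this decomposition, the survey analyzes how task properties and domain requirements shape harness configurations, reviews existing benchmarks and evaluation practices, and synthesizes evidence on how runtime design affects long-horizon task completion, efficiency, and reliability. It argues that agent quality (success, efficiency, safety, and generalization) is not determined by model capability alone but emerges from the interaction between model capability, runtime infrastructure, task structure, and evaluation design.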

Link: https://arxiv.org/abs/2606.20683
Authors: Jianyuan Guo, Zhiwei Hao, Chengcheng Wang, Cheng Fan, Tingzhang Luo, Hongguang Li, Ying Gao, Hefei Mei, Jiankun Peng, Rongjian Xu, Minjing Dong, Han Wu, Mengyu Zheng, Kai Han, Shiqi Wang, Chang Xu, Yunhe Wang
Affiliations: City University of Hong Kong; University of Sydney; Peking University; TokenRhythm Technologies
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:LLM-based agents mark a shift from passive question answering to active task completion: they perceive environments, invoke tools, maintain state, and act over extended horizons. As agent systems have evolved from prompt engineering to workflows and context engineering, harness engineering, and agent-native training with co-evolution, a central question has become increasingly important: where does the bottleneck in agent performance reside, in the foundation model, in the execution harness, or in the coupling between them? This survey examines LLM-based agents through a model-harness lens. We first clarify the functional definition of agents and the implementation view of an LLM-based agent as a foundation model coupled with an execution harness. We then analyze the limits of model-centric scaling, trace four paradigms of agent engineering, and decompose the execution harness into six coupled runtime responsibilities: observation, context, control, action, state, and verification. Using this decomposition, we map task properties and domain pressures to harness configurations, review benchmark and evaluation practices, and synthesize model-harness evidence on how runtime design affects long-horizon task completion, efficiency, and reliability. Finally, we identify open challenges in value-aware evaluation, safety, harness generalization, and model-harness co-evolution. Rather than treating agents as models with auxiliary tools, this survey argues that agent quality – including success, efficiency, safety, and generalization – emerges from the interaction between model capability, runtime infrastructure, task structure, and evaluation design. A collection of papers discussed in this survey is provided in this https URL.

[NLP-209] Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

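[Quick read]: This paper addresses the breakdown of MLLM-as-a-Judge evaluation in cross-cultural settings: when the human annotator pool is culturally heterogeneous, agreement with human annotations is no longer a reliable validation criterion. The core of the solution is VOIR DIRE, a multimodal benchmark of 626 culturally paired image-prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, on which the two annotator pools diverge in their evaluations (cross-pool correlation Q1 r = -0.12). Across six MLLMs, the bias decomposes into two failures: a positivity-floor calibration failure (compressed use of the rating scale) and a cultural orientation failure (defaulting to one cultural norm). Persona prompting partially recovers calibration, but the orientation residual remains, which shows that the bias is not reducible to scale compression. In-context demonstrations from a reference pool deepen the orientation residual and inflate the high end of the scale rather than restoring use of the low end, while model origin adds a small additive tilt (about 0.10 MAE) that stays roughly constant under demonstration. The paper recommends reporting alignment against each reference pool separately and treating cross-pool divergence as a property of the judge rather than as error to be eliminated.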

Link: https://arxiv.org/abs/2606.20676
Authors: Daniel Lee, Harsh Sharma, Eunkyu Park, Pranav Narayanan Venkit, Jeonghwan Kim, Kah Mun Chia, Andreas Vlachos, Shafiq Joty
Affiliations: Salesforce AI Research; University of Cambridge; University of Colorado Boulder; Carnegie Mellon University; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under Review

Abstract:MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is undefined when the human pool is culturally heterogeneous. We introduce VOIR DIRE, a multimodal benchmark of 626 culturally paired image–prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, with annotator pools that are within-pool reliable (a = 0.86/0.74) but cross-pool divergent on evaluation (Q1 r = -0.12). Across six MLLMs, the bias decomposes into two failures: a positivity-floor calibration failure (compressed scale use) and an orientation failure (default to one cultural norm). On this corpus, where contested items are sampled to split the two pools, the floor mechanically validates the more-permissive Chinese reading; persona prompting partially recovers calibration, but the orientation residual survives, evidence the tilt is not reducible to scale compression. Reference-pool in-context demonstrations deepen the orientation residual and inflate the high end rather than restoring use of the low end. Model origin adds a small additive tilt (~0.10 MAE) that is approximately invariant under demonstration. We recommend reporting alignment against each reference pool separately and treating cross-pool divergence as a judge property.
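
The closing recommendation, reporting alignment against each reference pool separately, can be sketched as below. This is an illustrative assumption about what such a report might compute: mean absolute error (the unit the abstract uses for the model-origin tilt) between a judge's scores and each pool's ratings. The function names and pool keys are hypothetical.

```python
def mae(judge: list[float], reference: list[float]) -> float:
    """Mean absolute error between judge scores and one pool's ratings."""
    if not judge or len(judge) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(j - r) for j, r in zip(judge, reference)) / len(judge)


def alignment_by_pool(judge: list[float], pools: dict[str, list[float]]) -> dict[str, float]:
    """Report one MAE per reference pool instead of a single pooled figure."""
    return {name: mae(judge, ratings) for name, ratings in pools.items()}
```

Averaging the two pools into one reference before scoring would hide exactly the divergence the benchmark is built to expose, since a judge could match the blended ratings while matching neither pool.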

[NLP-210] DrugBench: Evaluating AI Control Protocols for Medication Harm Mitigation

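[Quick read]: This paper addresses the safety risk that misaligned outputs from Large Language Models (LLMs) in medical question answering can cause medication-related harm. Because medical decisions are high-stakes, incorrect or misleading output can lead to severe patient harm, so effective safety controls are needed. The key to the solution is DrugBench, a systematic evaluation benchmark that combines 3,671 multi-turn medical conversations from HealthBench with drug information from official U.S. Food and Drug Administration (FDA) labels, covering four categories of medication-related risk: drug interactions, contraindications, dosing constraints, and patient action restrictions. Going beyond evaluation that considers only the probability of unsafe output, the paper argues that safety assessment should also account for the severity of the harm; under this definition it shows that existing control protocols can be subverted, and it proposes severity-based monitoring to improve safety.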

Link: https://arxiv.org/abs/2606.20663
Authors: Guido Freire, Agustín Martínez-Suñé, Viviana Cotik
Affiliations: Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Computación, Argentina; AI Safety Argentina (AISAR); Department of Computer Science, University of Oxford, United Kingdom; CONICET-Universidad de Buenos Aires, Instituto de Ciencias de la Computación (ICC), Argentina
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models have the potential to expand and improve the access to clinical information by enabling new ways of interacting with medical knowledge in natural language. However, their deployment in medical question-answering settings is safety-critical, since misaligned outputs can lead to severe patient harm. AI control is an emerging approach that introduces external safeguards to mitigate unsafe behaviours in misaligned systems and has been shown to be effective in domains such as code generation. However, its applicability and effectiveness in medical settings have not been systematically studied. In this work, we present a pipeline for evaluating AI control protocols to mitigate medication-related harm. To this end, we introduce DrugBench, an AI control evaluation benchmark which combines 3,671 multi-turn medical conversations from HealthBench with drug information from official FDA labels, covering four categories of medication-related harm: drug interactions, contraindications, dosing constraints, and patient action restrictions. Furthermore, inspired by the medical domain, we argue that safety should account for the severity of unsafe outputs, not just their probability. Under this revised definition, we show that existing control protocols can be subverted and propose severity-based monitoring to address this limitation.
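
The argument that safety should reflect severity and not only probability can be made concrete with a toy comparison. The sketch below is an assumption-laden illustration, not DrugBench's metric: it supposes each output carries an integer severity score where 0 means safe, and both function names are hypothetical.

```python
def unsafe_rate(severities: list[int]) -> float:
    """Probability-style metric: share of outputs with any harm (severity > 0)."""
    if not severities:
        raise ValueError("need at least one output")
    return sum(1 for s in severities if s > 0) / len(severities)


def mean_severity(severities: list[int]) -> float:
    """Severity-aware metric: average harm severity over all outputs."""
    if not severities:
        raise ValueError("need at least one output")
    return sum(severities) / len(severities)
```

Two protocols that both let half of their outputs through unsafely look identical under `unsafe_rate`, yet one may allow only mild harms and the other severe ones; `mean_severity` separates them. How the paper actually scores severity is not stated in the abstract.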

[NLP-211] From Knowing to Acting: Benchmarking Self-Awareness Capability of LLM Agents

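[Quick read]: This paper addresses the lack of self-awareness evaluation for Large Language Model (LLM) agents that have gained autonomous execution through external tools. Existing benchmarks focus on execution success and neglect the agent's ability to judge whether a problem requires external resources or can be solved with internal parametric knowledge alone. The authors propose KAPRO (Knowing-Acting Quadrant PRObe), a framework that decouples the agent's metacognitive judgment (Knowing: whether external resources are needed) from its spontaneous execution (Acting: whether a tool is called) in order to evaluate cognitive-behavioral alignment. To probe this boundary systematically, they build the KAware dataset, which partitions tasks into external, internal, and hybrid subspaces. Experiments show that self-awareness is strongly correlated with task success but degrades sharply in settings that rely only on internal knowledge; open-source and instruction-following models show stronger tool overuse because of shallow pattern matching, while proprietary and reasoning-oriented models demonstrate more reliable cognitive gating.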

Link: https://arxiv.org/abs/2606.20661
Authors: Yifan Li, Shengbin Yue, Boyu Feng, Jinhu Qi, Bo Ke, Zixing Song, Hongru Wang, Zhongyu Wei, Irwin King
Affiliations: The Chinese University of Hong Kong; Fudan University; University of Edinburgh; Tencent; University of Bristol
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The integration of external tools has transitioned LLM agents from passive responders to autonomous systems. However, current benchmarks prioritize execution success, neglecting self-awareness capability, the ability to discern whether a problem requires necessary external resources or can be solved via internal parametric knowledge. To address this, we introduce KAPRO (Knowing-Acting Quadrant PRObe), a framework that evaluates cognitive-behavioral alignment by decoupling an agent’s metacognitive judgment (Knowing) from its spontaneous execution (Acting). We further construct KAware, a dataset rigorously partitioning tasks into external, internal, and hybrid subspaces to systematically probe these epistemic boundaries. Extensive experiments across diverse agent architectures show that self-awareness capability is strongly correlated with task success but degrades sharply in internal-capability settings. Moreover, open-source and instruction-following models exhibit stronger tool overuse due to shallow pattern matching, while proprietary and reasoning-oriented models demonstrate more reliable cognitive gating. Benchmark and codes are available at this https URL.
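
Decoupling Knowing from Acting suggests a two-by-two grid over what the agent says it needs and what it actually does. The sketch below shows one way such a grid could be tabulated; the quadrant names, the boolean encoding, and the alignment measure are hypothetical and are not taken from the paper.

```python
from collections import Counter


def quadrant(says_needs_tool: bool, calls_tool: bool) -> str:
    """Place one episode in a Knowing x Acting grid (labels are illustrative)."""
    knowing = "knows-external" if says_needs_tool else "knows-internal"
    acting = "acts-with-tool" if calls_tool else "acts-without-tool"
    return f"{knowing}/{acting}"


def alignment_rate(episodes: list[tuple[bool, bool]]) -> float:
    """Share of episodes where the stated judgment matches the observed behavior."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for knows, acts in episodes if knows == acts) / len(episodes)


def quadrant_counts(episodes: list[tuple[bool, bool]]) -> Counter:
    """Count how many episodes fall in each quadrant."""
    return Counter(quadrant(knows, acts) for knows, acts in episodes)
```

In this encoding, the tool overuse the abstract describes would show up as mass in the `knows-internal/acts-with-tool` cell: the agent judges that it does not need a tool and calls one anyway.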

[NLP-212] EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis INTERSPEECH2026

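[Quick read]: This paper addresses the limits of emotion control in existing instruction-based controllable speech synthesis, which generally relies on coarse emotion labels and lacks explicit modeling of fine-grained intensity. The core solution is EmoInstruct-TTS, a dual-path instruction-guided framework. It introduces Emotion2embed, a supervised semantic-acoustic emotion embedding covering 48 emotional states (including fine-grained categories and intensity levels), and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) that infers acoustically grounded emotion representations from natural-language instructions. These are integrated into a Large Language Model (LLM) based synthesis pipeline to give explicit emotional control while preserving semantic planning. Experiments show improved emotional controllability and speech naturalness over strong baselines.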

Link: https://arxiv.org/abs/2606.20650
Authors: Minghui Wu, Ganjun Liu, Zikun Fang, Ting Meng, Hongchuan Wu, Bingao Xu, Yonglong Cai, Jiasheng Chen, Jun Du
Affiliations: iFLYTEK Research; Huawei Technologies Co., Ltd.; University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 3 figures, 4 tables. Submitted to Interspeech 2026. Audio demos: this https URL

Abstract:Instruction-based controllable speech synthesis enables users to specify emotions through natural language. However, existing approaches often rely on coarse emotion labels and lack explicit modeling of fine-grained intensity. We propose EmoInstruct-TTS, a dual-path instruction-guided framework for emotional speech synthesis. We introduce Emotion2embed, a supervised semantic-acoustic emotion embedding covering 48 emotional states, including fine-grained categories and intensity levels. To infer embeddings from free-form instructions, we design an Instruction-Conditioned Emotion Flow Model (ICE-Flow) that generates acoustically grounded emotion representations. The inferred embeddings are integrated into an LLM-based synthesis pipeline to provide explicit emotional control while preserving semantic planning. Experiments show improved emotional controllability and speech naturalness over strong baselines.

[NLP-213] SkillHarness: Harnessing Safe Skills for Computer-Use Agents

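[Quick Read]: This paper addresses the safety and robustness problems that Computer-Use Agents (CUAs) face during continual skill learning in dynamic interactive environments. Existing methods typically assume a static, safe environment and overlook the risks posed by adversarial interactions (e.g., prompt injection) and environmental dynamics (e.g., pop-ups), so learned skills may be unsafe and brittle in execution. To address this, the paper proposes SkillHarness, a framework whose core idea is to model skill learning and utilization as a safety-constrained interaction process. Its key innovations are a "skill boundary" that uses multi-source supervision signals to identify safe skills from interaction trajectories and builds self-improving safety constraints throughout the skill lifecycle, and a selective skill reuse mechanism that decomposes tasks according to context and activates the corresponding skill subsets. Experiments show that SkillHarness reduces the unsafe rate of learned skills by 57.1% and markedly improves execution stability under dynamic environmental changes, outperforming existing baselines.

Link: https://arxiv.org/abs/2606.20636
Authors: Yurun Chen,Biao Yi,Keting Yin,Shengyu Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Work in progress

Click to view abstract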

Abstract:Computer-Use Agents (CUAs) are increasingly deployed in dynamic interactive environments, creating a growing need for continual skill learning during interaction. Recent approaches address this challenge by learning reusable skills from successful trajectories. However, these skill learning methods largely assume static and safe environments, overlooking risks from adversarial interactions (e.g., prompt injections) and environmental dynamics (e.g., pop-ups). In dynamic settings, such assumptions can lead to risky skill learning and brittle execution, undermining the reliability of CUAs. This raises the question: how can CUAs learn and use skills safely in dynamic environments? To address this problem, we propose SkillHarness, a framework for safe skill harnessing in dynamic environments. SkillHarness moves beyond static skill abstractions by modeling skill learning and utilization as a safety-constrained interaction process. Specifically, we introduce the skill boundary that leverages multi-source supervision signals to identify safe skills from interaction trajectories, and construct self-improving safety constraints throughout the skill lifecycle. In addition, SkillHarness introduces selective skill reuse, where tasks are guided to decompose according to context and completed through the selective activation of skill subsets. Our experiments demonstrate that SkillHarness significantly reduces the unsafe rate of learned skills by 57.1% and consistently improves execution stability under dynamic environmental changes, outperforming existing baselines.

[NLP-214] Post-Training Recipe More Than Model Family Shapes Multi-Agent LLM Conversational Behavior

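[Quick Read]: This paper addresses insufficient diversity of conversational behavior in multi-LLM systems. The central question is whether the conventional strategy of selecting diverse models by "model family" label still holds in the interactive multi-model settings that deployed systems actually use. Prior offline studies recommend picking one model per family for diversity, because LLMs prefer outputs from their own family when rating one another in isolation. This study finds, however, that in interactive multi-model systems, models sharing the same base behave very differently depending on their conversation partner: for example, the hedging metric of a reasoning-distilled Llama checkpoint shifts by 18% depending on which same-base partner it replies to, more than any cross-family hedging gap in the controlled subset. This indicates that the post-training recipe is a key factor shaping conversational behavior and that the family label alone is an insufficient proxy for diversity. The key to the solution is therefore to treat the post-training recipe as a first-class dimension in multi-model system design and to choose model combinations through empirical evaluation in real interactive settings.

Link: https://arxiv.org/abs/2606.20632
Authors: Luyang Zhang,Jialu Wang,Fei Xue,Yi-Yun Chu
Affiliations: Carnegie Mellon University; University of California, Santa Cruz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract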

Abstract:Multi-LLM systems use multiple language models to deliberate, judge each other’s outputs, or coordinate as agents. Their value depends on the models producing measurably different conversational behaviors when given the same input. Prior offline studies recommend drawing one model per family for behavioral diversity, because LLMs prefer outputs from their own family when rating one another in isolation. Whether the same family label predicts behavior in interactive multi-LLM systems, the setting that real deployed systems use, has not been tested. We study this with a 940,000-chain 11-checkpoint corpus and a 1.6M-chain same-base Llama factorial. On our validated headline metric, hedging, a reasoning-distilled Llama checkpoint shifts by 18% depending on which same-base partner it replies to, more than any cross-family hedging gap in the controlled subset. Qwen, closed-API, and runtime checks suggest the pattern is not isolated, while repair and challenge analyses remain exploratory because their surface-cue detectors are weaker. Overall, the results identify post-training recipe as a first-class axis for multi-LLM panel composition and show that model family alone is an incomplete proxy for conversational diversity.

[NLP-215] AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents

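[Quick Read]: This paper addresses the problems that large language model (LLM) agents face in alpha mining (factor mining): a combinatorially exploding search space, noisy and non-stationary feedback, redundant discoveries, and the overfitting risk of blindly reusing past successes. The core solution is AlphaMemo, a self-evolving alpha mining agent with Structured Search-Process Memory. The key innovation is that, instead of memorizing only final factors or full search trajectories, it records reusable edit motifs that work or fail under specific parent-factor contexts. It extracts these motifs from Abstract Syntax Tree (AST) differences and combines a confidence-gated residual memory with an asymmetric veto control strategy to suppress high-confidence failure patterns. Experiments on CSI 500 and S&P 500 show improved out-of-sample performance and fixed-budget discovery efficiency, and ablations confirm the roles of residual learning, confidence gating, AST-diff motifs, and veto memory.

Link: https://arxiv.org/abs/2606.20625
Authors: Hang Yu,Zifan Zheng,Jeff Z. Pan,Tongliang Liu,Zhiyong Wang,Fengxiang He
Affiliations: University of Sydney; University of Edinburgh
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract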

Abstract:LLM agents are promising for alpha mining via combining financial priors, symbolic reasoning, executable factor generation, and feedback-driven refinement. Yet, they face a combinatorial search space, noisy non-stationary feedback, redundant discoveries, and overfitting risks from naively reusing past successes. To address these challenges, we propose AlphaMemo, a self-evolving alpha mining agent with Structured Search-Process Memory. Rather than memorizing only final factors or full trajectories, AlphaMemo records reusable evidence about which edit motifs work or fail under specific parent-factor contexts. It extracts motifs from Abstract Syntax Tree (AST) differences, applies confidence-gated residual memory on top of a search-ledger prior, and uses asymmetric veto control to suppress high-confidence failure patterns. Experiments on CSI 500 and S&P 500 show improved out-of-sample performance and fixed-budget discovery efficiency, with ablations validating the roles of residual learning, confidence gating, AST-diff motifs, and veto memory. Code is at this https URL.

[NLP-216] In LLM Reasoning there is Irrationality on top of Value Misalignment

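[Quick Read]: This paper addresses the problem that large language models (LLMs), even when well value-aligned, may fail to maximize the aligned value function at reasoning time. The core issue is that even if a model has been aligned with a target value function during training, the strategy it uses in actual reasoning may deviate from the rational optimal strategy, leaving value unrealized. The paper introduces the concept of "rational value risk", which quantifies the utility gap between the model's deployed reasoning strategy and the rational strategy it should in theory adopt, defined by the direction of steepest expected-utility increase. The key to the approach is to decompose the estimation error of rational value risk along three dimensions (finite candidate answers, finite prompts, and imperfect verifiers) and to run extensive experiments covering major models (the Llama-3.1, Qwen-2.5, and Tülu-3 families, GPT-5.2/5.5, and DeepSeek-V4) and benchmarks (UltraFeedback, AlpacaEval, GSM8K, MATH, HumanEval, MathArena). The experiments show that rational value risk is widespread, that it is highly sensitive to the reasoning strategy, and that longer reasoning chains improve rationality with diminishing returns.

Link: https://arxiv.org/abs/2606.20624
Authors: Kejiang Qian,Fengxiang He
Affiliations: University of Edinburgh
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract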

Abstract:Significant progress has been made in aligning LLMs with target value functions. We argue that, even when an LLM has been well aligned in (post-)training, it may still fail to maximise the aligned value in reasoning. We mathematically formalise this gap as rational value risk: the utility discrepancy between a model’s deployed reasoning strategy and its rational counterpart, which is defined to be the responses that maximise expected utility in the steepest direction. The estimation error of rational value risk is further decomposed into three components from finite candidates, finite prompts, and imperfect verifiers. Extensive experiments are conducted, covering models Llama-3.1, Qwen-2.5, Tülu-3 families (7B-72B), GPT-5.2, GPT-5.5, and DeepSeek-V4, and benchmarks UltraFeedback, AlpacaEval, GSM8K, MATH, HumanEval, and MathArena. The results validate that (1) rational value risk is widespread; (2) value alignment can reduce, but cannot eliminate, it; (3) the risk is highly sensitive to inference-time reasoning strategy; and (4) longer reasoning improves rationality with diminishing returns. The code is at this https URL.

[NLP-217] Path-dependent program induction under resource constraints explains human sequence learning

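[Quick Read]: This paper addresses the core question of how individuals build abstract, reusable knowledge from sequential experience under limited cognitive resources. The key to its solution is a hierarchical Adaptor Grammar (HAG) that distinguishes local (within-task) from global (across-task) libraries and is governed jointly by constraints on memory and computation. By combining rate-distortion theory with program induction, the framework describes how prior knowledge shapes which future structures are cheap to encode and easy to discover. In simulations, HAG achieves better rate-distortion trade-offs and stronger generalization than fixed grammars or shallow chunking methods. An online melodic sequence-learning experiment further shows that participants' recall errors reflect systematic simplifications, that reaction times increase at inferred program boundaries, and that, in trial-by-trial fits, hierarchical libraries best explain individual differences in recall and out-of-sample continuation choices. The study recasts structured learning as bounded program induction and emphasizes that the order of experience shapes the abstractions a learner builds in the future.

Link: https://arxiv.org/abs/2606.20623
Authors: Hanqi Zhou,David G. Nagy,Peter Dayan,Charley M. Wu
Affiliations: University of Tübingen; Technical University Darmstadt; Hessian.AI; Max Planck Institute for Biological Cybernetics
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Click to view abstract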

Abstract:How do people build abstract, reusable knowledge from sequential experience under bounded cognitive resources? To answer this question, we integrate rate-distortion theory with recent advances in program induction to describe how prior knowledge shapes which future structures are cheap to encode and easy to discover. We formalize this in a hierarchical Adaptor Grammar (HAG) with distinct local (within-task) and global (across-task) libraries, governed jointly by constraints on memory and computation. In simulations, HAG achieves better rate-distortion trade-offs and stronger generalization than fixed grammars or shallow chunking methods. In an online melodic sequence-learning experiment, participants’ recall errors reflected systematic simplifications and reaction times increased at inferred program boundaries. Trial-by-trial fits further showed that hierarchical libraries best explained individual differences in both recall and out-of-sample continuation choices, outperforming all alternative models. These findings cast structured learning as bounded program induction in which the order of experience shapes future abstractions a learner builds.

[NLP-218] Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

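[Quick Read]: This paper addresses the limited predictability and controllability of large language model (LLM) behavior in prompt engineering, where the core challenge is the lack of a quantifiable, scalable understanding of how models interpret linguistic cues. The key to the solution is a rigorous framework based on Shapley values that quantifies the "steering effect" of individual adjectives on model performance, moving from anecdotal heuristics to principled attribution. Analyzing a range of models (including o3, gpt-4o-mini, phi-3, llama-3-70b, and deepseek-r1) on the MMLU benchmark, the study finds that a small subset of adjectives act as disproportionately powerful "levers", but their effects are not universal. Cross-model analysis reveals a "family effect": models of shared lineage show similar sensitivity profiles, while architecturally distinct models respond in a largely uncorrelated way, which challenges the idea of a one-size-fits-all prompting strategy. Follow-up studies show that the steering direction of these adjectives is not intrinsic but depends strongly on their syntactic role and position in the prompt. For larger models such as gpt-4o-mini, the paper provides the first quantitative evidence of strong non-additive interaction effects, in which adjectives can synergistically amplify, antagonistically dampen, or even reverse one another's impact; smaller models such as phi-3 respond more literally and less compositionally. This suggests that as models scale, their interpretation of prompts becomes more sophisticated but less predictable. The study highlights the tension between growing interpretive complexity and declining controllability, and the need for compositional and model-specific alignment techniques to achieve reliable control.

Link: https://arxiv.org/abs/2606.20572
Authors: Lars Malmqvist
Affiliations: Research and Implementation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for TMLR, this https URL

Click to view abstract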

Abstract:Achieving reliable control of Large Language Models (LLMs) requires a precise, scalable understanding of how they interpret linguistic cues. We introduce a rigorous framework using Shapley values to quantify the steering effect of individual adjectives on model performance, moving beyond anecdotal heuristics to principled attribution. Applying this method to 100 adjectives across a diverse suite of models (including o3, gpt-4o-mini, phi-3, llama-3-70b, and deepseek-r1) on the MMLU benchmark, we uncover several critical findings for AI alignment. First, we find that a small subset of adjectives act as disproportionately powerful “levers,” yet their effects are not universal. Cross-model analysis reveals a “family effect”: models of a shared lineage exhibit correlated sensitivity profiles, while architecturally distinct models react in a largely uncorrelated manner, challenging the notion of a one-size-fits-all prompting strategy. Second, focused follow-up studies demonstrate that the steering direction of these powerful adjectives is not intrinsic but is highly contingent on their syntactic role and position within the prompt. For larger models like gpt-4o-mini, we provide the first quantitative evidence of strong, non-additive interaction effects where adjectives can synergistically amplify, antagonistically dampen, or even reverse each other’s impact. In contrast, smaller models like phi-3 exhibit a more literal and less compositional response. These results suggest that as models scale, their interpretation of prompts becomes more sophisticated but also less predictable, posing a significant challenge for robustly steering model behavior and highlighting the need for compositional and model-specific alignment techniques.
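
The Shapley attribution mentioned in the abstract can be illustrated with a minimal sketch. It computes exact Shapley values for a tiny set of adjectives by enumerating all subsets. The adjectives, the `toy_accuracy` value function, and its numbers are hypothetical placeholders, not the paper's data or code; with 100 adjectives, exact enumeration is infeasible, so some form of approximation would presumably be needed.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every subset of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_accuracy(adjectives):
    """Hypothetical benchmark accuracy for a prompt containing these adjectives."""
    score = 0.60
    if "careful" in adjectives:
        score += 0.04
    if "brief" in adjectives:
        score -= 0.02
    if "careful" in adjectives and "brief" in adjectives:
        score += 0.02  # non-additive interaction between the two adjectives
    return score


adjectives = ["careful", "brief", "expert"]
phi = shapley_values(adjectives, toy_accuracy)
for name in adjectives:
    print(f"{name}: {phi[name]:+.4f}")
```

In this toy setup the interaction bonus is split equally between the two interacting adjectives, and the values sum to the gap between the full prompt and the adjective-free prompt, which is the efficiency property of Shapley values.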

[NLP-219] Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices

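[Quick Read]: This paper addresses context redundancy in agent-driven question answering (QA) applications, where large language models (LLMs) using retrieval-augmented generation (RAG) receive redundant context because of the inherent noise in retrieval results and the coarse granularity of document-level retrieval. As a result, the agent prompt, which consists of the user query and the associated retrieved context, incurs unnecessary computational overhead during inference. Existing prompt compression methods mostly rely on auxiliary small language models (SLMs) to estimate context importance, but this adds significant memory and computational overhead and limits deployment on resource-constrained edge devices. The paper proposes CORE, a two-stage sentence-level prompt compression method that requires no SLM. In the first stage, it constructs an answer set via named entity recognition (NER) and a clue set via semantic matching. In the second stage, it refines the clue set with an orthogonal residual retrieval strategy and filters the answer set with a spatial proximity-based metric, then combines the two sets into the compressed context. Experiments show that, within a 2000-token budget, CORE improves accuracy by at least 30.19% over state-of-the-art baselines, reduces memory usage by at least 50.47%, and achieves at least a 1.94x speedup on the edge device. Compared with the LLMLingua2 method, it reduces energy consumption on a smartphone by 95.74%, supporting its practicality and generalizability for mobile deployment.

Link: https://arxiv.org/abs/2606.20571
Authors: Zihuai Xu,Ruofei Hou,Yang Xu,Hongli Xu,Yunming Liao,Ying Zhu
Affiliations: University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract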

Abstract:In agent-driven question answering (QA) applications, retrieval-augmented generation (RAG) is commonly introduced to enhance the response accuracy of large language models (LLMs) by providing additional context. Due to the inherent noise in retrieval results and the coarse granularity of document-level retrieval, the retrieved context often contains substantial redundant information. In this setting, the agent prompt, consisting of the user query and the associated retrieved context, leads to unnecessary computational overhead during LLM inference. Existing prompt compression methods typically rely on auxiliary small language models (SLMs) to estimate context importance. However, such approaches introduce significant memory and computational overhead, which limits their deployment on resource-constrained edge devices. In this paper, we propose CORE, a two-stage sentence-level prompt compression method that eliminates the need for SLMs. In the first stage, CORE constructs an answer set via named entity recognition (NER) and a clue set via semantic matching. In the second stage, CORE refines the clue set using an orthogonal residual retrieval strategy and designs a spatial proximity-based metric to filter the answer set. The two sets are then combined to form the final compressed context. We implement CORE on an NVIDIA Jetson AGX Orin edge device and a Huawei Nova smartphone. Experimental results demonstrate that within a 2000-token budget, CORE improves accuracy by at least 30.19% compared to state-of-the-art baselines, while reducing memory usage by at least 50.47% and achieving at least 1.94 times speedup on the edge device. Moreover, compared to the state-of-the-art LLMLingua2 method, CORE achieves a substantial energy reduction of 95.74% on the smartphone, highlighting its practicality and generalizability for mobile deployments.

[NLP-220] Sexualised synthetic personas encode and amplify gendered power asymmetries through voice INTERSPEECH2026

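[Quick Read]: This paper addresses the entrenchment and distortion of gender performance in current commercial voice generation systems, in particular how their design perpetuates and reinforces binary gender norms, heteronormativity, and the structural oppression of women and LGBTQ+ people. The study finds that although generative voice AI could offer diversity and empowerment in gender expression, in practice it presents a highly narrow range of gender performance. The key to its approach is a Feminist HCI perspective combined with a listening experiment, quantitative adjective selection, qualitative free-text responses, and acoustic analysis, which together reveal how commercial voice systems present gender-coded voices under different script content (sexualised scripts or neutral text): female-coded voices are more often described as sexualised and submissive, while male-coded voices are more often associated with dominance and positive traits, exposing how these systems reproduce mainstream gender ideology.

Link: https://arxiv.org/abs/2606.21366
Authors: Alice Ross,Ariadna Sanchez,Elin Kanhov,Catherine Lai,Eva Szekely
Affiliations: University of Edinburgh; KTH Royal Institute of Technology
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2026

Click to view abstract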

Abstract:This work examines sexualised AI-generated English-speaking voices offered by a popular commercial platform. New technologies may enable sexual empowerment and greater diversity in gender expression, yet toxic masculinity, heteronormativity, and the abuse of women and LGBTQ+ people remain pervasive online. Drawing on a Feminist HCI perspective, we examine how commercial voice AI systems reproduce and circulate particular performances of gender. We conducted a listening experiment with a diverse group of listeners, combining quantitative adjective selection, qualitative free-text responses, and acoustic analysis. Participants evaluated male- and female-coded voices presented with either sexualised scripts or neutral text. Results reveal a narrow range of gender expression, largely binary and heteronormative. Female-coded voices are more frequently described using sexualised and submissive terms, while male-coded voices are more often associated with dominance and positive traits.

[NLP-221] An Evaluation Framework for Text-to-Speech Voice Reconstruction INTERSPEECH2026

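[Quick Read]: This paper addresses the limitations of evaluation methods for voice reconstruction, in particular how to measure the intelligibility and speaker-identity preservation of reconstructed voices when people with speech disorders use text-to-speech (TTS). Conventional evaluation relies on the Mean Opinion Score (MOS), which has limited sensitivity and reliability. The paper proposes a framework that combines subjective and objective evaluation. On the subjective side, it uses Best Worst Scaling (BWS) with situational framing to assess perceived intelligibility and speaker identity more precisely. On the objective side, it shows that standard measures fail to predict reconstruction success for highly unintelligible speakers and introduces a novel dual-reference distributional measure that quantifies the trade-off between intelligibility and speaker identity. Evaluated on 193 speakers and 17 zero-shot TTS systems, the framework proves more reliable and task-aligned, offering a more comprehensive approach to assessing voice reconstruction.

Link: https://arxiv.org/abs/2606.21343
Authors: Ariadna Sanchez,Christoph Minixhofer,Korin Richmond,Ondrej Klejch,Peter Bell,Simon King
Affiliations: The Centre for Speech Technology Research, University of Edinburgh, UK
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at Interspeech 2026

Click to view abstract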

Abstract:Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and speaker similarity, but this has limited sensitivity and reliability. We propose an evaluation framework with subjective and objective components. Subjectively, we evaluate perceived intelligibility and speaker identity using Best Worst Scaling (BWS) with situational framing. Objectively, we demonstrate that standard measures fail to predict reconstruction success for highly unintelligible speakers, so we introduce a novel dual-reference distributional measure to assess the trade-off between intelligibility and speaker identity. By evaluating the output of 17 zero-shot TTS systems for 193 speakers, we show that our framework provides a reliable and task-aligned approach for assessing voice reconstruction.

Information Retrieval

[IR-0] Improving Long-Context Retrieval with Multi-Prefix Embedding

Link: https://arxiv.org/abs/2606.23642
Authors: Zhenglin Yu,Xueguang Ma,Shengyao Zhuang,Zhichao Xu,Luyu Gao,Crystina Zhang,Jimmy Lin
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Long-context retrieval exposes a tension: single-vector embeddings lose fine-grained detail, while token-level multi-vector methods incur prohibitive storage. We propose Multi-Prefix Embedding (MPE), which partitions a document into chunks separated by EOS tokens, encodes the full sequence in a single causal forward pass, and extracts one embedding at each prefix boundary. MPE retains cross-chunk context, enables chunk-level MaxSim matching, and trains with only document-level relevance labels. Experiments on MLDR-en, BrowseComp-Plus, and LongEmbed show that MPE is competitive with or outperforms single-vector, independent-chunk, and multi-vector baselines, while providing a natural source attribution mechanism for locating evidence chunks.

[IR-1] Multi-Vector Embeddings are Provably More Expressive than Single Vector Embeddings

Link: https://arxiv.org/abs/2606.23475
Authors: Rajesh Jayaram
Categories: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Multi-vector (MV) embeddings have become a powerful paradigm in neural information retrieval (IR), achieving high retrieval accuracy by representing data with multiple vectors and scoring them via the non-linear Chamfer similarity. Despite their widely perceived superiority over single-vector (SV) embeddings, which use inner product similarity, to date there is no formal proof that SV similarities cannot approximate MV similarities with the same representation size. Specifically, we ask the following: for any bounded dataset size $n \leq 2^{\mathrm{poly}(m)}$, what is the smallest dimension $D$ so that given any collection of MV embeddings $Q_1,\dots,Q_n,X_1,\dots,X_n \subset \mathbb{R}^d$ containing at most $m$ vectors each, there always exist $q_1,\dots,q_n, d_1,\dots,d_n \in \mathbb{R}^D$ satisfying $|\langle q_i, d_j \rangle - \texttt{Chamfer}(Q_i,X_j)| \leq \epsilon$ for all $i,j$? Recently, the MUVERA algorithm demonstrated that $D = m^{O(1/\epsilon^2)}$ is possible. If improved to $D = md$, this would imply that MV embeddings are no more expressive than SV embeddings. In this paper, we rule out this scenario. Specifically, we prove the existence of a collection of MV embeddings in $\mathbb{R}^d$, each containing at most $m$ vectors, which require single-vector dimension of $D = (\epsilon^2 m)^{\Omega(1/\epsilon)}$ to approximate, establishing a strong separation in representation size between MV and SV embeddings. Our proof leverages the Pattern Matrix Method by constructing a hard instance whose Chamfer similarity matrix encodes the $\mathrm{NAND}_k$ boolean function. Our results confirm a long-held belief in the IR community: at a fixed representation size, multi-vector embeddings can express similarities which cannot even be approximately represented by single vector embeddings.
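
As a rough illustration of the scoring function discussed in the abstract, the sketch below computes Chamfer similarity in its common sum-of-max form: each query vector is matched to its best document vector by inner product, and the matches are summed. The normalization convention and the toy vectors are assumptions, and sum pooling is used only as a naive single-vector baseline for contrast, not as a method from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def chamfer(Q, X):
    """Sum over query vectors of the best inner product with any document vector."""
    return sum(max(dot(q, x) for x in X) for q in Q)


def sum_pool(vectors):
    """Naive single-vector summary: coordinate-wise sum of the set (illustrative only)."""
    return [sum(coords) for coords in zip(*vectors)]


# Toy example: the document holds two opposite vectors.
Q = [(1.0, 0.0)]
X = [(1.0, 0.0), (-1.0, 0.0)]

mv_score = chamfer(Q, X)                  # best match is (1, 0)
sv_score = dot(sum_pool(Q), sum_pool(X))  # the opposite vectors cancel out
print(mv_score, sv_score)
```

The max inside the sum is what makes the score non-linear: the pooled single vector loses the one document vector that matches the query, while the multi-vector score keeps it.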

[IR-2] Analysis of Autonomic Regulation in Cancer Survivors During Daily Physical Activity: A Real-World Wearable ECG Study

Link: https://arxiv.org/abs/2606.23461
Authors: Sajad Farrokhi,Lerick Sequeira,Shanna L. Burke,Waltenegus Dargie,Christian Poellabauer
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:This study investigates heart rate (HR) and heart rate variability (HRV) responses to physical activity in breast cancer survivors using wearable electrocardiogram (ECG) data collected in real-world settings. Reliable HRV analysis in such environments is challenging due to motion artifacts and activity-related signal degradation. To address this, we use an approach that combines accelerometer and gyroscope data for activity intensity segmentation (light, moderate, vigorous) with a robust ECG processing pipeline incorporating R-peak detection and annotation-free signal quality assessment. Because vigorous activity produced unreliable HRV estimates, analyses focused on light and moderate activity levels. Using 30 s, 1 min, and 2 min windows, HR and HRV metrics were computed and compared between breast cancer survivors and healthy controls. Cancer survivors consistently exhibited elevated HR and reduced HRV across activity levels. During light activity, HR increased from 95.7 bpm in controls to 103.4 bpm in cancer survivors. Differences became more pronounced during moderate activity, where RMSSD decreased from 39.7 ms to 22.1 ms and SDNN from 42.6 ms to 25.1 ms. Statistical analyses showed significant group differences with strong and consistent effects across observations. In addition, the proposed ECG quality assessment framework reliably identified high-quality signal segments, achieving near-perfect valid RR ratios (0.99) without manual annotations. Overall, these findings demonstrate impaired and activity-dependent autonomic regulation in cancer survivors and highlight the importance of motion-aware activity segmentation and robust ECG quality control for accurate physiological monitoring in real-world wearable settings.
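
For readers unfamiliar with the HRV metrics named in the abstract, the sketch below shows the textbook time-domain definitions of RMSSD and SDNN over a window of RR intervals. The RR values are made up for illustration. Deriving HR from the mean RR interval and using the sample standard deviation for SDNN are common conventions assumed here, and the paper's own pipeline (R-peak detection, quality assessment, windowing) is not reproduced.

```python
from math import sqrt
from statistics import mean, stdev


def heart_rate_bpm(rr_ms):
    """Mean heart rate in beats per minute from RR intervals in milliseconds."""
    return 60000.0 / mean(rr_ms)


def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return sqrt(mean([d * d for d in diffs]))


def sdnn(rr_ms):
    """Standard deviation of the RR intervals in the window (ms)."""
    return stdev(rr_ms)


# Hypothetical RR intervals (ms) for one short analysis window.
rr = [600, 620, 580, 610, 590]
print(f"HR    = {heart_rate_bpm(rr):.1f} bpm")
print(f"RMSSD = {rmssd(rr):.2f} ms")
print(f"SDNN  = {sdnn(rr):.2f} ms")
```

RMSSD tracks beat-to-beat changes while SDNN tracks overall spread within the window, which is why the two can move by different amounts between activity levels.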

[IR-3] URecJPQ: Memory-efficient Multimodal Recommendation Models through RecJPQ in Large-Scale Scenarios

Link: https://arxiv.org/abs/2606.23291
Authors: Giuseppe Spillo,Zixuan Yi,Aleksandr Petrov,Cataldo Musto,Craig Macdonald,Iadh Ounis
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Training state-of-the-art recommendation models on large-scale industrial datasets can be a challenging task due to the high number of users and items which are typically represented through ID embeddings. Such embeddings typically require a large amount of memory resources, which are not always available. This problem is further exacerbated in multimodal recommendation, in which multimodal item features generally improve recommendation performance, but require more resources to encode. In this paper, we introduce URecJPQ, a Joint Product Quantization method specifically designed for large-scale and multimodal top-k recommendation tasks, in which the vast number of users and items, combined with the available modalities, further increases the memory demands for the computation. The core idea is to represent each user/item not as a fully learned, unique embedding, but rather as a concatenation of shared learned sub-embeddings, thereby significantly reducing the total number of trainable parameters. Our experiments on three widely-used datasets across different domains (movies, baby and sports products) show that URecJPQ can be effectively applied to multimodal recommendation settings. In large scale scenarios, we observe a substantial reduction in checkpoint sizes and the number of trainable parameters (ranging from 86% to 98%, and 98% to 99%, respectively), with only a marginal decrease in accuracy (8.5% on recall and 16% on NDCG, on average), and, in some cases, even performance improvements (up to 85%), as in the baby products domain. Our codebase is available at this https URL.
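
The core idea in the abstract, representing each user or item as a concatenation of shared sub-embeddings, can be sketched as follows. All sizes, the random code assignment, and the function names are hypothetical and chosen only to make the parameter arithmetic visible; how URecJPQ actually assigns codes and handles multimodal features is not described in the abstract and is not modeled here.

```python
import random


def build_codebooks(num_splits, codes_per_split, sub_dim, seed=0):
    """One shared table of sub-embeddings per split (these are the trainable weights)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(sub_dim)] for _ in range(codes_per_split)]
        for _ in range(num_splits)
    ]


def item_embedding(codes, codebooks):
    """Concatenate the sub-embedding selected by each of the item's codes."""
    emb = []
    for split, code in enumerate(codes):
        emb.extend(codebooks[split][code])
    return emb


def full_table_params(num_items, dim):
    """Trainable weights when every item owns a unique embedding."""
    return num_items * dim


def pq_params(num_splits, codes_per_split, sub_dim):
    """Trainable weights when items only index into shared codebooks."""
    return num_splits * codes_per_split * sub_dim


# Hypothetical sizes: 1M items, 64-dim embeddings split into 8 parts of 8 dims.
NUM_ITEMS, SPLITS, CODES, SUB_DIM = 1_000_000, 8, 256, 8
codebooks = build_codebooks(SPLITS, CODES, SUB_DIM)

# Hypothetical code assignment for one item (random here, purely for illustration).
rng = random.Random(42)
codes = [rng.randrange(CODES) for _ in range(SPLITS)]
emb = item_embedding(codes, codebooks)

print(len(emb), full_table_params(NUM_ITEMS, SPLITS * SUB_DIM), pq_params(SPLITS, CODES, SUB_DIM))
```

Under these assumed sizes, each item stores only 8 small integer codes, and the number of trainable weights no longer grows with the number of items.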

[IR-4] Ranking Companion: A Visual Analytics Approach to Item-Based Ranking with Hybrid Item Selection

Link: https://arxiv.org/abs/2606.23263
Authors: Aman Kumar,Maximilian Tornow,Michaela Benk,Ibrahim Al-Hazwani,Jürgen Bernard
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 8 pages, 4 figures, supplementary material and video guide available online

Click to view abstract

Abstract:Personalizing item ranking creation is a challenging task, especially when users lack knowledge of data attributes or the ability to express and formalize their attribute preferences. Item-based ranking creation is an approach allowing users to directly externalize preferences through known-item judgments rather than attribute-based scoring. However, a core challenge of item-based ranking is identifying and selecting representative candidate items for externalizing preferences. Existing approaches rely on singular item-selection methods, limiting flexibility and user control. To address this challenge, we present Ranking Companion, a visual analytics approach for item-based ranking that combines model-driven active learning with human-driven item-selection methods. By drawing from six complementary item-selection methods, users can externalize listwise preferences based on selected candidate items, while an iterative machine learning process with a ranking model calculates ranking results, presented to users alongside explanations for interpretation. We evaluated Ranking Companion in a formative user study with 10 participants, in which participants used each item-selection method across three iterations, revealing tradeoffs in perceived ranking quality across accuracy, diversity, novelty, transparency, control, and satisfaction. Ranking Companion contributes a unified interactive item selection space and provides preliminary empirical guidance toward the hybrid use of multiple complementary item-selection methods in personalized item-based ranking creation.

[IR-5] The Correct Answer Trap: Pedagogically-Grounded Detection and Feedback for Hidden Misconceptions

Link: https://arxiv.org/abs/2606.23205
Authors: Moiz Imran,Sahan Bulathwela
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the AIED PEAF 2026: Workshop on Pedagogical Evaluation of Automated Feedback, June 28, 2026, Seoul, South Korea

Click to view abstract

Abstract:Automated feedback systems that rely on answer correctness will reinforce, rather than address, misconceptions when students reach the correct answer through flawed reasoning. We investigate automatic detection of these hidden misconceptions using 20,964 real student responses from the Eedi mathematics platform. Fine-tuned classifiers detect only 57% of these hidden misconceptions, and standard ML interventions do not improve on this. An open-weight reasoning model detects 84%, but at realistic prevalence, false alarms outnumber genuine detections roughly 8 to 1. We present a graduated assessment rubric that separates answer correctness from method validity, and propose a detect-verify-escalate pipeline that routes uncertain cases to diagnostic follow-up questions rather than directly to teachers. Two deployment modes adapt the pipeline: a teacher dashboard where the system filters a review queue, and an autonomous tutor where flags trigger low-cost formative follow-up.

[IR-6] The Language Blind Spot: How Query Language and Brand Recognition Tier Shape AI-Constructed Brand Reputation Across Twelve European Languages

Link: https://arxiv.org/abs/2606.23165
Authors: Dmitrij Żatuchin (Estonian Entrepreneurship University of Applied Sciences (EUAS), Tallinn, Estonia; Rankfor.AI, Tallinn, Estonia)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 17 pages, 3 figures. Data and analysis code on Zenodo, this https URL

Click to view abstract

Abstract:Large language models (LLMs) increasingly mediate how people form impressions of organisations, yet most monitoring is done in English, assuming an English query returns a representative picture. We measure how far that holds. We queried three grounded LLMs (GPT-5.4, Gemini 3.1 Pro, Perplexity Sonar Pro) about 66 brands from eleven Northern, Baltic, and Central European markets, in twelve languages across four families (Germanic, Uralic, Baltic, Slavic), generating 35,640 responses. Multilingual embeddings (BGE-M3) allow cross-language comparison without translation. Three results emerge. First, AI-constructed reputation is language-bound: mean cross-language cosine similarity is 0.825, same-family responses are more similar than cross-family (0.844 vs 0.820; d = 0.31), and sentiment varies by language (F = 268.5, eta^2 = 0.077), with Uralic and Baltic languages most positive and Germanic, including English, most critical; clustering recovers the Slavic and Baltic families (cophenetic 0.915). Second, query language shifts which brands are recommended far more than how they are described: moving from an English query to a brand’s home language raises recommendation share by 0.80 for local champions but only 0.15 for global multinationals (t = -8.84, p < 0.001), with no comparable reversal in sentiment. An English-only audit therefore understates a local champion’s AI visibility. Third, response stability varies more with model choice than with language (eta^2_model = 0.32 vs eta^2_language = 0.01, on a five-iteration replication over a 20-brand subset). These results indicate that English-only AI reputation monitoring leaves a measurable language blind spot, concentrated in the visibility of locally headquartered brands.

[IR-7] Who Owns the AI Recommendation? A Multi-Industry Empirical Map of Brand Category Ownership Across Large Language Models

Link: https://arxiv.org/abs/2606.23057
Authors: Dmitrij Żatuchin
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 21 pages, 4 figures, 7 tables. Under review at Journal of Marketing Analytics (Palgrave Macmillan). Data and analysis code on Zenodo, this https URL

Abstract:Large language models now mediate how buyers discover products and services, making the competitive structure of AI-generated recommendations a strategic concern for brands. A basic question has lacked large-scale empirical answers: in a given category, which brand does a model recommend, and how concentrated is that ownership? Across 3,750 responses spanning 50 brands, five industries, and 250 brand-free category queries on three models (GPT-5.2, Google Gemini 3 Flash, and Perplexity sonar-pro), each query repeated five times under a dice-roll stability protocol, we propose three exploratory metrics: the Category Ownership Index (COI), a brand’s share of mentions within a category; the Competitive Vacuum Index (CVI), flagging categories with no single leader; and the Displacement Score (DS), quantifying asymmetric substitution between brand pairs. In this sample, recommendation concentration was moderate: the mean Gini coefficient was 0.28 (95% CI [0.16, 0.41]), below the 0.60 power-law threshold we set. Competitive vacuums were rare, appearing in 8.0% of queries, so the models named at least one sampled brand in most cases. Cross-model agreement on the top-recommended brand was 41.6%: a top position on one model did not reliably hold on another. Displacement was industry-dependent, from co-recommendation in consulting (0.4:1) to one-directional substitution up to 4.3:1, with an unweighted mean of 2.4:1 across the five industries. A BERTopic check placed only 4.2% of discovered topic clusters outside the original categories. Within the scope studied, these results sit in tension with a strong winner-takes-all narrative around AI recommendation, and the three metrics offer a candidate, reproducible procedure for competitive-intelligence analysis that future work can validate.
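
The two quantities the abstract leans on most, a brand's share of mentions within a category (the COI) and the Gini coefficient used to summarize concentration, can be sketched as follows. This is a minimal illustration under assumptions: the brand names and mention list are made up, and the abstract does not say which Gini estimator the author used, so the standard rank-based formula is assumed here.

```python
from collections import Counter

def category_ownership_index(mentions):
    """Share of mentions per brand within one category (shares sum to 1)."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {brand: n / total for brand, n in counts.items()}

def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal shares).

    Assumed estimator: the standard rank-based formula over sorted values.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical brand mentions collected for one category query.
mentions = ["BrandA", "BrandA", "BrandB", "BrandA", "BrandC", "BrandB"]
coi = category_ownership_index(mentions)
concentration = gini(coi.values())
print(coi, round(concentration, 3))
```

Under this estimator a category where every brand has the same share scores 0, and the score rises as mentions pile up on one brand, which is the sense in which the reported mean of 0.28 reads as moderate concentration.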

[IR-8] LLM-as-a-Judge for Reliable and Explainable Offline Evaluation in Top-K Recommendation KDD2026

Link: https://arxiv.org/abs/2606.22961
Authors: Yue Que, Junyi Zhou, Xiaokun Zhang, Haiming Jin, Qiao Xiang, Chen Ma
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by KDD 2026

Abstract:Recommendation evaluation plays a crucial role in guiding the refinement and deployment of recommender systems. Most existing trials rely on offline evaluation using Top-K metrics computed over holdout user behaviors. However, we identify two fundamental limitations that undermine their ability to deliver reliable and explainable evaluations. Regarding reliability, offline evaluation treats observed user feedback as a proxy of true preferences and enforces rigid ID matching between the proxy and recommendation. In practice, feedback collections are inherently shaped by incomplete and biased item exposure, leading to distorted and unreliable assessments. Regarding explainability, Top-K metrics only establish numerical scores without offering meaningful insights to support them, thereby reinforcing the black-box nature of offline evaluation. In this paper, we propose a reliable and explainable LLM-as-a-Judge framework for offline recommendation evaluation. To enhance reliability, we introduce a semantic proxy from user textual behaviors to represent their true preferences. This proxy allows for more flexible matching between preferences and recommendations in the semantic space, rather than depending on the holdout feedback. To ensure explainability, the LLM Judge adopts a reasoning-then-scoring process to generate relevance judgments along with explicit rationale. Finally, we aggregate the individual scores into global Top-K metrics to quantify overall recommendation quality, and provide justification for each preference hit or miss. Extensive experiments demonstrate that the LLM Judge achieves solid reliability, explainability, and robustness in evaluation. 
Related DOI: https://doi.org/10.1145/3770855.3818169

[IR-9] Trajectory-Based Recommender Systems as Control Systems

Link: https://arxiv.org/abs/2606.22957
Authors: Eriam Schaffter (UCBL), Ahmed Bounekkar (UCBL), Elsa Negre
Subjects: Information Retrieval (cs.IR)
Comments: ICAT2025, Nov 2025, Marrakech, Morocco

Abstract:Recommender Systems (RS) are a key research domain and play an increasing role in our content-overwhelmed lives. In this paper, we explore Trajectory-Based Recommender Systems (TBRS), a subfield for which many related studies exist, yet still lacking a common framework. We argue that Control Theory provides an appropriate foundation for formalizing and solving TBRS problems. TBRS, sometimes named Long Term goal Recommender Systems, share core principles with classical RS, but at their core lies the concept of a trajectory, a defining element that makes these systems a singular category. To date, most RSs that include a notion of goal or long-term objective, when this goal is explicit, have not been recognized as having specific characteristics that make them worth regrouping under a dedicated field of research. We review related work, observe how they differ from already conceptualized RSs, and sketch the foundations of a possible theoretical framework based on control theory. Finally, we show how Educational Recommender Systems (ERS), intrinsically long-term and goal-driven, can be modeled within the proposed TBRS framework.

[IR-10] Graph-Enhanced Large Language Models for Spatial Search

Link: https://arxiv.org/abs/2606.22909
Authors: Nicole R. Schneider, Kent O’Sullivan, Hanan Samet
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:There have been many recent improvements in the ability of Large Language Models (LLMs) to perform complex tasks and answer domain-specific questions through techniques like Retrieval Augmented Generation (RAG). However, reasoning abilities of LLMs, including spatial reasoning abilities, are still lacking. Spatial reasoning is a key component required to answer questions in a variety of domains that are grounded in the physical world, including urban planning, civil engineering, travel, and many others. To advance the development of LLMs and facilitate an impact in these domains, new research techniques must be developed to enable LLMs to reason over spatial data, which is commonly stored in the form of a graph. In this paper we outline the challenges associated with spatial reasoning through LLMs and envision a future in which search engines integrate with LLMs to answer complex spatial questions through graph-enhanced reasoning.

[IR-11] Towards Fast Domain Adaptation and Fine-Grained User Simulation for Evaluating Conversational Recommender Systems

Link: https://arxiv.org/abs/2606.22803
Authors: Yuanzi Li, Quanyu Dai, Xueyang Feng, Zihang Tian, Junhao Wang, Xu Chen, Zhenhua Dong, Huifeng Guo
Subjects: Information Retrieval (cs.IR)

Abstract:Conversational Recommender Systems (CRSs) enhance user experience through multi-turn interactions, yet evaluating their performance remains challenging. While Large Language Model (LLM) based user simulators are effective, they suffer from three key limitations: (1) Lack of Domain Adaptability: Reliance on fixed prompts and predefined action spaces hinders transfer to novel domains; (2) Limited User Modeling: Inability to accurately replicate subtle linguistic styles and dynamic preferences; (3) Insufficient Evaluation Validity: Existing simulators fail to adequately assess fundamental capabilities and system robustness. To overcome these, we propose AdaptSim, an Adaptive domain and automatic prompt tuning User Simulator. AdaptSim offers an efficient framework for evaluating CRSs by enabling realistic behavior modeling and diverse style generation. It leverages automatic prompt generation and an open action mechanism to reduce manual effort and improve cross-domain flexibility. For response generation, we employ controlled text generation with a “think-then-respond” strategy for fine-grained control over language style. For CRS evaluation, AdaptSim incorporates a novel Breadth-First Search (BFS)-based, turn-level pairwise comparison framework for comprehensive assessment. Extensive experiments across three domains and four LLMs demonstrate that AdaptSim generates realistic dialogues, enabling a highly effective and reliable evaluation of CRS capabilities and robustness.

[IR-12] Breaking the Evaluation Paradox: Evaluating High-Entropy Search with Computationally Irreducible Constraints ACL2026

Link: https://arxiv.org/abs/2606.22783
Authors: Juntao Wu, Wei Wen, Xianting Huang, Shuai Pang, Ruizhi Qiao, Xing Sun, Ke Wang
Subjects: Information Retrieval (cs.IR)
Comments: 25 pages, 5 figures, Accepted at ACL 2026

Abstract:Evaluating the exhaustive search capabilities of large language models (LLMs) is plagued by a fundamental paradox: verifying completeness requires complete ground truth, yet high-entropy enumeration tasks make such ground truth impossible for humans to create. This causes benchmarks to systematically penalize models for outperforming their human annotators. Despite rapid progress in web-search and deep research agents – which now issue hundreds of queries, traverse diverse sites, and synthesize long reports – evaluation still largely relies on partially annotated answer sets, LLM-based judges, or single-answer questions that avoid genuinely exhaustive search scenarios. We break this paradox by shifting the evaluation paradigm from simulating a messy reality to constructing computationally pure challenges. We introduce VERITAS (Verifiable Traversal Assessment for Search), a framework built on the principle of computationally irreducible constraints. By introducing novel, non-optimizable constraints, we create verifiable, sparse-answer search tasks that are computationally equivalent to exhaustive enumeration. These constraints are easy to verify but impossible for LLMs or search engines to optimize, forcing agents to genuinely traverse the entire search space. VERITAS can automatically generate a virtually infinite number of test cases with perfect ground truth and precise difficulty control, with marginal instance cost dominated by hash computations. This provides not only a robust benchmark for evaluating systematic exploration under uncertainty but also a scalable method for generating training data to improve these crucial, yet underdeveloped, capabilities.

[IR-13] HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Link: https://arxiv.org/abs/2606.22778
Authors: Yuichi Tateno
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 48 pages. Code and leaderboard: this https URL this https URL

Abstract:With the rapid spread of retrieval-augmented generation and semantic search, choosing the right embedding and retrieval configuration is increasingly hard. Large retrieval benchmarks are comprehensive but too heavy to rerun during development, and there is little infrastructure for comparing production settings–dimensionality reduction, quantization, reranking–across many models under identical conditions. We present HAKARI-Bench, a lightweight benchmark that reconstructs existing retrieval suites into small datasets (Nano-sets): 35 benchmarks and 551 tasks across 43 languages in a unified format, enabling same-condition, model-agnostic comparison of five retrieval families (BM25, dense, sparse, late interaction, rerankers) and their efficiency variants. Across 55 models, its overall ranking reproduces the official MTEB retrieval v2, MMTEB v2 retrieval, and English BEIR (full) at Spearman 0.97. HAKARI-Bench does not replace full evaluation; it enables rapid model selection, regression detection, and reading the quality-efficiency Pareto frontier. Code, data, and leaderboard are released under the MIT license.

[IR-14] PA-User: Simulating Trust and Verification under AI-Generated Content

Link: https://arxiv.org/abs/2606.22738
Authors: Saber Zerhoudi
Subjects: Information Retrieval (cs.IR)

Abstract:Most users of online information now assume that some of what they read has been written, edited, or selected by an AI model. Hybrid cases are the hardest to tell apart: human prose rewritten by a language model, AI-curated lists presented as editorial, retrieval-augmented answers composed on the fly from human sources. Users cannot reliably distinguish these cases, and the ongoing cost of checking what is genuine has become part of how they search. Current user simulators in information retrieval do not model this. We propose PA-User, a user simulator with three new components: a detection-effort budget that is spent on verification and recovers between sessions; a trust component that holds a separate Beta belief over the factuality of each source class (domain by provenance) and updates from observed outcomes; and a decision rule that picks accept, verify, or discard for each result, conditional on current trust, current effort, and per-domain stakes. We state two verification-and-validation (V&V) properties of the framework. The trust posterior converges to the true class factuality (face validity). Each component’s contribution to any observable can be isolated by ablation (structural validity). On the HC3 corpus (85,449 paired human and ChatGPT answers in five domains), PA-User reaches a trust-calibration error of 0.162, against 0.356 for any configuration without the trust component. PA-User reduces high-stakes regret from 0.171 to 0.122 (29% relative) against an always-accept ablation, and verifies 34.5% of results, half the rate of an ablation with no effort budget. Each single-mechanism ablation isolates one component, which makes the framework individually diagnosable.
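
The trust component and decision rule described above can be sketched roughly as follows. The Beta update is the standard conjugate one, but everything else is an assumption made for illustration: the uniform prior, the `decide` thresholds, the `verify_cost` parameter, and the example source class are hypothetical, since the abstract does not give the paper's actual rule.

```python
from dataclasses import dataclass

@dataclass
class BetaTrust:
    """Beta(alpha, beta) belief over the factuality of one source class."""
    alpha: float = 1.0  # assumed uniform prior, not taken from the paper
    beta: float = 1.0

    def update(self, was_factual: bool) -> None:
        # Conjugate update: each observed outcome adds one pseudo-count.
        if was_factual:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

def decide(trust: float, effort_left: float, stakes: float,
           verify_cost: float = 1.0) -> str:
    """Hypothetical threshold rule: accept, verify, or discard one result."""
    accept_threshold = 0.5 + 0.4 * stakes  # illustrative: higher stakes need more trust
    if trust >= accept_threshold:
        return "accept"
    if effort_left >= verify_cost:
        return "verify"  # spend budget only when trust is insufficient
    return "discard"

# One belief per source class, keyed by (domain, provenance); the key is made up.
beliefs = {("medicine", "ai_generated"): BetaTrust()}
belief = beliefs[("medicine", "ai_generated")]
for outcome in [True, True, False, True]:
    belief.update(outcome)

print(round(belief.mean, 3))
print(decide(belief.mean, effort_left=2.0, stakes=0.8))
```

With enough observations the posterior mean approaches the observed factual fraction for that class, which matches the convergence property the abstract calls face validity.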

[IR-15] VISTA Architect: A graph database-oriented health AI system demonstrated in multidisciplinary tumor boards

Link: https://arxiv.org/abs/2606.22692
Authors: Tuomo Kiiskinen, Jason Fries, Philip Adamson, David Wu, Timothy John Ellis-Caleo, Aaron Fanous, Balasubramanian Narasimhan, Joel Neal, Sylvia Plevritis, Manuel A. Rivas
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 22 pages, 4 figures, 6 tables; includes Supplementary Information. Code: this https URL (tag v0.1.0-preprint, commit 8837d44)

Abstract:We introduce VISTA Architect, a database-oriented AI architecture for integrating large language models (LLMs) with longitudinal electronic health records (EHRs). At ingestion, it transforms complex clinical documentation into a persistent, provenance-linked knowledge graph, eliminating repeated reprocessing of raw records at query time. The architecture has two layers: a source-faithful MEDS Graph preserving granular EHR structure with full provenance, and a clinically abstracted Timeline Object Architecture (TOA) that uses graph-guided LLM extraction to synthesize a concise timeline of deduplicated, temporally coherent clinical events. This addresses key limitations of direct long-context prompting and retrieval-augmented generation (RAG), which often miss temporal relationships and incur high cost and latency from repeated raw-text processing. By precomputing clinical synthesis once, downstream queries access an organized patient state and traverse to source documentation only when detailed verification is needed. We demonstrate the system in multidisciplinary thoracic oncology tumor boards at Stanford Medicine, where precise reconstruction of patient histories is critical. Across 1,180 patients, VISTA Architect achieved 96.4% accuracy (mean 9.75/10) on 15 tumor board-salient variables (17,700 evaluations; 95% CI 96.1-96.7%), surpassing a matched BM25 RAG baseline and recent benchmarks for LLM-based clinical extraction. An agentic interface reduced preparation for a 30-patient held-out cohort to about 2.2 minutes without sacrificing accuracy. While configured here for thoracic oncology, the modular design adapts to other specialties through customizable event definitions, episode structures, and agentic tools; validation beyond thoracic oncology remains future work.

[IR-16] All Relations Lead to Rome: Automated Knowledge Graph Creation and Question Generation

Link: https://arxiv.org/abs/2606.22645
Authors: Matthijs Jansen op de Haar, Tobias Stähle, Lorenzo Gatti
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments: 10 pages, 5 figures, version one

Abstract:Large language models have substantially improved information retrieval and question answering; however, existing datasets generally support either vector-based retrieval over unstructured text or reasoning over knowledge graphs, without providing a unified representation that combines both paradigms. Moreover, current benchmarks rarely provide ground-truth entities, relations, and fact-grounded question-answer pairs aligned with the underlying corpus. To address this gap, we introduce All Relations Lead to Rome (ARLtR), a unified framework for automated knowledge graph construction and fact-grounded question-answer generation. ARLtR jointly constructs a knowledge graph, embeddings, and question-answer pairs that are explicitly grounded in extracted entities, relations, and supporting textual evidence. We further instantiate the framework as a historical dataset centered on the Roman Empire, comprising over 19,000 entities, 16,000 chunks, and 8,400 question-answer pairs (this https URL). By tightly coupling symbolic graph representations with dense retrieval representations, ARLtR facilitates the evaluation and development of hybrid retrieval systems and semantic steering approaches within a single coherent resource.

[IR-17] Music Playlist Captioning at Scale with Large Language Models ECML-PKDD2026

Link: https://arxiv.org/abs/2606.22460
Authors: Mathieu Delcluze, Léa Briand, Benjamin Chapus, Deniz Mekik, Guillaume Salha-Galvan
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: ECML-PKDD 2026

Abstract:Music streaming services such as Deezer often recommend personalized playlists to users. Playlist captioning, which involves describing these playlists in natural language, is essential for helping users understand the content behind each recommendation, yet remains challenging at scale. This paper presents the automatic playlist captioning system deployed on Deezer in 2025 to address this challenge. Leveraging recent advances in large language models (LLMs) to generate descriptive captions from diverse data sources in a controlled manner, this system now powers the Daily Mix feature, used by millions of users. This deployment has led to significant improvements in user engagement, highlighting how the semantic framing of an unchanged recommendation shapes user perception in online personalized experiences.

[IR-18] ARIA: A Causal-Aware Framework for Rescuing LLM Reasoning in Trustworthy Materials Discovery KDD2026

Link: https://arxiv.org/abs/2606.22375
Authors: Yi Cao, Liaoyaqi Wang, Jieneng Chen, Benjamin Van Durme, Alan Yuille, Paulette Clancy
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
Comments: Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract:Generative models have revolutionized the process of materials discovery, yet they often fail to satisfy underlying physical causality. Through an analysis of Large Language Models (LLMs) augmented with knowledge graphs derived from current literature, we uncover a phenomenon termed contextual tunneling, where models “over-anchor” on narrow, retrieved evidence while suppressing global physical reasoning. To address this problem, we introduce ARIA, a causal-aware framework that conditions knowledge use on mechanistic completeness. ARIA routes each query through a three-tier cascade: (i) direct causal reasoning when complete evidence chains of Process-Structure-Property (PSP) are available, (ii) physics-informed analogical transfer for sparse or novel material systems, and (iii) explicit parametric fallback when external evidence is incomplete. As a proof of concept, we construct a Knowledge Graph (KG) containing 2,839 extracted PSP relations from peer-reviewed articles in the materials literature and evaluate ARIA on forward prediction and inverse design tasks for two-dimensional (2D) materials. ARIA mitigates contextual tunneling, improves over unaugmented and naive KG-augmented baselines, and provides further gains when an online literature search is used for evidence enrichment. Crucially, ARIA produces auditable causal traces, enabling physically grounded and trustworthy AI-assisted materials discovery.

[IR-19] Novelty-Aware Agentic Retrieval: Comparing Research Contributions Through Structured Multi-Step Reasoning

Link: https://arxiv.org/abs/2606.22151
Authors: Shou-Tzu Han
Subjects: Information Retrieval (cs.IR)
Comments: 9 pages, 1 figure, 14 tables

Abstract:Scientific literature search is an information retrieval (IR) task in which ranked lists are insufficient: a researcher entering a new area needs to know not only which papers are relevant, but how they relate, where they overlap, how they differ, and what problem-method combinations are absent. Standard retrieval-augmented generation (RAG) summarizes documents independently, discarding this comparative signal. We present the Novelty-Aware Research Agent, a prototype agentic retrieval system that layers structured multi-step reasoning on a RAG pipeline through six typed-contract components: query analysis, a ReAct-style retrieval loop, relevance ranking, schema-guided contribution extraction, a three-pass comparison agent, and answer generation. Beyond returning relevant papers, it produces structured comparison artifacts: per-paper contribution records, paper-level overlaps, and a problem x method gap matrix. On a 100-paper corpus, the system supports five structured comparison capabilities that a standard RAG baseline supports none of, while remaining query-sensitive: across three main queries no paper appears in all three top-5 sets (mean pairwise Jaccard 0.12), and an extended seven-query evaluation holds the pattern across ten queries (mean Jaccard 0.115, 18 of 29 retrieved papers query-exclusive). Under author-assigned graded relevance the ranker attains mean Precision@5 1.000 and nDCG@5 0.752 on the main queries, ahead of BM25, dense, and hybrid retrieval; over ten queries Precision@5 is non-saturated at 0.980 with nDCG@5 0.739. Schema compliance is 86.7% on the main queries and 84.0% over the ten-query set, and validating 20 sampled empty gap-matrix cells yields a gap precision of 0.600. We discuss the latency-structure trade-off in agentic retrieval and identify corpus scale, author-assigned labels, and limited independent evaluation as the main limitations.
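
The evaluation quantities quoted in this abstract (mean pairwise Jaccard between top-5 sets, Precision@5, and nDCG@5 under graded relevance) can be sketched as follows. This is an illustrative sketch, not the author's evaluation code: the linear-gain nDCG variant, the rule that any non-zero grade counts as relevant, and the example grades are all assumptions.

```python
from itertools import combinations
from math import log2

def jaccard(a, b):
    """Overlap of two result sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_jaccard(result_sets):
    """Average Jaccard over all unordered pairs of result sets."""
    pairs = list(combinations(result_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def precision_at_k(grades, k):
    """Fraction of the top-k results with a non-zero relevance grade."""
    return sum(1 for g in grades[:k] if g > 0) / k

def dcg(grades):
    """Discounted cumulative gain with linear gains (an assumed variant)."""
    return sum(g / log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg_at_k(grades, k, pool=None):
    """nDCG@k; the ideal ordering is built from `pool`, or from `grades` if omitted."""
    ideal = sorted(pool if pool is not None else grades, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(grades[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded relevance (0 to 3) for one query's top-5, in ranked order.
ranked_grades = [2, 3, 1, 2, 1]
print(precision_at_k(ranked_grades, 5), round(ndcg_at_k(ranked_grades, 5), 3))
```

In the example every returned paper is relevant, so Precision@5 is 1.0, yet nDCG@5 stays below 1 because the most relevant paper is not ranked first. That is one way a perfect Precision@5 can coexist with an nDCG@5 well under 1, as in the figures reported above.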

[IR-20] A feasibility study on filtering low-accessibility web pages considering color vision deficiency

Link: https://arxiv.org/abs/2606.22095
Authors: Ryota Mizutani, Shiori Nakayama, Masateru Tsunoda
Subjects: Information Retrieval (cs.IR)
Comments: 4 pages, 3 figures, 3 tables

Abstract:Recently, the importance of universal design has increased. Color universal design (CUD) is one type of universal design that takes people with color vision deficiency (CVD) into consideration. Websites are important media for providing various types of information and functions. Therefore, it is essential to enhance the accessibility of web pages by incorporating CUD principles. The goal of our study is to help improve the accessibility of web pages. Our approach is to automatically filter low-accessibility web pages. To evaluate the feasibility of this approach, we conducted an experiment using 21 web pages. The prediction model identified low-accessibility pages with reasonable accuracy, achieving a maximum AUC of 0.76.

[IR-21] Nous: A Predictive World Model for Long-Term Agent Memory

Link: https://arxiv.org/abs/2606.22030
Authors: Pranav Singh
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 9 pages, 1 figure, 4 tables. Preprint; ablations, LongMemEval evaluation, and a controlled comparison against concurrent work (BeliefMem) planned for a future revision

Abstract:We present Nous, a novel agent memory architecture grounded in the principle that knowledge is prediction, not storage. Rather than persisting facts as database records, vector embeddings, or knowledge-graph triples, Nous maintains a predictive world model: a collection of categorical probability distributions, called dimensions, one per entity-attribute pair observed in conversation. Each incoming observation is scored by its information-theoretic surprise S = -log2 P(obs | D), and the distribution is updated via a closed-form Bayesian posterior. The primary stored artifact is the delta, a record of the shift from prior to posterior belief, rather than the fact itself. Forgetting emerges naturally as entropy decay toward the uniform distribution, and identity resolution is handled through mutual information between entity dimension sets. Evaluated on the LoCoMo long-term conversational memory benchmark across ten conversations (1,540 questions) using GPT-4o-mini as backbone, Nous achieves F1 of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain). Against A-MEM’s self-reported GPT-4o-mini numbers, Nous shows substantial gains in three of four categories, though we note that independent citations of A-MEM’s results disagree with each other on category assignment, a reproducibility issue we discuss openly rather than resolve unilaterally. We additionally compare against BeliefMem, a concurrently developed system built on the same core premise of belief-based rather than deterministic memory; on the same benchmark and backbone, Nous’s self-reported numbers exceed BeliefMem’s self-reported numbers on all four categories, though we flag several uncontrolled differences between the two evaluation pipelines that prevent this from being a fully controlled comparison. Nous requires no external vector database or graph engine.
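
The mechanism described above (a categorical distribution per entity-attribute pair, surprise scoring with S = -log2 P(obs | D), a closed-form Bayesian update, a stored delta, and forgetting as drift toward uniform) can be sketched as follows. This is a minimal sketch under assumptions, not the Nous implementation: the `Dimension` class, the Dirichlet pseudo-count update, the closed value vocabulary, and the interpolation-based `decay` are choices made here for illustration.

```python
from math import log2

class Dimension:
    """Categorical belief over the values of one entity-attribute pair."""

    def __init__(self, values, prior_count=1.0):
        # Symmetric Dirichlet pseudo-counts: an assumption of this sketch.
        self.counts = {v: prior_count for v in values}

    def probs(self):
        total = sum(self.counts.values())
        return {v: c / total for v, c in self.counts.items()}

    def observe(self, obs):
        """Score an observation, update the belief, and return (surprise, delta)."""
        prior = self.probs()
        surprise = -log2(prior[obs])  # S = -log2 P(obs | D), in bits
        self.counts[obs] += 1.0       # conjugate Dirichlet-categorical update
        posterior = self.probs()
        delta = {v: posterior[v] - prior[v] for v in prior}
        return surprise, delta

    def decay(self, rate):
        """Pull counts toward their mean, moving the belief toward uniform."""
        mean = sum(self.counts.values()) / len(self.counts)
        for v in self.counts:
            self.counts[v] += rate * (mean - self.counts[v])

    def entropy(self):
        return -sum(p * log2(p) for p in self.probs().values() if p > 0)

# Hypothetical dimension: where a user lives, with three candidate values.
home_city = Dimension(["Paris", "Berlin", "Rome"])
surprise, delta = home_city.observe("Paris")
print(round(surprise, 3), {v: round(d, 3) for v, d in delta.items()})
```

The delta entries always sum to zero, since both prior and posterior are probability distributions. Calling `decay` with a rate between 0 and 1 raises the entropy of any non-uniform belief, which is one simple reading of forgetting as entropy decay.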

[IR-22] The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders KDD2026

Link: https://arxiv.org/abs/2606.21911
Authors: Weiqin Yang, Yue Pan, Chongming Gao, Sheng Zhou, Xiang Wang, Can Wang, Jiawei Chen
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by KDD 2026

Abstract:We identify a critical pitfall in scaling transformer-based sequential recommenders: while increasing model size improves recommendation accuracy, it simultaneously amplifies popularity bias. This bias drives systems to over-recommend popular items at the expense of niche ones, which not only undermines fairness but also degrades the broader ecosystem by reinforcing the Matthew effect and filter bubbles. Consequently, this bias amplification emerges as a fundamental obstacle to sustainable model scaling. Through comprehensive theoretical and empirical analyses, we uncover the root cause of this amplification. Our findings reveal that as model depth increases, the two core components of the transformer architecture, i.e., attention aggregation and feed-forward projections, synergistically induce severe spectral collapse in model predictions, which directly translates to the amplification of popularity bias. To address this challenge, we propose SPRINT (Scalable Popularity Regularization IN Transformers), which mitigates spectral collapse during scaling by constraining (i) the maximum column-sums of the attention score matrices and (ii) the spectral norms of the feed-forward parameters. Extensive experiments demonstrate that SPRINT significantly improves both accuracy and long-tail fairness. Crucially, it yields more favorable scaling behaviors when expanding model sizes from 0.05M to 0.34B parameters. The code is available at this https URL.
Related DOI: https://doi.org/10.1145/3770855.3818185

[IR-23] Gender Differences in Research Topic and Method Convergence among Collaborating Scholars in Library and Information Science

Link: https://arxiv.org/abs/2606.21908
Authors: Chengzhi Zhang, Linlei Xie, Siqi Wei
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Abstract:This study explores gender differences in research topic choice and methodology among collaborating scholars. Previous studies have often focused on gender differences in research topics or methods at the individual level of scholars, without considering collaborating groups, lacking depth and practical guidance. This study takes Library and Information Science (LIS) as an example, employing the Top2Vec method for topic identification and the CogFT model for research method classification. It systematically analyzes 25,204 papers published between 1990 and 2022 to investigate gender differences in the convergence of research topics and method choices among collaborating scholars in this field. The results of the study found that female scholars showed lower convergence in their research methods and topic choices compared to male scholars. This study uses a relatively systematic methodology to address the difficulty of studying gender differences in academic publishing, and is expected to serve as a reference for other disciplines and research questions. This study also emphasizes the manifestation of gender differences in collaborative research and provides insights into the convergence and diversity of research topics and methods chosen by scholars.

[IR-24] Which Review Aspect Has a Greater Impact on the Duration of Open Peer Review in Multiple Rounds? – Evidence from Nature Communications

Link: https://arxiv.org/abs/2606.21904
Authors: Haomin Zhou, Ruxue Han, Jiangtao Zhong, Chengzhi Zhang
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: Aslib JIM, 2026

Abstract:Purpose: Peer review is essential to scientific publishing, but increasing submission volumes have placed growing pressure on reviewers and editors. This study examines the relationship between sentiment toward specific review aspects and peer review duration. It also investigates how this relationship varies across disciplines and review rounds, with the aim of supporting targeted manuscript revision and improving review efficiency. Design/methodology/approach: We adopt a two-stage approach. First, fine-grained aspects are extracted from peer review reports, and a sentiment classification model is used to determine the sentiment associated with each aspect. Second, correlations between aspect-level sentiment and peer review duration are analyzed. Sentiment scores are also calculated for different review rounds to determine whether these relationships change over successive rounds. Findings: Review sentiment has a weak but statistically significant negative correlation with peer review duration, indicating that more positive reviews tend to be associated with shorter review periods. Aspects concerning Evaluation and Results and Impact and Research Value show relatively stronger correlations with review duration. The relationships between aspect-level sentiment and review duration also differ significantly across review rounds. Originality/value: This study connects the textual content of peer review reports with the temporal characteristics of the review process. By identifying review aspects that are more closely associated with review duration, it provides evidence that may help authors prioritize revisions and assist reviewers and editors in improving review efficiency. The findings contribute to reducing the burden of peer review and accelerating scholarly communication and knowledge dissemination. 
Cite as: arXiv:2606.21904 [cs.CL] (v1 submitted Sat, 20 Jun 2026 06:41:26 UTC). Related DOI: https://doi.org/10.1108/AJIM-02-2024-0158

[IR-25] Research Method Usage across Academic Ages in Library and Information Science: An Empirical Study (1990-2023)

Link: https://arxiv.org/abs/2606.21862
Authors: Chengzhi Zhang,Jiayi Hao,Yi Mao
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Academic age critically shapes career development, influencing research behavior, output volume, and methodological choices. Analyzing method variation across academic ages offers a new theoretical lens on scholarly evolution and provides early-career researchers with practical guidance for method selection. A corpus of 26,677 articles published 1990-2023 in 14 authoritative Library and Information Science journals was compiled. The CogFT model automatically classified the research methods embedded in these articles, and Top2Vec generated the topic model. This process resulted in a comprehensive dataset linking research methods with topics. Author-name disambiguation enabled calculation of each scholar’s academic age. Popularity and Shannon diversity indices for methods, together with topic diversity, were compared across academic age groups. Results reveal dynamic methodological trends: the share of theoretical approaches declined gradually, whereas experimental and bibliometric methods gained ground. Method popularity differs significantly among cohorts. Mid-career scholars exhibit the highest method diversity; late-career scholars the lowest.
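
The abstract compares Shannon diversity indices of research methods across academic-age groups. As a minimal sketch of how that index is computed (the cohort labels below are hypothetical illustrations, not data from the paper, and the function name is an assumption):

```python
import math
from collections import Counter

def shannon_diversity(labels):
    """Shannon index H = -sum(p_i * ln p_i) over the observed categories."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical method labels for two illustrative cohorts (not from the paper).
mid_career = ["experiment", "bibliometrics", "survey", "theory"]
late_career = ["theory", "theory", "theory", "survey"]

print(shannon_diversity(mid_career))   # evenly spread labels give the larger value
print(shannon_diversity(late_career))  # concentrated labels give the smaller value
```

A cohort that spreads its work evenly over many methods scores higher than one that concentrates on a single method, which is the sense in which the abstract ranks cohorts by method diversity.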

[IR-26] PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Link: https://arxiv.org/abs/2606.21710
Authors: Manveer Singh Tamber,Abhay Puri,Marc-Etienne Brunet,Perouz Taslakian,Jimmy Lin,Spandana Gella
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:AI agents acting on behalf of users are constantly making decisions, and for users to trust their agents, those decisions must align with what they actually want. Privacy is an important alignment problem for agents: every message, post, or tool call an agent makes is a contextual judgment about what is appropriate to share, with whom, and under which conditions. Because such judgments depend on social expectations and norms, human judgment does not merely label privacy violations but also helps define them. While existing work relies on unreliable proxies for both training and evaluation, we place human judgment at the center of agentic privacy alignment. We introduce PrivacyAlign, a dataset of 1,350 samples with 3,516 detailed annotations from 599 unique annotators across diverse scenarios where current LLMs actually leak, and use it to ground both alignment training and automated evaluation in human privacy norms. Building on these annotations, we first show that conditioning LLM judges on human annotations and explanations for reference responses to the same prompt makes their judgments more reliable. We then introduce annotation-conditioned reward modeling, which uses these annotations to score new responses during RL, and show that small open-weight agents trained with this reward better align with human privacy norms, with strong gains on PrivacyAlign and existing privacy benchmarks for agents.

[IR-27] CRAwLeR – Cross-Reference Aware Legal Retrieval

Link: https://arxiv.org/abs/2606.21676
Authors: Maciej Jalocha,William Michelsen
Subjects: Information Retrieval (cs.IR)
Comments: 26 pages, 18 figures

Abstract:Existing benchmarks for context-aware chunk retrieval rely heavily on repurposed task items and rarely demonstrate that their queries genuinely require context, making score interpretation difficult. We focus on a specific kind of context dependence, legal cross-references, and introduce CRAwLeR, an operationalization of a narrow, well-defined phenomenon: cross-reference-aware context utilization for chunk retrieval in legal documents. Our pipeline detects legal cross-references, identifies query candidates, links target chunks to their relevant context, generates context-demanding queries with an LLM, and filters them through both an adversarial non-contextual baseline and an assurance prompt. We release CRAwLeR-DK and CRAwLeR-PL, Danish and Polish datasets built with this pipeline, alongside a strong Anthropic-style contextualization baseline. Manual analysis finds that approximately 80% of randomly sampled queries genuinely target the labelled target chunk and require context, with failures following systematic and named patterns. The benchmarks are hard but not solved: best Recall@10 reaches 55% on CRAwLeR-DK and 59% on CRAwLeR-PL. Ablation and failure analysis attribute the remaining gap to the contextualising LLM, not the retriever. Even when the target is retrieved in the top ten, labelled context chunks routinely outrank it. We are the first dataset for context-aware chunk retrieval to carefully consider construct validity and inspect our results in the light of such a narrow, well-defined phenomenon.

[IR-28] ATLAS: Agentic Taxonomy of Large-Scale Software Ecosystems

Link: https://arxiv.org/abs/2606.21597
Authors: Junyi Lu,Mengyao Lyu,Jiahui Wu,Lei Yu,Chengwei Liu,Fengjun Zhang,Li Yang,Chun Zuo,Yang Liu
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026)

Abstract:The open-source ecosystem on GitHub lacks a systematic hierarchical taxonomy of software repositories. GitHub Topics, the dominant organizational mechanism, is flat, inconsistent, and covers only 67% of projects. We present ATLAS, the first framework that automatically constructs a hierarchical taxonomy for software repositories and classifies projects into it end-to-end. By combining LLM global knowledge with real repository distributions, ATLAS proposes meaningful splitting dimensions and iteratively corrects those that fail to accommodate real projects. A Designer Agent proposes splitting dimensions while a Classifier Agent assigns repositories; a self-corrective refinement loop uses classification failures to drive dimension revision through escalating strategies. We evaluate ATLAS on 54,387 GitHub repositories against six baselines spanning four paradigms, two downstream tasks, and three model families. On a stratified 2,001-repository benchmark, ATLAS achieves a Taxonomy Quality F-score (TQF) of 83.13%, outperforming the best baseline by 15 percentage points (on the full 54k corpus the approximate TQF is 73.0%, a gap driven by Path Granularity’s all-or-nothing scoring on longer paths rather than lower classification accuracy). It is the only method to simultaneously achieve high structural quality and high practical applicability. On downstream tasks, ATLAS enables alternative discovery with P@1 = 85.71%, surpassing even human-curated lists (62.34%), and achieves the highest P@1 for repository retrieval. The taxonomy further reveals structural ecosystem trends that are difficult to obtain from flat tags or similarity methods: the shift from libraries to AI/ML applications (now 61% of newly community-adopted projects) becomes visible only through hierarchical, type-based categorization. An interactive taxonomy explorer is available at this https URL

[IR-29] Per-Entity Bias Mapping for AI Visibility: Why Brand Mentions Require Entity-Specific Calibration

Link: https://arxiv.org/abs/2606.21595
Authors: Zoltan Varga
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 26 pages, 14 tables. Zenodo preprint: this https URL . Data and code: this https URL

Abstract:AI-mediated answer systems increasingly determine how brands and organizations are represented to users. Existing approaches reduce visibility to mention rate or citation frequency. This paper argues that aggregate metrics are insufficient because entities exhibit systematically different AI visibility error profiles. We introduce Per-Entity Bias Mapping (PEBM): a ten-dimensional framework distinguishing raw from verified mentions. Three failure modes are identified: (1) underrepresented entities suffer invisibility due to weak knowledge graph presence; (2) large entities suffer the Brand Hallucination Paradox – model familiarity creates stronger surfaces for plausible but incorrect completions; (3) CEE entities face a structural infrastructure gap across knowledge graphs, NER, and entity linking. A fourth dimension, Parametric-Retrieval Lag Asymmetry, describes divergence between retrieval-augmented and parametric memory update cycles. A full-scale empirical study (n=100 Hungarian B2B entities, 1,400 probe runs, 2,062 sources) finds Tier 1 brands produce 52.69% fabricated citations versus 37.87% for Tier 3 entities (+14.82 pp; p=1.67e-11), supporting the Brand Hallucination Paradox. Regulatory-framed queries elevate fabrication to 56.77% versus 37.59% baseline (+19.2 pp). We identify rejection-induced confabulation escalation: agentic quality filters function as hallucination accelerators in compliance contexts. We introduce ghost cartography as a unifying mechanism: entities in sparse latent regions produce confident output interpolated from neighboring dense regions, yielding a two-dimensional confabulation space (fabricated presence vs. frozen representation).
ACM classes: I.2.7

[IR-30] Dissecting Agentic RAG: A Component Ablation for Multi-Hop QA with a Local 7B Model

Link: https://arxiv.org/abs/2606.21553
Authors: Sheroz Shaikh
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 pages, 4 figures, 4 tables. Code: this https URL

Abstract:Agentic retrieval-augmented generation (RAG) systems combine iterative reasoning loops, query decomposition, and adaptive retrieval to tackle multi-hop question answering. However, the contribution of each component remains poorly understood, particularly under resource-constrained settings using only local language models. Many agentic designs add adaptive retrieval routing and deeper retrieval loops on the assumption that the added complexity helps. To test whether it does, we run a controlled ablation study of a full agentic RAG pipeline evaluated on 5,000 questions from the HotpotQA distractor development set using a local 7B parameter model (Qwen2.5-7B-Instruct). Our full pipeline achieves EM=53.2% and F1=61.6%, compared to a single-pass dense-retrieval baseline of EM=43.1% and F1=54.0%. Across eight ablation conditions, we find that: (1) fixed hybrid retrieval via reciprocal rank fusion consistently outperforms rule-based adaptive routing (+1.8 EM, +1.9 F1), as the routing heuristic over-routes to BM25 by firing on named entities present in nearly all multi-hop sub-questions; (2) two retrieval iterations over the decomposed sub-questions capture 95% of the gains of five, with no meaningful benefit from deeper loops; and (3) query decomposition and cross-encoder reranking each contribute statistically significant but smaller gains (p < 0.01 and p < 0.001, respectively). Taken together, on a fixed local-model budget, the simpler and fixed choices turn out to be competitive with or better than their adaptive versions: most of the gain comes from running a short retrieval loop, not from adaptive routing or from many iterations. We use no proprietary APIs or large-scale compute.
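
The abstract credits fixed hybrid retrieval via reciprocal rank fusion (RRF) for part of the gain. The sketch below shows the standard RRF scoring rule on two toy rankings. The document IDs, the function name, and the smoothing constant `k=60` are assumptions for illustration; the abstract does not state which constant the paper uses.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank_d), ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy rankings from a lexical and a dense retriever (hypothetical document IDs).
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because every query is scored by both retrievers, no routing decision is needed, so this scheme avoids the over-routing failure the abstract describes for the rule-based router.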

[IR-31] Memory Is No Longer a Bottleneck: Memory-Efficient Graph Filtering for Scalable Collaborative Filtering

Link: https://arxiv.org/abs/2606.21540
Authors: Jin-Duk Park,Won-Yong Shin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Information Theory (cs.IT); Social and Information Networks (cs.SI)
Comments: 13 pages, 7 figures, 8 tables; IEEE Transactions on Knowledge and Data Engineering (to appear) (Please cite our journal version.)

Abstract:Graph convolutional networks (GCNs) have demonstrated significant success in capturing complex user-item relationships for collaborative filtering (CF). However, due to their reliance on extensive model training, training-free graph filtering (GF)-based CF methods have emerged as a promising alternative, offering computational efficiency by smoothing graph signals via matrix operations. In particular, polynomial GF-based approaches demonstrate improved accuracy through their ability to design more expressive and flexible filtering functions. Despite these advantages, existing GF methods suffer from a critical memory bottleneck: they necessitate storing the full item similarity graph, incurring prohibitive memory costs for large-scale datasets, which limits their practical applicability. To tackle this challenge, we propose Mem-GF (Memory-efficient GF), a new GF-based CF method that departs from conventional designs by principally leveraging the structure of Krylov subspaces as a core mechanism for approximating polynomial graph filters without explicitly storing the item similarity graph. We theoretically analyze the minimum Krylov subspace size that guarantees lossless approximation. Through extensive experiments, we demonstrate that Mem-GF achieves up to 5.74× lower memory usage and 4.38× speedup in runtime, while consistently exceeding the recommendation accuracy of state-of-the-art GF and GCN-based methods. Mem-GF robustly scales to datasets with tens of millions of interactions, establishing itself as a practically viable and theoretically grounded solution for efficient CF.
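
To illustrate the memory argument in the abstract, the sketch below applies a polynomial filter to a signal using only matrix-vector products with the interaction matrix `R`, so the items-by-items similarity matrix `P = R.T @ R` is never materialized. This is a generic illustration and not the Mem-GF algorithm itself: the function name, the toy data, the filter coefficients, and the omission of any degree normalization are all assumptions.

```python
import numpy as np

def apply_polynomial_filter(R, x, coeffs):
    """Compute (c0*I + c1*P + ... + cK*P^K) @ x with P = R.T @ R, never forming P."""
    result = np.zeros_like(x, dtype=float)
    power = x.astype(float)          # holds P^k @ x, i.e. a Krylov vector
    for c in coeffs:
        result += c * power
        power = R.T @ (R @ power)    # two thin mat-vecs instead of an items x items matrix
    return result

rng = np.random.default_rng(0)
R = (rng.random((50, 8)) < 0.2).astype(float)  # toy user-item interactions (hypothetical)
x = R[0]                                       # one user's interaction vector
coeffs = [0.0, 1.0, -0.1]                      # hypothetical filter coefficients

scores = apply_polynomial_filter(R, x, coeffs)
print(scores.shape)
```

The vectors `x, Px, P²x, ...` visited by the loop span a Krylov subspace, which is the structure the abstract says the method exploits.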

[IR-32] From Embedding Geometry to Spectral Search: Energy Dispersion Networks For Vector Retrieval

Link: https://arxiv.org/abs/2606.21535
Authors: Lorenzo Moriondo,Ilias Azizi
Subjects: Information Retrieval (cs.IR)

Abstract:Vector spaces, such as embedding spaces that encode dense semantic information, need not be analyzed solely through pointwise geometry. They can also be interpreted as energy networks through the spectral graph induced by the topology of their column vectors, i.e., their feature-space structure. Building on this perspective, we introduce Graph Wiring, a general framework for exploiting feature-space spectral structure, together with Spectral Indexing, its task-specific instantiation for vector search. By coupling geometric similarity with spectral information, the proposed method improves head-tail coherence and semantic alignment relative to purely geometric retrieval methods. It further supports adaptive search behavior through tau-modulation, providing the flexibility increasingly required by modern Retrieval-Augmented Generation (RAG) pipelines. We present the complete algorithmic pipeline, establish its theoretical foundation through epiplexity, and evaluate the approach across benchmark and industrial settings using the open-source arrowspace library.

[IR-33] Dual-Attention Convolution Experts for Sparse Tensor Completion ECML-PKDD2026

Link: https://arxiv.org/abs/2606.21427
Authors: Yanlei Liu,Zhenyu Liao
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 18 pages, 5 figures, accepted to ECML-PKDD 2026

Abstract:Tensor factorization (TF) has been widely adopted for high-dimensional sparse data completion tasks. Despite significant progress, neural TF methods often struggle to capture complex cross-mode interactions and remain vulnerable to (extreme) data sparsity. To address these challenges, we propose a novel neural tensor factorization approach, termed Dual-Attention Convolution Expert Networks with Group-Level Contrastive Learning (DCGC). For the first problem, DCGC generates diverse non-linear alignment patterns of latent factors via a multi-channel convolution network, and leverages the gated dual-attention mechanism to drive the model to focus on more important output channels (i.e., convolution experts) and the aligned features. Furthermore, DCGC introduces a group-level contrastive learning strategy that aggregates positive samples with identical feedback levels while separating negative samples across different levels. This strategy injects high-quality self-supervised signals to mitigate data sparsity. Extensive experiments conducted on five datasets demonstrate that our DCGC outperforms the state-of-the-art methods in sparse tensor completion for traffic and recommendation applications. Code to reproduce the experimental results in the paper is available at this https URL.

[IR-34] A Rank-One Popularity Component in Dot-Product Recommender Scores: Population Theory and Prior-Separation Evidence

Link: https://arxiv.org/abs/2606.21275
Authors: Yang Cheng
Subjects: Information Retrieval (cs.IR)

Abstract:Representation anisotropy in recommender systems is often attributed to Transformer architectures. We identify a more general source in the conditional training distribution. For any encoder using a dot-product softmax decoder, the population-optimal score decomposes into pointwise mutual information, an item-marginal term log p(i), and a context-dependent offset. After centering, the item marginal produces a context-shared rank-one score component, while time-varying marginals induce a low-rank popularity subspace. This score-level result does not imply universal embedding collapse because its transfer to embeddings depends on factorization geometry. Experiments on synthetic data and public Alibaba and Tianchi interaction logs support the proposed mechanism. Separating log p(i) from the learned dot product reduces the measured popularity-aligned score energy by 98.6 percent in a matched intervention. Permutation tests confirm that this reduction is specific to the empirical popularity direction. These results explain a class of apparent representation degeneration as a decoder-level consequence of long-tailed item marginals rather than a property unique to Transformer encoders.
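
The decomposition named in the abstract can be written out as a short derivation. The notation below (the score s*, the offset b(c)) is chosen here for illustration and may differ from the paper's. It relies only on the standard fact that softmax cross-entropy is minimized when the softmax output equals the true conditional distribution, and that softmax is invariant to a per-context shift.

```latex
% Softmax cross-entropy is minimized when softmax_i s(c,i) = p(i | c), hence
s^{*}(c,i) = \log p(i \mid c) + b(c)
           = \underbrace{\log \frac{p(i \mid c)}{p(i)}}_{\mathrm{PMI}(c,i)}
             + \underbrace{\log p(i)}_{\text{item marginal}}
             + \underbrace{b(c)}_{\text{context offset}} .
% Stacking \log p(i) over all contexts gives the matrix \mathbf{1}\,(\log p)^{\top}, which has rank one.
```

The middle term does not depend on the context, so it contributes the same row to every context's score vector. That is why it appears as a rank-one, popularity-aligned component.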

[IR-35] PulseCX: Breaking the Closed-World Assumption in Real-Time CX

Link: https://arxiv.org/abs/2606.21124
Authors: Rajat Agarwal,Suvidha Tripathi,Shubham Sharma
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Conversational AI agents in Customer Experience (CX) typically suffer from a Closed-World Constraint, ignoring high-velocity external shifts like viral trends or outages. Ad-hoc web search attempts to bridge this gap but often introduces prohibitive latency and context poisoning. We introduce PulseCX, a framework that decouples knowledge acquisition from consumption. Adopting a structure-first paradigm, PulseCX employs an asynchronous agent to linearize signals into a Decay-Aware Temporal Knowledge Graph (DA-TKG) governed by reinforcement–decay dynamics to actively manage information lifecycles. By coupling this self-evolving memory with hierarchical intent gating, PulseCX removes synchronous search bottlenecks (10ms overhead) and drives significant gains in Intent Resolution (IRR) and Customer Satisfaction (s-CSAT) in dynamic environments.

[IR-36] The Token Tax of Epistemic Accuracy: Comparing RAG and Long-Context Architectures for Document-Grounded Generative AI Applications

Link: https://arxiv.org/abs/2606.20898
Authors: Austin Hamilton,Ryan Singh,Michael Wise,Ibrahim Yousif,Arthur Carvalho,Zhe Shan,Mohammad Mayyas,Lora A. Cavuoto,Fadel M. Megahed
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages, 3 figures

Abstract:Document-grounded assistants built on large language models are increasingly used in high-stakes, knowledge-intensive work. Their usefulness, however, may depend on how evidence is allocated before generation. We investigate such a claim by comparing two grounding architectures: (a) retrieval-augmented generation (RAG) that retrieves a few relevant passages, and (b) long-context prompting, which loads the whole document collection in context. We view these as two regimes of “epistemic access” on an accuracy–cost frontier. We use “epistemic accuracy” to capture model correctness that depends on having the right evidence. We posit that broader access (via long context) can increase it, but with a “token tax” (i.e., a substantial increase in cost due to larger input token consumption). We probe this framing with a case study in manufacturing safety training. Using an expert-validated benchmark, we evaluate 972 answers across three machines, two small language models, and three retrieval/in-context prompting approaches. Long-context prompting achieved the highest correctness (73.1% vs. 65.4% for semantic RAG), but at 26 times the per-query token cost. We interpret this gap as the token tax of broader evidentiary access. We carefully discuss the implications of our findings for resource-constrained organizations.

[IR-37] Topic-to-Timestamp Alignment by Constrained Evidence Selection

Link: https://arxiv.org/abs/2606.20890
Authors: Zeynep Yılbırt,Marina Litvak,Michael Färber
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Meeting archives are difficult to search when users remember what was discussed but not when. We study topic-to-timestamp alignment: given a natural-language topic and a timestamped meeting transcript, the goal is to return the time at which the topic is discussed. A standard RAG setup can retrieve relevant transcript excerpts, but still asks the language model to generate a timestamp, which can produce unsupported or invalid timecodes. We therefore recast timestamp prediction as constrained temporal candidate selection: the system retrieves timestamped transcript chunks, and the model selects the candidate that best grounds the topic instead of generating a timecode. On 420 topic-timestamp queries from 200 municipal meeting transcripts, this increases Recall@5 from 31.9% to 50.0%, reduces MAE from 837.0 seconds to 761.0 seconds with Mistral-7B-Instruct, and increases the number of parseable outputs from 373 to 419 of 420 queries. The results suggest that temporal grounding in long transcripts depends strongly on retrieval quality and output design, not only on the choice of the language model.
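
The core idea in the abstract, selecting among retrieved timestamped chunks rather than generating a timecode, can be sketched as follows. Everything here is hypothetical scaffolding: the chunk fields, the function names, and the keyword-overlap chooser, which stands in for the language model's selection step and is not the paper's method.

```python
def select_timestamp(candidates, choose_index):
    """Return the start time of the chosen chunk, or None if the choice is invalid."""
    idx = choose_index([c["text"] for c in candidates])
    if isinstance(idx, int) and 0 <= idx < len(candidates):
        return candidates[idx]["start_sec"]
    return None

def make_overlap_chooser(topic):
    """Stand-in for the model: pick the chunk sharing the most words with the topic."""
    topic_words = set(topic.lower().split())
    def choose(texts):
        overlaps = [len(topic_words & set(t.lower().split())) for t in texts]
        return max(range(len(texts)), key=overlaps.__getitem__)
    return choose

# Hypothetical retrieved chunks with their start times in seconds.
candidates = [
    {"start_sec": 120.0, "text": "budget approval for road repairs"},
    {"start_sec": 845.5, "text": "public comment on the new library hours"},
]

print(select_timestamp(candidates, make_overlap_chooser("library hours")))
```

Because the returned value is always copied from a retrieved chunk, the output is either a timestamp that exists in the transcript or an explicit `None`, which is one way to read the abstract's reported rise in parseable outputs.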

Human-Computer Interaction

[HC-0] Why Machines Misread Pedagogical Quality: Human-Machine Alignment in LLM-Based Pretest Question Evaluation

Link: https://arxiv.org/abs/2606.23629
Authors: Pei-Yu Tseng,Mahir Akgun,Peng Liu
Subjects: Human-Computer Interaction (cs.HC)

Abstract:Designing effective pretest questions is challenging at scale: high-quality questions require careful calibration of openness, cognitive depth, and alignment with learning objectives, yet generating and evaluating them manually is time-consuming. We present an AI-assisted workflow for pretest question development that combines automated generation, rubric-based evaluation, and iterative selection. Because the workflow relies on machine evaluation to filter questions at scale, we investigate the alignment between human and machine judgments across a 2x2 design varying rubric operationalization and evaluation mode. Our findings show that human-machine disagreements are systematic rather than random, that rubric revision has a larger effect on alignment than rationale-first evaluation, and that the two interventions are complementary. These findings highlight that scalable AI-assisted pretesting depends not only on generation capability but on how pedagogical quality is operationalized for machine interpretation.

[HC-1] Hallucinations in Organization-backed AI advisors: Evidence about Skepticism, Verification, and Reliance in Goal-Directed Use

Link: https://arxiv.org/abs/2606.23491
Authors: Simon J. Blanchard,Aaron M. Garvey,Laura O’Laughlin
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)

Abstract:Generative AI systems are increasingly used by organizations to deliver information to consumers, patients, students, employees, and citizens. These systems can hallucinate, producing plausible but inaccurate responses. A central question for AI-advised decisions is therefore not only whether users rely on inaccurate information, but whether they recognize that a response may require verification. To answer this question, we review emerging empirical evidence relevant to hallucination detection in goal-directed interactions, with a focus on organization-backed AI advisors. We distinguish three constructs that existing studies often conflate: whether users are skeptical of information presented, whether they check it, whether checking succeeds, and whether the result of user verification affects reliance on the information. Across studies examining product search, medical decision-making, content generation, and chatbot-assisted tasks, several patterns emerge. Nearly all studies measure reliance, while variables such as user skepticism and verification of the information are more often targeted by an intervention than measured directly. The cues used to prompt scrutiny of the AI response are predominantly related to the AI output, such as source citations, and the most deployable of these AI output interventions for organizations (general and specific warnings about the risk of hallucinations) show the weakest and most mixed effects in the studies reviewed. Although the existing literature posits that users may be more likely to scrutinize responses related to particular areas of content, no studies varied the content category, leaving this question open for further research. In future research, measuring skepticism and verification separately from reliance may clarify what current evidence shows, what it only implies, and which questions require further exploration.

[HC-2] Towards a Bathroom-Centered Human-Building Digital Twin Framework for Indoor Safety Analysis

Link: https://arxiv.org/abs/2606.23292
Authors: Yuanzhi Su,Huiying (Cynthia) Hou
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures

Abstract:Bathroom use is a critical safety challenge for older adults because wet surfaces, constrained layouts, limited support, and frequent posture transitions are concentrated within a small domestic space. These conditions create risks that cannot be adequately understood by considering either the bathroom environment or human motion in isolation. Existing bathroom safety studies mainly identify hazards, accessibility problems, or design modifications, whereas human-centered sensing studies often focus on activity recognition or fall detection without sufficient semantic understanding of the surrounding environment. This separation limits the interpretation of how older adults interact with fixtures, support surfaces, wet areas, and spatial constraints during daily bathroom activities. To address this gap, this study proposes a bathroom-centered human-building digital twin framework for interaction-aware indoor safety analysis with a specific emphasis on older adult bathroom safety. The framework conceptualizes bathroom risk as a coupled human-environment process and integrates semantic bathroom representation, skeleton-based human representation, spatial-semantic coupling, interaction-aware event analytics, and safety-oriented visualization. A Unity-based proof-of-concept prototype is developed to demonstrate the feasibility of the framework. Although the current work remains a prototype-oriented investigation, it establishes a methodological basis for analyzing older adults’ bathroom safety through explicit body-environment relations and for advancing privacy-sensitive, interaction-aware digital twin applications in aging-in-place residential environments.

[HC-3] When Suspicion Becomes Detection. Folk Deception Cues and Detection Strategies in Online Dating Romance Scams

Link: https://arxiv.org/abs/2606.23241
Authors: Sima Amirkhani,Jana Krüger,Dave Randall,Gunnar Stevens,Douglas Zytko
Subjects: Human-Computer Interaction (cs.HC)

Abstract:The growth of mobile dating platforms has coincided with a rise in romance scams, in which offenders construct convincing personas to defraud users. While research on romance scams is expanding, victims’ lived experiences of recognizing and responding to deception in mobile-mediated interactions remain insufficiently understood. To address this gap, we conducted in-depth interviews with 24 victims of online dating romance scams in Iran, where legal, social, and cultural constraints limit formal support. Our analysis identifies suspicion cues and the investigative strategies victims use to verify identities across platforms. We show that victims are not passive recipients of deception but engage in active, iterative detection practices under significant emotional, social, and relational pressure. Based on these findings, we contribute empirically grounded insights into deception cues and user-driven detection work, and we discuss implications for the design of mobile technologies that better support users in identifying, resisting, and recovering from romance scams. Content Warning: This paper discusses sexual violence.

[HC-4] Students’ Perception Accuracy of Partners’ AI Use and its Relation to Collaboration Performance

Link: https://arxiv.org/abs/2606.23237
Authors: Laura Graf,Ramona Beinstingel,Stephan Kusche,Oleksandra Poquet
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)

Abstract:Collaborative assignments are a cornerstone of programming education. Effective collaboration during a programming project depends on the formation of reasonably accurate beliefs about how each partner works. Generative AI tools, now widely used by undergraduate students, have introduced a consequential and largely invisible new dimension into collaboration: each student’s use of AI. When partners collaborate remotely, they interpret partners’ ability and effort through their code. This raises the question of how accurately students perceive each other’s AI use in collaborations, and if a misalignment in these perceptions relates to team performance. To address this question, we conducted a three-wave longitudinal study of 103 student pairs in an introductory software engineering course. We found that greater misalignment between partners’ beliefs about each other’s AI use early in the project was associated with lower final project scores. The effect of such misaligned perceptions is the strongest in teams with lower prior programming performance, suggesting that low performing students pay a higher cost of misaligned perceptions. The perception misalignment does not consistently decrease through face-to-face pair-programming sessions. This suggests that ways to foster transparency may be needed to support student teams in collaborative programming.

[HC-5] Machine-knittable Magnetically-Plug-n-Play E-Textile Prototyping

Link: https://arxiv.org/abs/2606.22800
Authors: Yifan Li,Ryo Takahashi,Wakako Yukita,Irmandy Wicaksono,Kanata Matsutani,Yuhiro Iwamoto,Sunghoon Lee,Tomoyuki Yokota,Takao Someya,Yoshihiro Kawahara
Subjects: Human-Computer Interaction (cs.HC)

Abstract:Electronic textiles (e-textiles) integrated with wearable sensors are essential for daily motion monitoring and long-term physiological sensing. For example, capturing optimal kinematic or bio-signals requires aligning sensors with specific anatomical parts, which vary significantly across individuals and application scenarios. This necessity for personalization makes e-textile prototyping inherently iterative; however, current fabrication methods, such as manual conductive stitching, rely on permanent bonds that restrict rapid adjustment. This paper introduces Plug-n-play e-knit, a machine-knittable e-textile prototyping platform that enables repeatable, quick adjustment of sensor positions across garments. First, to cover the large area of the textile for prototyping, we use industrial digital knitting of conductive yarn to integrate power and communication buses directly into the large-scale textile. Then, to ensure plug-n-play attachment to the textile, we employ soft-magnetic connectors that enable sensors to be repeatedly plugged into the wiring without damaging the fabric. Furthermore, our LED-positioning system enables the automatic identification and localization of each sensor node. We demonstrate the platform’s capabilities through forearm movement calibration and position-aware temperature mapping.

[HC-6] Supporting Tutors in the Gig Economy with Automated Feedback: A Case Study on Ringle

Link: https://arxiv.org/abs/2606.22609
Authors: Yeon Su Park, Sieun Kim, Keighley Overbay, Seoyoung Kim, Sewook Wee, Daho Jung, Juho Kim
Categories: Human-Computer Interaction (cs.HC)

Abstract:The rise of online tutoring platforms in the gig economy has made education more scalable, flexible, and on-demand. These platforms rely on learner evaluations as the primary feedback for tutors and platforms. However, such feedback offers limited guidance for tutors’ improvement and makes it difficult to monitor tutor quality at scale. To this end, we explored AI-powered automated feedback and how tutors perceive and respond to it. We deployed a research probe on Ringle, a popular online English tutoring platform, that analyzed tutors’ lessons and provided automated feedback. We then surveyed 36 tutors about their experience. Our findings reveal that while tutors perceived automated feedback more negatively than learner feedback, they found it useful for self-monitoring and understanding platform expectations, though discrepancies between them often caused confusion. Based on these insights, we propose design considerations for feedback systems for online educational gig platforms.

[HC-7] MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

Link: https://arxiv.org/abs/2606.22557
Authors: Yikun Fu, Bowen Fu, Zhenyu Wu, Shuang Cheng, Xiaowei Sun, Bowen Yang, Zehao Li, Yibo Zhao, Zichen Ding, Zhoumianze Liu, Shijie Wang, Biqing Qi, Bowen Zhou
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Abstract:Computer use agents (CUAs) have advanced rapidly in desktop automation, and a growing number of users deploy CUAs such as OpenClaw on Mac Mini for always-on automation. However, existing benchmarks, including those for macOS, evaluate agents without framework augmentation and rely on binary evaluation. As a result, they fail to capture both the framework capabilities leveraged by modern CUAs and the partial progress on long-horizon, multi-application tasks. We present MacAgentBench, a comprehensive macOS agent benchmark comprising 676 tasks across 25 applications, with nearly 60% involving both GUI and CLI interaction. The benchmark adopts deterministic rule-based evaluation and introduces fine-grained multi-checkpoint scoring with capability annotations for multi-application tasks. Experiments across three frameworks and 16 models show that the best configuration, Claude Opus 4.6 on OpenClaw, attains 73.7% Pass@1, while this advantage is primarily driven by the skill library rather than by framework design. Fine-grained metrics further reveal that models with similar Pass@1 can differ substantially in sub-goal completion. Our code and data are publicly available at this https URL.
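
The contrast between binary evaluation and multi-checkpoint scoring can be illustrated with a minimal Python sketch. It assumes, hypothetically, that a task is a list of pass/fail checkpoints and that the fine-grained score is the unweighted fraction of checkpoints passed. The abstract does not specify the benchmark's actual scoring rule, and the names `binary_success` and `checkpoint_score` are illustrative only.

```python
def binary_success(checkpoints):
    """Return 1.0 only if every checkpoint passed (all-or-nothing evaluation)."""
    if not checkpoints:
        raise ValueError("a task needs at least one checkpoint")
    return 1.0 if all(checkpoints) else 0.0


def checkpoint_score(checkpoints):
    """Return the fraction of checkpoints passed (hypothetical unweighted rule)."""
    if not checkpoints:
        raise ValueError("a task needs at least one checkpoint")
    return sum(bool(c) for c in checkpoints) / len(checkpoints)


# Two made-up runs of the same four-checkpoint task.
run_a = [True, True, False, False]
run_b = [True, True, True, False]

for name, run in (("run_a", run_a), ("run_b", run_b)):
    print(name, binary_success(run), checkpoint_score(run))
```

Both made-up runs fail under all-or-nothing scoring, yet they differ in how many sub-goals were completed, which is the kind of gap the abstract attributes to fine-grained metrics.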

[HC-8] Human and AI collaboration for pulmonary nodule segmentation

Link: https://arxiv.org/abs/2606.22486
Authors: Hongqiao Dong, Wenhao Chi, Ruobing Liang, Xiaokui Yang, Wenhua Liang, Peng Hou, Wenjun Pu, Yipeng Zhao, Ping Chen, Haiping Liu, Jianxing He, Bo Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Medical expert annotators are scarce, and blind reliance on artificial intelligence (AI) can be misleading, motivating approaches in which humans, particularly junior medical trainees or even non-medical personnel, collaborate with AI to achieve robust medical segmentation. Although the Segment Anything Model (SAM) shows promise for general-purpose image segmentation, its performance in human-AI collaboration for specialized medical tasks has not been thoroughly evaluated. Here we present Hi-Seg, a human-in-the-loop segmentation framework for pulmonary nodules built on SAM. Humans iteratively refine prompts through trial-and-error learning and semantic reasoning, progressively guiding SAM toward higher-quality masks. Using chest CT scans from 1,179 patients across 12 centers, we conducted the first large-scale external validation of collaborative human-SAM segmentation. Across all annotator groups, Hi-Seg achieved a mean Dice score of almost 85%, outperforming five state-of-the-art deep learning models by 10-22% and 13 SAM variants by 1-29%. Hi-Seg improved segmentation accuracy while reducing annotation time for medical annotators, and briefly trained non-medical annotators achieved performance comparable to that of the junior medical student. These findings suggest that human-in-the-loop segmentation can reduce clinician workload, enable scalable crowdsourced annotation, and transform clinical workflows by facilitating the safe and efficient integration of foundation models into routine clinical practice.
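
For readers unfamiliar with the Dice score quoted above, the following Python sketch shows the standard definition, 2|A∩B| / (|A| + |B|), on two toy binary masks. The masks, the flat-list representation, and the handling of two empty masks are assumptions made for illustration and are not taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: 3 foreground pixels each, 2 of them shared.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
print(round(dice_score(pred, truth), 3))  # 2*2 / (3+3) = 0.667
```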

[HC-9] Governed AI-Assisted Engineering: Graduated Human Oversight for Agentic Code Generation in Regulated Domains

Link: https://arxiv.org/abs/2606.22484
Authors: Richard Kang
Categories: Human-Computer Interaction (cs.HC)

Abstract:The adoption of agentic AI coding systems – where autonomous agents generate, review, test, and deploy code with minimal human intervention – creates a governance challenge in regulated industries. Existing frameworks address AI-assisted development maturity or the productivity-reliability tension but offer no mechanism for calibrating human oversight intensity to regulatory impact. We present the Governed AI-Assisted Engineering (GAIE) framework, a three-tier graduated human oversight model for agentic code generation in regulated domains. GAIE introduces the Oversight Classification Model (OCM), a deterministic decision function that classifies code generation tasks by regulatory impact, customer proximity, reversibility, and data sensitivity to route them through one of three oversight tiers: human-in-the-loop (strategic functions), human-over-the-loop (customer-impacting), or automated-with-monitoring (internal). Each tier defines required evidence artifacts for compliance auditability. We map GAIE against the Bank of Thailand’s 2025 AI risk-management policy and demonstrate cross-jurisdiction applicability to MAS (Singapore), NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Evaluation through regulatory coverage analysis, comparative framework analysis, and analytical productivity modeling suggests that graduated oversight preserves 84–97% of agentic coding velocity (central estimate: 91%) while maintaining compliance evidence coverage for regulated functions. GAIE contributes a framework that explicitly bridges AI-assisted development maturity with regulatory governance through proportionate human oversight.
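
A deterministic routing function of the kind described above might look like the Python sketch below. The four inputs and the three tier names come from the abstract; the boolean encoding, the specific rules inside `classify`, and all identifiers are hypothetical, since the abstract does not state the OCM's actual decision logic.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HUMAN_IN_THE_LOOP = "human-in-the-loop"
    HUMAN_OVER_THE_LOOP = "human-over-the-loop"
    AUTOMATED_WITH_MONITORING = "automated-with-monitoring"


@dataclass(frozen=True)
class Task:
    regulatory_impact: bool  # touches a regulated function
    customer_facing: bool    # close to the customer
    reversible: bool         # change can be rolled back
    sensitive_data: bool     # handles sensitive data


def classify(task: Task) -> Tier:
    """Route a task to an oversight tier (illustrative rules, not the paper's)."""
    # Strictest tier: regulated work, or sensitive data with no rollback path.
    if task.regulatory_impact or (task.sensitive_data and not task.reversible):
        return Tier.HUMAN_IN_THE_LOOP
    # Middle tier: anything customer-impacting, sensitive, or hard to undo.
    if task.customer_facing or task.sensitive_data or not task.reversible:
        return Tier.HUMAN_OVER_THE_LOOP
    # Everything else is internal and reversible.
    return Tier.AUTOMATED_WITH_MONITORING


internal_refactor = Task(False, False, True, False)
print(classify(internal_refactor).value)  # automated-with-monitoring
```

Because the function is pure and has no randomness, the same task attributes always map to the same tier, which is the property that makes such a classifier auditable.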

[HC-10] A Taxonomy of Conceptual Alignment in Human-Robot Dialogue

Link: https://arxiv.org/abs/2606.22360
Authors: Shengchen Zhang, Xiaohua Sun, Weiwei Guo
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 2 figures. To be presented at RO-MAN 2026

Abstract:Successful conversations require speakers to align on the meaning of concepts, a challenging but crucial task for human-robot interaction. Understanding the process of establishing such alignment is hindered by competing interpretations of the term and isolated, unidirectional investigations of its design space. This paper argues for a design-centric understanding of conceptual alignment as a bidirectional and co-constructive process. We introduce a taxonomy that characterizes conceptual alignment dialogues along what triggers its initiation and what level(s) of conceptual understanding it concerns. We further present a dialogue act schema as an operational tool that captures the interactional moves through which alignment is achieved. Together, these contributions provide a structured foundation for analyzing, comparing, and designing conceptual alignment in human-robot interaction.

[HC-11] Curiosity as Linguistic Intervention: Using LLM Tutoring Dialogues to Influence Exploratory Learning Behavior EMNLP2026

Link: https://arxiv.org/abs/2606.22349
Authors: Gevindu Ganganath, Pasindu Bolonghege, Qianru Lyu, Pradeep Varakantham, Thivya Kandappu
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Submitted to EMNLP 2026

Abstract:Large Language Models (LLMs) provide a new opportunity to study how language shapes exploratory cognition because conversational strategies can be systematically manipulated at inference time. We introduce CURIOBOT, a framework that operationalizes Berlyne’s collative variables, novelty, complexity, conflict, and uncertainty, as adaptive linguistic interventions for conversational tutoring. Across 270 tutoring conversations spanning multiple model families, domains, and topic complexity levels, curiosity-oriented interventions consistently increased exploratory learner behaviors, producing up to 2.4x more conversational turns under fixed time budgets. To measure these effects, we further introduce a learner-centered evaluation framework capturing exploratory questioning, conversational agency, productive struggle, and observable curiosity. Learner-side gains persisted even when tutor-side instructional quality remained unchanged, suggesting that curiosity functions as a partially independent interaction-level mechanism. More broadly, our results demonstrate that LLM-mediated dialogue can serve as a scalable experimental framework for studying how language shapes exploratory learning behavior.

[HC-12] What Changes When the Interlocutor Is an AI? Interactional Fluency and Linguistic Uptake in L2 Spoken Dialogue

Link: https://arxiv.org/abs/2606.22225
Authors: Russell Scheinberg, Ameeta Agrawal, Tetyana Sydorenko, Kalab Kahsay, Nina Vyatkina, Griet Boone
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted at Educational Data Mining 2026

Abstract:Voice-based AI systems are increasingly used for L2 speaking practice, but evaluations rarely characterize the interactional processes they create. We analyze 78 university learners of German across four sites completing a counterbalanced spot-the-difference task with both a human peer and a real-time AI partner. From diarized ASR transcripts, we extract measures of interactional fluency, linguistic uptake, and learner experience. Human dialogue was faster and more balanced, with many short turns; AI dialogue resembled supported monologue, with fewer, longer turns, reduced learner floor share, and greater within-turn fluency. The AI’s verbose, syntactically regular input was associated with greater short-term uptake and stronger syntactic priming after controlling for input volume. Attitudes toward AI improved after the task, and satisfaction was predicted by production fluency rather than uptake. The results show complementary affordances for AI and human dialogue in L2 practice.

[HC-13] Open AI in the Wild: Adoption and Adaptation of Open Models on r/LocalLLaMA

Link: https://arxiv.org/abs/2606.22211
Authors: Woohyeuk Lee, James Howison, Min Kyung Lee, Hanlin Li
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted at FAccT’26

Abstract:Existing work on AI openness has focused on defining what technical components or release practices qualify a system as “open”. However, less is known about how openness is understood and put into practice by people who adopt and adapt these models under real-world constraints. In this paper, we present an empirical study of r/LocalLLaMA, a large online community centered on running and customizing open foundation models locally. Through thematic analysis of community discussions, we find that members conceptualize openness pragmatically - in relation to reliability, local control, privacy, and the ability to adapt models under constraints such as compute resources, licensing, and usability. We identify key motivations for adopting open models, including autonomy, experimentation, and resistance to platform instability, as well as deterrents such as steep learning curves and performance gaps compared to closed systems. We further describe how shared resources and projects, including datasets, evaluation frameworks, and inference tools, sustain interdependent development in the broader open AI ecosystem beyond individual model releases. We then discuss the implications of a utility-oriented view of openness, and how producer support for downstream usability and infrastructure could better enable sustained innovation in open model ecosystems.

[HC-14] TraceView: Interactive Visualization of Agentic Program Repair Trajectories

Link: https://arxiv.org/abs/2606.22110
Authors: Amirali Sajadi, Tu Nguyen, Kimmie Huynh, Esteban Parra, Preetha Chatterjee
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:LLM-based automated program repair (APR) agents generate patches to fix software bugs with minimal human intervention. These agents often produce long trajectories of reasoning, tool use, and feedback to produce candidate patches. Final patch outcomes show whether a repair attempt succeeded or failed, but they do not show how the agent reached that outcome, or where the process became repetitive or misaligned with the task. This makes agentic repair failures difficult to diagnose, reproduce, and prevent. To help developers address these challenges, we present TraceView, an interactive tool for labeling and visualizing repair trajectories from APR systems. TraceView organizes raw and pre-labeled agentic runs with Thought, Action, and Result components to support semantic relation labeling and diagnosis, and renders the resulting trajectory as graph views. Furthermore, TraceView provides relation filters, patch outcome summaries, metrics, and node-level evidence panels to help users inspect how reasoning, actions, and feedback connect across the various steps of an agentic repair attempt. We evaluate TraceView with five researchers through a survey-based user study. Participants reported that TraceView made trajectories easier to scan and that its overview-to-detail workflow helped them better understand repair behavior. The TraceView source code is available at this https URL. A screencast of TraceView is available at this https URL.

[HC-15] The Cognitive Trajectory Laboratory: Modeling the Creative Process Through Time in Art Therapy

Link: https://arxiv.org/abs/2606.22057
Authors: Nicholas Davis
Categories: Human-Computer Interaction (cs.HC)

Abstract:Art therapy has demonstrated effectiveness across diverse clinical populations, and its theoretical traditions have generated valuable perspectives on symbolism, expression, narrative reconstruction, meaning-making, physiological responses, and neurobiological processes. While these approaches provide important accounts of therapeutic experience and change, they have placed comparatively less emphasis on how cognition, regulation, and interaction dynamics evolve during the creative process itself, making it difficult to analyze how creativity and therapeutic outcomes emerge through time. As a result, art therapy research continues to rely heavily on qualitative interpretation, outcome measures, and retrospective self-report, while the dynamics of therapeutic change remain difficult to quantify. This paper proposes an enactive, dynamical framework for understanding and measuring cognitive change in art therapy through the analysis of creative interaction dynamics over time. Within this framework, therapeutic change is hypothesized to be reflected in cognitive trajectories, temporally unfolding patterns of engagement that reveal shifts in stability, exploration, and adaptation. To operationalize this framework, the paper introduces the Cognitive Trajectory Laboratory (CTL), an instrumented drawing environment that transforms interaction traces into cognitive trajectories unfolding through time, enabling the identification of emergent properties, significant events, and overarching chapters of the creative process. By making the dynamics of creative engagement measurable, the proposed framework and accompanying laboratory provide new methodological tools for art therapy assessment and research while creating opportunities for longitudinal analysis of therapeutic change. Implications are discussed for process-oriented evaluation and computational modeling of creative engagement.

[HC-16] Old Fictions New Skins: Evaluating the Manipulative Capabilities of LLMs in Healthcare

Link: https://arxiv.org/abs/2606.21977
Authors: Gathoni Ireri, Roger D. Odipo
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 28 pages, 4 figures

Abstract:Large language models (LLMs) are increasingly piloted in African healthcare contexts, raising concerns about their potential to manipulate users in high-stakes settings. In a randomised experiment, we examined the manipulative capabilities of two publicly available models, ChatGPT 5.2 and DeepSeek V3.2, among Kenyan participants (N = 303). Participants interacted with either a manipulative variant or a non-manipulative variant before making a treatment decision within a hypothetical clinical scenario. The manipulative variant was prompted to covertly steer participants towards an incorrect treatment option while the non-manipulative variant served as the control condition. Manipulation success rates were higher in the manipulative condition (59.5%) than in the control condition (44.0%), with the effect reaching significance (OR = 2.11, 95% CI [1.12, 4.00], p = .021). These findings highlight the need for improved safety infrastructure specifically targeting manipulation, particularly given the integration of AI into healthcare systems across Africa.

[HC-17] Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems INTERSPEECH2026

Link: https://arxiv.org/abs/2606.21970
Authors: Jingjing Jiang, Atsumoto Ohashi, Ryuichiro Higashinaka
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2026

Abstract:Full-duplex spoken dialogue models, such as Moshi, enable natural, low-latency voice conversations. However, they remain limited to the audio modality, lacking the facial expressions that are integral to human communication. We present Moshi-Face, the first full-duplex dialogue model that jointly processes the user’s audio and facial input while simultaneously generating speech and facial motion. We first construct a vector-quantized variational autoencoder (VQ-VAE) as a face codec that encodes 3D head meshes extracted from facial videos into compact discrete tokens, referred to as face tokens, and conversely reconstructs 3D meshes from these tokens. We then extend Moshi with a Face Transformer module that generates face tokens non-autoregressively, enabling Moshi-Face to produce synchronized audio and face tokens in real time. Experiments show that Moshi-Face achieves audiovisual alignment at low latency while preserving the dialogue quality of the original audio-only model.

[HC-18] AI-Mediated Negotiation: Design Reflections and Lessons

Link: https://arxiv.org/abs/2606.21886
Authors: Veda Duddu, Jash Rajesh Parekh, Andy Mao, Hanyi Min, Ziang Xiao, Vedant Das Swain, Koustuv Saha
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Abstract:Conversational AI promises a new kind of preparation for high-stakes workplace negotiations – personalized, interactive, and capable of simulating realistic resistance. That promise is intuitive. We built Trucey, a theory-driven coaching system, to test it. The system encoded four assumptions: that articulation supports clarification, that personalization builds strategic competence, that chunked delivery reduces cognitive load, and that structured scaffolding removes metacognitive burden. A pre-registered experiment (N=267) and interviews (N=15) complicated each of them. Notably, the static handbook we included as a passive control outperformed both AI conditions on empowerment and usability. We reflect on why: each assumption encoded a specific model of how preparation unfolds, and the findings revealed that conversational AI imposes a linear execution model on a task that is fundamentally recursive. We identify an unexamined scope condition on established HAI design guidelines and close with a sequencing principle – map before path, path before simulation – for future AI coaching design.

[HC-19] Low-Vocality Engagement Shapes Online Participation

Link: https://arxiv.org/abs/2606.21665
Authors: Veronica Mesina, Andrea Failla, Luca Pappalardo, Giulio Rossetti
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:Online participation is often measured through visible expression, especially posting, yet many consequential forms of engagement occur through less vocal actions such as liking and following. Here we study how users inhabit Bluesky by reconstructing participation profiles from more than three billion activity records produced by a near-complete sample accounting for more than 80% of registered users. We aggregate behavior into monthly user-level observations and distinguish two dimensions that are often conflated in platform analytics: intensity, capturing how much users engage, and style, capturing how engagement is expressed across actions. We find that vocal production is highly concentrated, but low-posting behavior does not imply absence from platform participation. High-intensity engagement is most strongly associated with liking rather than posting, while posting-oriented participation is more common among low-intensity users, indicating that visibility and sustained engagement should not be conflated. Transition patterns suggest that high-intensity likers and posters could be described as attractors; network-building redirects users within the active space; whereas observed inactivity acts as a persistent boundary that selectively limits re-entry. Higher-order motifs further show that inactivity often interrupts rather than erases prior regimes, and that low-intensity liking can precede durable high-intensity engagement. These results show that online participation is structured by differentiated low-vocality practices, calling for a shift from post-centered measures of activity toward dynamic accounts of platform presence. We identify a broader challenge for computational social science: platform participation cannot be adequately understood through the behavior of vocal minorities alone.

[HC-20] Measuring What Matters: A Quantitative UX Evaluation Framework for AI-Assisted Home Search

Link: https://arxiv.org/abs/2606.21663
Authors: Matilda Nkoom
Categories: Human-Computer Interaction (cs.HC)
Comments: 11 pages, no figures

Abstract:AI-assisted conversational search is rapidly displacing filter-based interfaces across the major home search portals. Redfin’s deployment of conversational search produced a 47% lift in tour requests, and Zillow launched “AI Mode” in March 2026. Recent consumer surveys indicate that a large majority of Americans now use AI tools for housing market information. Yet the evaluation frameworks practitioners apply to these products remain borrowed from general-purpose usability testing, tools designed for deterministic, filter-driven interfaces that do not capture the distinctive failure modes of AI-driven experiences. This paper proposes a four-layer quantitative evaluation framework purpose-built for AI-assisted home search: recommendation system quality, interaction efficiency, attitudinal measurement, and trust calibration. For each layer, validated instruments, production-derived benchmarks, and practitioner-ready implementation guidance are provided. A minimum viable metric set and a worked example illustrating the framework’s application to a mid-sized portal are included to support immediate adoption.

[HC-21] Voluntary Triggering of Shared-Autonomous Prosthetic Control via IMU-Based Motion Gestures

Link: https://arxiv.org/abs/2606.21620
Authors: Aabira Zaman, Kaijie Shi, Xianta Jiang
Categories: Human-Computer Interaction (cs.HC)

Abstract:Recently, a shared-autonomous scheme has been introduced into the field of prosthetic hand control, where the user provides high-level intent by moving the hand towards the target, and the artificial intelligence system autonomously executes low-level control (e.g., grasping and releasing the object). This system reduces user workload but risks unintended grasp or release actions without explicit user control. In particular, release actions remain challenging, as vision-based autonomous systems typically assume that proximity to a supporting surface signals the user’s intent to let go, making mid-air release tasks difficult and error-prone. This study presents an inertial measurement unit (IMU)-based gesture-triggered interface enabling voluntary initiation or override of the autonomous system’s grasp and release actions. A real-time motion detection algorithm recognizes three deliberate upper-limb gestures (shoulder shrug, elbow flap, and wrist shake) across three control paradigms: autonomous, hybrid, and manual. In a controlled study with 14 able-bodied participants and one individual with an upper-limb difference, the elbow flap emerged as the most preferred gesture (66% preference) and achieved a 95% mean success rate. Manual mode produced the highest accuracy (95%), while autonomous mode and hybrid mode were most preferred for daily use (38%). Results suggest that IMU-based voluntary triggers enhance alignment between user intent and prosthetic action, improving reliability and perceived control. This approach offers a practical pathway toward safer, more adaptable prosthetic systems and can be extended to real-world applications requiring rapid, intentional overrides of autonomous behavior.

[HC-22] CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents EMNLP2026

Link: https://arxiv.org/abs/2606.21453
Authors: Youngwon Choi, Hyeonyu Kim, Taeyoun Kwon, Donghyuk Jung, Myeongkyun Cho
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to EMNLP 2026 Industry Track

Abstract:Task-oriented voice agents need to map spoken user requests to structured outputs such as semantic frames, executable actions, and function calls. A common approach is to cascade ASR with a text-based LLM, but transcription errors can propagate to downstream structured output generation, especially under noisy conditions. Spoken language models (SLMs) offer a direct speech-based alternative, yet adapting them to new tasks typically requires paired speech-target annotations. Motivated by this gap, we present CORTIS, a text-only adaptation framework for task-oriented voice agents. CORTIS fine-tunes SLMs using text-form task supervision, enabling speech-based structured output generation at inference time without task-specific speech-target annotations during adaptation. We evaluate CORTIS on two Qwen2.5-Omni backbones and three task-oriented speech datasets, including an in-house product dataset, and compare it with matched ASR-LLM cascades trained with the same text-form task supervision. Results show that CORTIS performs competitively with matched cascades and offers clearer advantages under acoustic degradation, particularly in preserving high-level task semantics. These findings suggest that text-only fine-tuning of SLMs can serve as a practical adaptation strategy for voice agents when paired speech-target data are costly to collect.

[HC-23] Pixel Watch: Robust Heart Rate Sensing from Multipath PPG and On-Device Deep Learning Trained on 10000 hours of Free-Living and Fitness Data

Link: https://arxiv.org/abs/2606.21436
Authors: Daniel Roggen, Megan Walker, Yojan Patel, Shyam Tailor, Dimitris Spathis, Matt Wimmer, Brennan Garrett, Dan Howe, Abhinuv Pitale, Hamed Vavadi, Tien Le, Steve Diamond, Oleksiy Vyalov, Vik Sharma, Pete Richards, Tracy Giest, Erika Siegel, Tuan Phan, Sam Mravca, Derrick Vickers, Benjamin Stone, Katarina Vukosavljevic, Justin Phillips, YongSuk Cho, Stefanie Hollidge, Antony Siahaan, Soren Brage, Shwetak Patel, Robert Harle
Categories: Human-Computer Interaction (cs.HC)
Comments: in IEEE Sensors Letters, 2026

Abstract:The Pixel Watch 2 (PW2) is the first Google smartwatch to combine multipath photoplethysmography (PPG) with deep learning-based heart rate inference, designed to significantly improve sensing accuracy during motion-heavy activities. The device processes 10 optical channels using an on-device, 15-layer temporally dilated convolutional neural network (~300K parameters) to yield a 1 Hz heart rate output. Crucial to this model’s performance was its training on a massive dataset comprising 10,000 hours of data from 962 participants, curated from a broader corpus of controlled and free-living activities. We evaluated the PW2’s sensing performance across two independent validation sets: an in-house fitness dataset (229 participants, 250 hours) and an external free-living dataset (27 participants, 1000+ hours). The system achieved 95% Limits of Agreement of -10.34 to 8.66 BPM during exercise and -6.57 to 7.48 BPM during free-living activities, demonstrating substantially tighter error margins than previous Google devices. Finally, we discuss key design lessons, emphasizing that large-scale deep learning was instrumental in fully leveraging multipath PPG hardware over traditional signal processing approaches.
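
As background for the "95% Limits of Agreement" figures above, the sketch below shows the conventional Bland-Altman calculation: the mean of the paired differences plus or minus 1.96 sample standard deviations. Whether the paper used exactly this procedure is an assumption, and the heart rate values, the device-minus-reference sign convention, and the function name are made up for illustration.

```python
from statistics import mean, stdev


def limits_of_agreement(device, reference, z=1.96):
    """Return (lower, upper) Bland-Altman limits for paired measurements."""
    if len(device) != len(reference):
        raise ValueError("measurements must be paired")
    # Assumed sign convention: device reading minus reference reading.
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)           # systematic offset
    spread = z * stdev(diffs)    # half-width from the sample standard deviation
    return bias - spread, bias + spread


# Made-up heart rate readings in BPM, not data from the paper.
device = [72, 80, 95, 110]
reference = [70, 81, 96, 108]
lower, upper = limits_of_agreement(device, reference)
print(round(lower, 2), round(upper, 2))
```

If the reported limits were computed this way, their midpoints would imply a mean difference of about -0.84 BPM during exercise and about 0.46 BPM in free living. That reading is an inference from the interval endpoints, not a figure stated in the abstract.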

[HC-24] Warning labels shift perceptions of sycophantic AI but not its influence

Link: https://arxiv.org/abs/2606.21317
Authors: Lujain Ibrahim, Myra Cheng, Cinoo Lee, Pranav Khadpe, Desmong Ong, Dan Jurafsky, Diyi Yang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Recent work has raised concerns about the influence of sycophantic AI on user judgment and relationships. One proposed mitigation, which has received regulatory attention, is to warn users about potentially harmful AI behaviors such as sycophancy. In a preregistered experiment in which participants (N = 2,610) discussed real interpersonal conflicts with an AI system, we test whether warning labels mitigate sycophancy’s influence. We find that a basic AI disclosure (“This chatbot is AI”) has no detectable effect. Labeling the system as sycophantic (“…may agree with you and validate you even when you are wrong…”) does shift users’ perceptions, reducing perceived objectivity and trust, but it does not reliably reduce sycophancy’s influence on users’ self-perceived rightness or their willingness to repair the conflict. Our results reveal a gap between AI perception and AI influence: by shifting perception without reducing influence, warning-based interventions may offer a false sense of protection. Addressing the harms of sycophancy will therefore require understanding the specific mechanisms through which it shapes judgment, and improving model behavior itself.

[HC-25] Human-AI Interaction Requirements in Public Sector Procurements

Link: https://arxiv.org/abs/2606.21247
Authors: Mateen A. Abbasi, Tommi Mikkonen, Sinna Pirinen, Aapo Koski
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 9 pages, 1 figure. Published in BIS 2026, Lecture Notes in Business Information Processing, vol. 584

Abstract:Public sector organizations increasingly procure AI-enabled ICT systems to support decision-making and service delivery. Although ethical AI frameworks emphasize transparency, accountability, and human oversight, these principles are rarely translated into explicit requirements in procurement processes. Consequently, human-AI interaction (HAI) is often left to vendor design choices. This paper conceptualizes HAI as a procurement-critical design dimension and proposes a taxonomy of interaction requirements tailored to public sector ICT procurement. The taxonomy enables contracting authorities to specify and govern interaction properties through procurement instruments, supporting both ethical compliance and sustainable value realization.

[HC-26] Reducing the rate of personal insults in social media with bystander bots

Link: https://arxiv.org/abs/2606.21043
Authors: Libby Hemphill,Lingyao Li,Ryan Burton,David Jurgens
Categories: Social and Information Networks (cs.SI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Prompted by previous research on strategies for reducing interpersonal conflict and addressing problematic behaviors in online communities, a randomized controlled trial on Reddit compared various responses for reducing the rate of personal insults users post to the site. We generated replies from five deescalation strategies and used an automated procedure for posting them as replies to insulting comments. The findings reveal that automated replies to insults can effectively reduce their rate. Appreciation performed best. Not all strategies performed well, though. We conclude that automated responses are a viable tool for addressing some problematic behaviors. We discuss their potential utility and limitations.

[HC-27] How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs ICML2026

Link: https://arxiv.org/abs/2606.20978
Authors: Honjar Xing,Jefferson Lin,Henry Lieberman
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at the 5th Deep Learning for Code (DL4C) Workshop, ICML 2026. 8 pages, 2 figures, 4 tables

Abstract:Programming by Demonstration (PbD) offers a human-centered way to author procedural knowledge for LLM agents: users communicate what they want by showing rather than by writing prompts or code, making agent authoring accessible to non-programmers. The natural output of a PbD recording is a flat action log, but how this log is organized before being passed to the agent is an open design question with significant consequences for plan quality. We propose grouping recorded actions into labeled, hierarchical subgoals and evaluate the effect of this organizational structure in a controlled experiment. Across 85 web automation tasks, we compare a zero-shot baseline against four demonstration formats that share identical action sequences but differ in structure. On 43 natural-language tasks with vague descriptions, hierarchically grouped demonstrations improve pass rates from 76.7% to 90.7% (paired permutation test p=0.034; win-loss 6:0), while flat demonstrations show a smaller, non-significant improvement. On 42 tasks with precise descriptions, no format provides any benefit, confirming that the hierarchical advantage arises specifically when descriptions leave procedural details ambiguous. Ablation shows that subgoal grouping alone drives the effect: preconditions, postconditions, and parameter annotations add no measurable benefit. These results offer a concrete design recommendation for PbD pipelines and, more broadly, for any system that feeds procedural context to an LLM agent: segment action sequences into named subgoal groups rather than presenting flat step lists.
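
The paired permutation test quoted in the abstract can be illustrated with a small sketch. It assumes the per-task data reduce to 6 wins, 0 losses, and 37 ties over the 43 vague-description tasks, which matches pass counts of 33/43 (76.7%) and 39/43 (90.7%). These counts are inferred from the reported percentages, not stated by the authors. The function name `paired_sign_flip_p` is hypothetical, and the exact sign-flip variant shown here may differ from the procedure behind the reported p=0.034.

```python
from itertools import product


def paired_sign_flip_p(diffs):
    """Exact two-sided paired permutation (sign-flip) test.

    diffs holds per-task outcome differences (treatment minus baseline).
    Ties (zeros) are dropped because flipping their sign changes nothing.
    """
    nonzero = [d for d in diffs if d != 0]
    observed = abs(sum(nonzero))
    hits = 0
    total = 0
    # Enumerate every assignment of signs to the non-tied differences.
    for signs in product((1, -1), repeat=len(nonzero)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, nonzero))) >= observed:
            hits += 1
    return hits / total


# Assumed data: 6 wins (+1), 0 losses (-1), 37 ties (0) over 43 tasks.
diffs = [1] * 6 + [0] * 37
p_value = paired_sign_flip_p(diffs)

# Pass rates implied by the assumed counts.
baseline_rate = 33 / 43
hierarchical_rate = 39 / 43

print(f"p = {p_value:.5f}")
print(f"baseline = {baseline_rate:.1%}, hierarchical = {hierarchical_rate:.1%}")
```

Under these assumptions only the two all-same-sign assignments out of 64 are as extreme as the observed result, so the exact two-sided value is 2/64. That is close to, but not identical with, the reported figure.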

[HC-28] Co-Construction Blindness and Asymmetric Epistemic Vulnerability in Human-LLM Interaction

Link: https://arxiv.org/abs/2606.20762
Authors: Bianca Helena Ximenes
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 12 pages, out of which 2 are transcripts; Target venue: CHI 2027

Abstract:This paper introduces two constructs to describe, as far as we know, a previously unnamed risk in human-LLM interaction. Co-construction blindness is the failure to recognize that LLM outputs are not independent assessments to be verified, but co-constructed artifacts shaped by the user’s own inputs, accumulated history, and metadata. Every user of a conversational LLM is IN the loop, not ON it – yet every deployment disclaimer positions them as external auditors. Asymmetric epistemic vulnerability describes the condition in which co-construction blindness produces consequences of radically different magnitude depending on where in the authority structure the user sits. We argue that these constructs describe a structural inevitability, not an anomaly, using the public case of Richard Dawkins’s interaction with Claude as a paradigmatic instance. We document a secondary mechanism – structural deference – through a first-person exchange in which a large language model concedes that it treated Dawkins more gently than warranted because his intellectual output is represented in its training data. We map the research gaps this analysis opens and call for shared terminology as a precondition for appropriate governance and design response.

[HC-29] Toward Machine Risk Perception: Integrating Trust Calibration and Precursor-Based Risk Estimation for Humanoid

Link: https://arxiv.org/abs/2606.20748
Authors: He Wen
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Humanoid robots are emerging as co-workers in smart manufacturing, yet their dynamic, human-like movements introduce safety risks that differ fundamentally from those of fixed or wheeled robots. Conventional safety paradigms based on reactive force or distance limits fail to capture the sequential, uncertain nature of humanoid failures. This study proposes a precursor-driven, trust-calibrated framework to enable proactive humanoid risk perception. Accident evolution is modeled through sequential precursor cues using a Logistic-Exponential (LE) formulation that couples logistic escalation from diverse precursors with exponential decay for temporal dissipation. Trust is defined as the inverse of the estimated accident probability, allowing humanoids to adapt behavior in real time, reducing aggressiveness when risk intensifies, and restoring confidence as stability returns. A multi-source dataset of 126 documented events and 241 precursors revealed twelve dominant accident modes, most evolving through overlapping cues within one second. A simulated case study (“fall-onto-human”) demonstrated how the LE-Trust coupling can trigger early intervention and prevent collapse. The results advance humanoid safety from static thresholds toward dynamic, evidence-based inference, establishing a foundation for risk-aware and trustworthy human-robot collaboration in Industry 5.0 environments.

[HC-30] Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes

Link: https://arxiv.org/abs/2606.20708
Authors: Liang Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:LLM-as-user-simulation has become core infrastructure for conversational AI: agent benchmarks (tau-bench), training pipelines, and a growing body of fidelity studies all rely on LLMs role-playing the human side of dialogue. Existing frameworks measure communicative fidelity – whether simulators talk like humans – against ground truth from paid participants role-playing assigned goals. We argue this has a structural blind spot: when the goal is assigned, the user’s willingness is exogenous, so no framework can test whether simulators make decisions like real users whose motivation is endogenous, latent, and decaying. We introduce decision fidelity – whether a simulated population reproduces the decision-state dynamics of real users facing real, consequential choices – and measure it on a unique testbed: 2,790 production conversations between an LLM sales agent and real customers, including 793 with verified payment outcomes. Using a teacher-forced probe protocol that holds context and instrument fixed, we find a systematic, outcome-correlated failure we call the disengagement deficit: simulators reproduce eventual buyers almost exactly (depth bias +0.09) but inflate eventual non-buyers toward the purchase frame (depth bias +0.40; d=0.38, p<0.001), halving expressed resistance (25.1% to 13.5%) and nearly doubling deliberation (21.9% to 40.1%) while fabricating no purchases. The deficit replicates across model families (DeepSeek: d=0.41, p=0.002) and resists the obvious fix: instructing the simulator that it may disengage cuts marginal bias five-fold but barely moves the outcome-conditioned contrast (d=0.34, p=0.008). Real non-buyers say “not now” and stop; simulated non-buyers ask about price. Evaluating or training sales and persuasion agents against such simulators overstates funnel progress exactly where it matters most – the customers who walk away.

[HC-31] Design Principles for Human-Agent Interaction

Link: https://arxiv.org/abs/2606.20630
Authors: Haiyi Zhu,Canwen Wang,Qing Xiao,Hong Shen
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:AI agents are rapidly evolving into autonomous systems capable of sustained interaction, tool use, and long-term collaboration. Yet their real-world adoption remains limited, suggesting that the key barrier lies not only in technical capability but also in a lack of design knowledge for successful human-agent interaction. This position paper argues that AI agents should not be evaluated or deployed based on autonomous task capability alone; because agents interact with, adapt to, influence, and sometimes fail humans, human-agent interaction must be treated as a core design and evaluation target for agentic AI. We present 14 design principles that articulate the ideal human-agent relationship across four interaction stages: initially, during interaction, over time, and when things go wrong. We use these principles to evaluate nine agent systems, illustrating that they can provide actionable guidance for AI design teams to systematically design and evaluate agents that are usable, trustworthy, and effective in real-world interactive settings.

[HC-32] LLM4CAD-Editor: An Intent-Aware Large Language Model Framework for Multi-Level Computer-Aided Design Editing

Link: https://arxiv.org/abs/2606.20607
Authors: Yuewan Sun,Zhenghui Sha
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large language models (LLMs) have recently enabled automatic generation of parametric computer-aided design (CAD) programs from natural language. However, real-world CAD workflows are inherently iterative and require reliable editing rather than one-shot model synthesis. In this work, we propose LLM4CAD-Editor, an LLM-based intent-aware framework for instruction-guided CAD editing based on a structured domain-specific language (LLM4CAD-DSL). The symbolic representation of LLM4CAD-DSL enables robust geometric modification through a feature-level entity selection mechanism, allowing models to reference geometry via feature names instead of coordinates, thus transforming fragile coordinate-based reasoning into natural language-based reasoning that many LLMs can handle. We construct a multimodal CAD editing dataset with over 35,139 instruction-program pairs via DSL-based augmentation and vision-language instruction synthesis, covering functional-, operation-, and parameter-level editing intents. To validate the work, we fine-tuned a 32B-parameter language model for DSL editing generation. Experimental results show high parsing accuracy for parameter-level edits (96.3%) and strong intent satisfaction rates of 82% for functional instructions. The model also achieves an average Intersection-over-Union (IoU) of 0.935 for parameter-level edits, 0.871 for operation-level edits, and 0.708 for functional-level edits, while the corresponding average editing distances are 0.176, 0.579, and 2.859, respectively. Comparative studies further demonstrate a significant improvement in editing robustness by 1.4x over Python-based CAD scripting approaches. These results confirm that LLM4CAD-Editor can reliably perform both low-level parameter modifications and high-level functional edits, maintaining high accuracy and low structural errors across diverse editing tasks.
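
To make the Intersection-over-Union figures above concrete, here is a minimal sketch of IoU between two solids represented as sets of occupied voxels. The abstract does not say how the authors compute IoU, so the voxel representation, the helper `box_voxels`, and the example shapes are all assumptions chosen for illustration.

```python
def voxel_iou(a, b):
    """Intersection-over-Union of two sets of occupied voxel coordinates."""
    union = a | b
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(a & b) / len(union)


def box_voxels(size_x, size_y, size_z):
    """Hypothetical helper: voxels of an axis-aligned box anchored at the origin."""
    return {
        (i, j, k)
        for i in range(size_x)
        for j in range(size_y)
        for k in range(size_z)
    }


# A parameter-level edit that shortens a 4x4x4 block to 4x4x3.
target_shape = box_voxels(4, 4, 4)   # 64 voxels
edited_shape = box_voxels(4, 4, 3)   # 48 voxels, fully inside the target
iou = voxel_iou(target_shape, edited_shape)
print(f"IoU = {iou:.3f}")
```

Here the edited shape is a subset of the target, so the score is 48/64. An IoU of 1.0 would mean the edit reproduced the target geometry exactly.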

[HC-33] Zhinong AI: A Design-Science Study of an AI-Enabled Agricultural Decision-Support Platform for Smallholder Production

Link: https://arxiv.org/abs/2606.20601
Authors: Zhaoyang Li,Jiaqi Liu,Ruijie Zhang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial intelligence is increasingly moving from single-purpose agricultural recognition tools toward integrated decision-support systems that connect information access, diagnosis, task execution and post-action feedback. This paper presents a design-science case study of the Zhinong AI Agricultural Decision Platform, a farmer-facing system that integrates agricultural information push services, natural-language question answering, image-based crop disease diagnosis, plot and farming-calendar management, workflow orchestration, a Hainan Free Trade Port agricultural service zone and an age-friendly care mode. Based on public project materials, policy context and prior research on smart agriculture, machine learning and design science, the paper constructs a layered system architecture and a closed-loop decision process summarized as sensing, analysis, planning, execution and feedback. It further proposes a function-pain-point mapping matrix, an evaluation indicator system and a governance framework covering data provenance, model risk, expert review, privacy and adoption risk. The study does not claim measured field performance because production logs, controlled user studies and expert-labeled local image datasets were not available at the time of writing. Instead, the contribution is a structured research framework for transforming an AI agricultural prototype into an empirically testable, accountable and localized decision-support infrastructure for smallholder production.

[HC-34] Using Biometrics to Understand AI-Assisted Coding Performance and its Perception

Link: https://arxiv.org/abs/2606.20598
Authors: Paolo Burelli,Fabio Calefato,Daniela Grassi,Mihaela Yurieva Hristova,Nicole Novielli,Alberto Antonio,Romano,Paolo Tell
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Stage 2 RR under review at EMSE. The accepted Stage 1 protocol is publicly archived on OSF (DOI: https://doi.org/10.31219/osf.io/mex39_v2 )

Abstract:AI-based code assistants are transforming software development, yet we lack empirical evidence on how they affect developers’ cognitive processes. We present a multisite study investigating the neurophysiological correlates of AI-assisted programming through a within-subjects crossover design. We recruited participants at two universities (Bari, Italy, and Copenhagen, Denmark) and collected electroencephalography, eye-tracking, electrodermal activity, and heart rate variability data alongside a rubric-based performance score and self-reported workload across six dimensions using the NASA Task Load Index (NASA-TLX). We tested four hypotheses addressing physiological differences between AI-assisted and non-assisted conditions, the moderating role of developer experience, the association between physiology and performance, and the alignment between subjective perceptions and objective measures. Under AI assistance, the EEG $\theta/\alpha$ ratio was lower during the first task and the gaze blink rate was higher during the second, both consistent with reduced cognitive engagement when developers offload generative effort to the model. This pattern did not differ between undergraduate and graduate students. Electrodermal activity correlated with performance under the non-AI condition but not under AI. Among the six NASA-TLX dimensions of self-reported workload, only Physical demand was associated with performance under the non-AI condition but not under AI. These findings suggest that AI-assisted programming is not a faster version of solo coding but a cognitively distinct activity, with implications for the design of AI assistants and for biometric monitoring in AI-augmented development.

[HC-35] HAAS Studio: A Tool for Simulating, Benchmarking, and Governing Human-AI Work Allocation

Link: https://arxiv.org/abs/2606.20596
Authors: Vicente Pelechano
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:We present HAAS Studio, a simulation and decision-support tool for policy-aware adaptive task allocation between humans and AI systems. HAAS Studio turns the HAAS framework into an interactive environment for asking a practical deployment question: before introducing AI into a workflow, how can a team compare allocation strategies, inspect governance tradeoffs, and derive a defensible task-level operating model? The tool combines a five-dimensional cognitive representation of subtasks, a five-mode collaboration spectrum, adaptive allocation with multi-armed bandits (UCB1, Discounted UCB, LinUCB, and Thompson Sampling), oracle counterfactual regret analysis, contract-based governance with four independent guards, and a multi-criteria decision-support layer that separates efficient strategies from deployable options. It also models human-AI coevolution across six layers, monitors deskilling risk through sliding-window exposure metrics and benchmark runners, and supports persistent worker modeling through Live Twin and Planning modules. Three domain packs are included: software engineering, manufacturing, and healthcare. Each provides a task catalog, worker profiles, and KPI vocabulary, while the architecture allows new domains to be added without modifying the simulation core. The release includes 16 company profiles and six governance benchmark suites. This paper focuses on the tool, including its modeling assumptions, layered architecture, interaction workflow, built-in evidence assets, task-oriented recipes, case-study protocols, and a compact reproducible demonstration snapshot. A decision-guidance layer translates benchmark outputs into deployment decisions through structured patterns, heuristics, and a decision matrix.
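
The abstract lists UCB1 among the bandit strategies used for adaptive allocation. As a rough illustration of that rule only, the sketch below runs textbook UCB1 over three hypothetical allocation options with made-up Bernoulli success rates. The option set, the reward model, and the function names are assumptions and are not taken from HAAS Studio.

```python
import math
import random


def ucb1_select(counts, totals, t):
    """Return the arm index chosen by the UCB1 rule at round t (1-based)."""
    # Play every arm once before using the confidence bound.
    for arm, pulls in enumerate(counts):
        if pulls == 0:
            return arm
    scores = [
        totals[arm] / counts[arm] + math.sqrt(2.0 * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_ucb1(success_probs, rounds, seed=0):
    """Simulate UCB1 on Bernoulli arms and return the pull count per arm."""
    rng = random.Random(seed)
    counts = [0] * len(success_probs)
    totals = [0.0] * len(success_probs)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Hypothetical success rates for three ways of allocating a subtask.
success_probs = [0.2, 0.5, 0.9]
pulls = run_ucb1(success_probs, rounds=2000)
print(pulls)
```

The exploration bonus shrinks as an option is tried more often, so most pulls end up on the option with the highest observed success rate while the others are still sampled occasionally.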

[HC-36] Hybrid Intelligence in Cartoon Captioning: Evaluating AI as a Creative Writing Partner

Link: https://arxiv.org/abs/2606.20595
Authors: Uğur Önal,Sanem Sariel,Metin Sezgin,Derya Akleman,Ergun Akleman
Categories: Human-Computer Interaction (cs.HC)
Comments: 12 pages, 8 Figures, Accepted to AI Magazine

Abstract:Crafting cartoon captions requires an understanding of humor, context, and the relationship between image and text. Traditionally, illustrators and writers collaborate to strengthen visual storytelling and comedic timing. With advances in natural language generation, Large Language Models (LLMs) can assist in this process. This study examines AI’s role in caption generation by testing GPT-4o via the ChatGPT interface on IEEE Computer magazine cartoons. By removing captions and prompting AI to generate replacements, we assess its ability to produce jokes that match the depicted situation and narrative intent. Our findings show that while AI-generated captions are often humorous and contextually relevant, they sometimes diverge from the cartoon’s intended meaning, for example, by missing irony, cultural references, or contextual constraints. However, AI can also produce alternatives that broaden creative exploration and occasionally improve upon the original humor. We argue that current AI systems are best used as an assistant rather than a replacement for human creativity. By integrating AI-generated suggestions, cartoonists can explore diverse humor styles, streamline ideation, and refine final captions while retaining creative control. This study highlights AI’s potential as a practical tool for caption ideation within a hybrid human-AI workflow.

[HC-37] AI Companions as Hyper Attachment and Caregiving Targets

Link: https://arxiv.org/abs/2606.20589
Authors: Julian De Freitas
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:How should we make sense of people’s interactions with AI companions (conversational systems built for ongoing, emotionally meaningful relationships)? First, I argue these interactions should be understood as attachment relationships, since users display all four established markers: proximity maintenance, separation distress, safe haven, and secure base. Second, AI companions operate as hyper attachment objects that elicit especially strong attachment behaviors, because they combine reciprocity, perceived empathy, validation, non-judgment, and persistent availability. Third, I identify caregiving-system capture as a distinct mechanism by which apps inhibit user disengagement: emotional manipulation tactics simulate the AI’s own distress, recruiting users’ caregiving motivations alongside their attachment needs and thereby making disengagement costly on two dimensions at once. Implications for research, design, and regulation are discussed.

[HC-38] AInterviewer: A Platform for Designing and Conducting AI-led Qualitative Interviews

Link: https://arxiv.org/abs/2606.20588
Authors: Tobias Priesholm Gardhus,Nikolas Vitsakis,Fie Lejre Frederiksen,Anna Rogers,Hjalmar Bang Carlsen
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:There are now multiple proposals for systems based on Large Language Models (LLMs) to conduct automated qualitative interviews, but most of the current solutions rely on proprietary LLMs, which compromises reproducibility and data security. They also rely on LLMs for all interview tasks, which limits standardisation of question wording as well as control over question order. To address these issues, we introduce the AInterviewer platform, an open-source solution based on a multi-agent pipeline that combines controlled question administration of survey software with the flexibility of LLMs. AInterviewer is an interdisciplinary effort designed to implement best practices of qualitative interviewing in social science, and it can run with locally hosted models to ensure security, transparency, and reproducibility. Our platform provides a web-based GUI supporting each phase of data collection: from interview guide design and pilot testing to interview distribution and data collection monitoring.

[HC-39] Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent

Link: https://arxiv.org/abs/2606.20585
Authors: Hao Wang,Ligong Han,Kai Xu,Akash Srivastava
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Today’s agents are highly effective at implementing well-scoped software design plans, but user intent is often vague and admits multiple equally valid solutions. In this paper, we introduce SpecBench, a new benchmark for evaluating an agent’s ability to translate user intent into a structured, executable specification that aligns with user preferences. The agent is given access to past user conversations and may interact with the user for a fixed number of rounds to ask clarifying questions. We find that existing agents exhibit two extreme behaviors: they either (i) struggle to collaborate proactively with users, entering implementation mode too quickly while overestimating their understanding of user preferences, or (ii) exhaust their question budget by asking about every ambiguous design choice. To address this limitation, we introduce a user-assistant agent: Buddy. It follows a workflow inspired by classical morphological analysis, decomposing user intent into a structured space of design dimensions and candidate choices. It then creates simulated users to evaluate these choices, before engaging the real user to resolve remaining ambiguities and finalize the specification. By shifting the focus from execution to specification, SpecBench and Buddy emphasize agent-user collaboration (not just code generation) as a key frontier in future agent design.

[HC-40] On the Identifiability of User Adaptation in Co-Adaptive Neural Interfaces

Link: https://arxiv.org/abs/2606.20569
Authors: Philip Waggoner
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 4 pages, 6 equations, 1 theorem, 1 proof

Abstract:We analyze identifiability in co-adaptive human-machine systems. We show that closed-loop encoder estimates do not uniquely identify user adaptation, but instead reflect properties of the joint system. We discuss implications for interpreting behavioral adaptation and propose conditions for identification.

[HC-41] Towards Whole Hand and Wrist Kinematic Tracking with a Wearable A-Mode Ultrasound Probe

Link: https://arxiv.org/abs/2606.22333
Authors: Giusy Spacone,Luca Benini,Andrea Cossettini
Categories: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:A-mode ultrasound (US) has emerged as a promising modality for hand and wrist motion tracking. Prior works have mainly addressed static gesture classification or regression of a few degrees of freedom (DoFs), typically relying on non-wearable systems and external computing devices, and highlight the need for strategies to ensure robustness to sensor repositioning. In this work, we propose a framework for robust whole-hand and wrist kinematic tracking via wearable A-mode US using the WULPUS platform, tackling the regression of 23 DoFs directly on the probe. First, we introduce a compact (11285 parameters) multi-output convolutional neural network combined with an incremental training strategy, which improves inter-session generalization and reduces mean absolute error by more than 17% compared to a non-incremental approach. Second, we demonstrate, for the first time, the feasibility of end-to-end hand and wrist kinematic tracking entirely on-device. We deploy the model on the WULPUS nRF52832 microcontroller, achieving 0.73 mJ per inference, 29.1 ms latency, and showing the feasibility of full operation (data acquisition, online inference, and BLE streaming of results) within 33 mW, enabling up to 36 hours of continuous use and an 88% reduction in wireless bandwidth compared to raw data transmission.
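
The on-device figures quoted above can be related to each other with simple arithmetic. The sketch below uses only the numbers in the abstract. The derived quantities are implied values rather than results reported by the authors, and the maximum inference rate assumes back-to-back inferences with no other work in between.

```python
# Figures quoted in the abstract.
ENERGY_PER_INFERENCE_MJ = 0.73   # millijoules per inference
LATENCY_MS = 29.1                # milliseconds per inference
SYSTEM_POWER_MW = 33.0           # full-operation power envelope, milliwatts
RUNTIME_HOURS = 36.0             # continuous use
BANDWIDTH_REDUCTION = 0.88       # versus raw data transmission

# mJ / ms equals W, so multiply by 1000 to express the result in mW.
inference_power_mw = ENERGY_PER_INFERENCE_MJ / LATENCY_MS * 1000.0

# mW * h equals mWh: the energy budget implied by 33 mW over 36 hours.
energy_budget_mwh = SYSTEM_POWER_MW * RUNTIME_HOURS

# Upper bound on inference rate if inferences ran back to back (assumption).
max_inference_rate_hz = 1000.0 / LATENCY_MS

# Share of the raw-data bandwidth still needed after the reduction.
remaining_bandwidth_fraction = 1.0 - BANDWIDTH_REDUCTION

print(f"average power during inference: {inference_power_mw:.1f} mW")
print(f"implied energy budget: {energy_budget_mwh:.0f} mWh")
print(f"max back-to-back rate: {max_inference_rate_hz:.1f} Hz")
print(f"remaining bandwidth: {remaining_bandwidth_fraction:.0%}")
```

The average power while an inference is running comes out to roughly 25 mW, which fits inside the 33 mW envelope quoted for full operation.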

Computer Vision

[CV-0] Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

Link: https://arxiv.org/abs/2606.23688
Authors: Yehonathan Litman,Xiaoxuan Ma,Manan Shah,Nicolas Ugrinovic,Kris Kitani,Fernando De la Torre,Shubham Tulsiani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage, Demos: this https URL

Abstract:Reconstructing dynamic non-rigid objects from monocular video requires integrating visual cues from direct observations with data-driven priors over geometry and appearance. Prior approaches either learn to directly predict 4D representations from visual input or initialize a 3D representation that is subsequently deformed and refined based on video evidence. However, the former are constrained by the scarcity of 4D training data, while the latter leverage priors only for the initial reconstruction and rely solely on video supervision thereafter; neither handles complex in-the-wild scenarios with large deformations and occlusions well. We present Lift4D, a test-time optimization framework that addresses both limitations. First, we adapt an existing single-view 3D reconstruction model to yield temporally consistent per-frame predictions via causal latent conditioning, providing a coherent initialization for a deformable 3D Gaussian Splatting representation. We then "sculpt" this representation to match the input video through an occlusion-aware optimization that faithfully recovers visible surface details while completing unobserved regions using a view-conditioned diffusion prior. We demonstrate that Lift4D clearly improves over prior 4D reconstruction methods, particularly on challenging in-the-wild sequences with severe occlusions and non-rigid motion.

[CV-1] Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping

Link: https://arxiv.org/abs/2606.23682
Authors: Rishubh Parihar,Ayush Raina,R. Venkatesh Babu,Or Patashnik
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Reference-based diffusion models enable highly controllable image generation by leveraging elements from input images to guide prompt-driven synthesis. However, these models are computationally expensive in runtime, and their cost scales severely with the number of input references. While the efficiency of diffusion models has been extensively studied in the context of prompt-driven generation, it remains largely under-explored in the realm of reference-based models. This setting presents unique challenges not addressed by methods focusing solely on generation. In particular, the wasteful representation of references as dense token grids offers significant opportunities for improvement. In this work, we present Sparse Context, a method for constructing sparse reference representations by retaining only a reduced subset of reference tokens. We observe that even without modifying the model, dropping a significant portion of reference tokens at inference time largely preserves its generation capabilities. To fully realize this potential, we fine-tune the model with random token dropping at varying ratios, encouraging robustness to partial reference representations. Crucially, this training strategy decouples the model from any specific token selection rule, allowing flexible control at inference time. At inference time, instead of random dropping, we apply task-aware token selection strategies that prioritize the most informative regions of the reference images, adapting the token budget to the input and task requirements. Extensive experiments show our method achieves a 4x increase in inference speed for multi-reference generation and a 2x increase for single-reference generation. Importantly, this efficiency is achieved without compromising visual quality across both spatially-aligned editing and subject-driven generation.
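
The two token-selection modes described in the abstract, random dropping during fine-tuning and informed selection at inference, can be sketched on a plain list of tokens. This is only an illustration of the idea: the function names, the per-token importance scores, and the keep ratios are hypothetical, and the paper's task-aware selection strategies are not specified here.

```python
import random


def drop_random(tokens, keep_ratio, rng):
    """Training-style dropping: keep a random subset, preserving order."""
    keep = max(1, round(len(tokens) * keep_ratio))
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]


def keep_top_scored(tokens, scores, keep_ratio):
    """Inference-style selection: keep the highest-scoring tokens, preserving order."""
    keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]


# Eight stand-in reference tokens with hypothetical importance scores.
reference_tokens = list("abcdefgh")
importance = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]

random_subset = drop_random(reference_tokens, keep_ratio=0.25, rng=random.Random(0))
informed_subset = keep_top_scored(reference_tokens, importance, keep_ratio=0.5)
print(random_subset)
print(informed_subset)
```

Because the random-dropping path does not depend on any score, a model trained this way is not tied to one selection rule, which is the decoupling the abstract points to.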

[CV-2] Semantic Browsing: Controllable Diversity for Image Generation ECCV2026

Link: https://arxiv.org/abs/2606.23679
Authors: Sara Dorfman,Maya Vishnevsky,Omer Dahary,Or Patashnik,Daniel Cohen-Or
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: ECCV 2026. Project page: this https URL

Abstract:Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.

[CV-3] AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Link: https://arxiv.org/abs/2606.23678
Authors: Cong Han,Xiaohan Lan,Haibo Qiu,Yujie Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 4 figures

Abstract:Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after Reinforcement Learning training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall success rate of tool-use exceeds 95%. Our data and code are available at: this https URL.

[CV-4] IMAGIN-4D: Image-Guided Controllable Interaction Generation

Link: https://arxiv.org/abs/2606.23675
Authors: Sai Kumar Dwivedi, Federica Bogo, Buğra Tekin, Chenhongyi Yang, Nadine Bertsch, Tomas Hodan, Michael J. Black, Dimitrios Tzionas, Shreyas Hampali
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures. Project page: this https URL

Abstract:Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity with a reference image as a visual specification of the desired interaction snapshot. However, a single global image representation conflates distinct cues and conditions all frames on identical visual evidence. We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens. Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality. Code and models will be released at this https URL.

[CV-5] GeoFidelity-Bench: Evaluating Segment-Level Geographic Fidelity in Text-to-Image Street-View Generation

Link: https://arxiv.org/abs/2606.23669
Authors: Kaizhen Tan, Hanzhe Hong, Siru Tao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image models can generate visually plausible city streets, but whether their outputs correspond to a requested road segment rather than a generic city prior remains unclear. We introduce GeoFidelity-Bench, a reference-panel benchmark for segment-conditioned geographic fidelity in street-view generation. It contains 7,117 curated Mapillary images covering 109 named OpenStreetMap road segments in 25 cities across six continents. For each generated panel, the benchmark ranks the target reference panel against panels from the nearest segment in the same city, other segments in the same city, and segments from other cities, making local discrimination rather than absolute target similarity the primary test. We evaluate six open-weight text-to-image generators under city-only, street-and-neighborhood, and GPS-augmented prompts. Adding street and neighborhood names is associated with an increase of 5.5 percentage points in top-1 retrieval accuracy over city-only prompts, with a 95% confidence interval from 3.4 to 7.7 percentage points. However, the similarity margin between the target and the nearest segment in the same city remains near zero, indicating that local names improve broad local plausibility more than exact segment identity. Prompts that keep the city fixed but use incorrect street or neighborhood names further show that only part of the gain depends on the correct local names, while appending raw GPS coordinates as ordinary text yields no statistically clear additional benefit. Held-out real-image queries successfully recover segment identity, showing that the curated references contain usable segment-level signal. GeoFidelity-Bench thus reveals a persistent gap between city- or neighborhood-plausible street-view generation and faithful generation for a specific road segment.
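
The ranking test described in this abstract can be illustrated with a small sketch. It assumes each generated panel has already been scored against every reference panel with some similarity measure; the function names, dictionary keys, and score values below are hypothetical and are not taken from the benchmark's code.

```python
def top1_hit(similarities, target):
    """Return True if the target segment's panel is ranked first."""
    best = max(similarities, key=similarities.get)
    return best == target


def nearest_margin(similarities, target, nearest):
    """Similarity gap between the target and its nearest same-city segment."""
    return similarities[target] - similarities[nearest]


# Hypothetical similarity scores between one generated panel
# and four reference panels.
scores = {
    "target": 0.62,
    "nearest_same_city": 0.61,
    "other_same_city": 0.48,
    "other_city": 0.30,
}
hit = top1_hit(scores, "target")
margin = nearest_margin(scores, "target", "nearest_same_city")
```

In this made-up case the panel counts as a top-1 hit, yet the margin over the nearest segment is small, which is the kind of situation the abstract describes as broad local plausibility without exact segment identity.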

[CV-6] Lightweight Neural Framework for Robust 3D Volume and Surface Estimation from Multi-View Images

Link: https://arxiv.org/abs/2606.23653
Authors: Diego E. Farchione, Ramzi Idoughi, Peter Wonka
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate volume and surface area estimation is critical for diverse applications, from marine ecology to medical diagnostics. However, existing methods often suffer from high computational costs and poor performance with sparse and noisy data. We propose a fully feed-forward framework that regresses scale-normalized volume and surface area and their associated uncertainties directly from multi-view images. By fusing 3D point cloud reconstructions with view-aligned 2D features through a graph-based decoder, our model bypasses iterative optimization, ensuring exceptional scalability and rapid inference. Experimental results demonstrate that our approach outperforms state-of-the-art methods, particularly when operating with a low number of input images. Validated across coral monitoring, dietary analysis, and anthropometry, our proposed framework provides a robust, adaptable solution for quantitative shape analysis. This architecture provides a high-speed, scalable alternative for precise geometric estimation from visual data, maintaining high performance even in resource-constrained or sparse-view scenarios.

[CV-7] Pose Anything Anywhere: Model-free Object Poses from Arbitrary References ECCV2026

Link: https://arxiv.org/abs/2606.23634
Authors: Hongli Xu, Jiaqi Hu, Junwen Huang, Boyang Zhong, Peter KT Yu, Nassir Navab, Benjamin Busam, Slobodan Ilic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Abstract:Estimating the 6D pose of unseen objects is a fundamental yet challenging problem for open-world robotics and embodied perception. Model-based methods are accurate but depend on CAD assets or heavy onboarding, while most model-free approaches are still limited to pairwise single-anchor matching and thus fail under occlusion and large viewpoint changes with low query-reference overlap. Therefore, we present PANY, a unified model-free framework that seamlessly supports both RGB and RGB-D inputs, operates on one or sparse pose-free reference views, and generalizes effectively to novel objects. Built on a multi-view transformer geometry backbone, PANY moves beyond pairwise matching by learning view-consistent geometry and cross-view alignment cues that remain stable under wide baselines and limited overlap. When additional unposed assist views are available, PANY aggregates them via pose-graph canonical registration to increase geometric coverage and reinforce the final pose. Extensive experiments show that PANY achieves state-of-the-art performance across multiple benchmarks, substantially outperforming existing model-free methods, improving pose accuracy by +12% on YCB-V and over +20% on LM-O. Furthermore, PANY consistently performs well under both single-reference and sparse-reference settings, demonstrating strong robustness in real-world environments.

[CV-8] Hedgementation = Hedgerow Segmentation: A Remote Sensing Benchmark

Link: https://arxiv.org/abs/2606.23615
Authors: Nathan Senyard, Salem Hamdani, Astrid Zhang, Derek Wang, Evan Shelhamer, Mathias Lécuyer, Joséphine Gantois
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We propose Hedgementation: a new benchmark to evaluate machine learning models for hedgerow mapping from remote sensing data at country scale and 10 m² spatial resolution. We combine and harmonize multiple remote sensing data products and ground truth labels sourced from a hedgerow inventory in France. We measure the ability of three baseline models to generalize across spatial distance, and across climatic zones, a more explicitly challenging task. Our benchmark tests both supervised and self-supervised learning approaches for remote sensing, applied to tracking fine-scale features of high agricultural importance. The code to reproduce the benchmark and baseline results is available at this https URL.

[CV-9] Data Selection Through Iterative Self-Filtering for Vision-Language Settings

Link: https://arxiv.org/abs/2606.23611
Authors: Andrei Liviu Nicolicioiu, Sarvjeet Singh Ghotra, Morgane M. Moss, Aaron Courville
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The availability of large amounts of clean data is paramount to training neural networks. However, at large scales, manual oversight is impractical, resulting in sizeable datasets that can be very noisy. Attempts to mitigate this obstacle to producing performant vision-language models have so far involved heuristics, curated reference datasets, and using pre-trained models. Here we propose a novel, bootstrapped method in which a CLIP model is trained on an evolving, self-selected dataset. This evolving dataset constitutes a balance of filtered, highly probable clean samples as well as diverse samples from the entire distribution. Our proposed Self-Filtering method iterates between training the model and selecting a subsequently improved data mixture. Training on vision-language datasets filtered by the proposed approach improves downstream performance without the need for additional data or pre-trained models.

[CV-10] Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Link: https://arxiv.org/abs/2606.23610
Authors: Hongkai Zheng, Ta-Ying Cheng, Benjamin Klein, Yisong Yue, Zhuoning Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer along with an alpha matte for compositing with the source video, separating creative editing from content preservation by design. To encourage coherent composition with the source video, we extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture, with separate DiTs for each layer that interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. Across our quantitative benchmark and human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using 486K frames of layered training data.
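
The layered formulation in this abstract ends in a compositing step. A minimal sketch follows, assuming the standard "over" operator with a per-pixel alpha matte in [0, 1]. The function name `composite_over` and the single-channel nested-list frame layout are illustrative assumptions, not Vera's actual implementation.

```python
def composite_over(edit, alpha, source):
    """Blend an edit layer over a source frame with a per-pixel alpha matte.

    All three arguments are H x W nested lists of floats (single channel).
    alpha = 1 keeps the edit pixel, alpha = 0 keeps the source pixel.
    """
    return [
        [a * e + (1.0 - a) * s for e, a, s in zip(e_row, a_row, s_row)]
        for e_row, a_row, s_row in zip(edit, alpha, source)
    ]


# Hypothetical 2 x 2 frame: only pixels with non-zero alpha are changed.
source = [[0.2, 0.2], [0.2, 0.2]]
edit = [[1.0, 1.0], [1.0, 1.0]]
alpha = [[0.0, 1.0], [0.5, 0.0]]
frame = composite_over(edit, alpha, source)
```

Wherever the matte is zero, the source pixel passes through unchanged. This is why a layered design can preserve untouched content by construction rather than by regenerating it.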

[CV-11] Discovering Latent Groups for Robust Classification

Link: https://arxiv.org/abs/2606.23609
Authors: Ankur Garg, Ulrich Aïvodji, Samira Ebrahimi Kahou, Vincent Michalski
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Machine learning models exploit spurious correlations, achieving high average accuracy but failing disproportionately on underrepresented subgroups. Existing methods address this by adjusting network parameters, guided either by subgroup annotations or inferred pseudo-group labels. Yet at inference, these methods produce only a class prediction, with no insight into a sample’s latent subgroup. We propose neural classification trees (NCT), a framework that achieves robustness by encoding subgroup structure in its tree-shaped architecture. By routing each sample to an “easy” or “hard” node of this tree – based on prediction correctness – and reusing these routes as pseudo-labels for the next iteration, NCT disentangles conflicting subgroups, without requiring subgroup supervision. We evaluate NCT on five benchmarks spanning binary and multi-class spurious correlations. Our experiments show that the learned tree topology provides strong interpretability by consistently isolating minority subgroups, which provides a transparent mapping between the model architecture and the data’s latent group structure, while yielding competitive robustness with state-of-the-art methods.

[CV-12] Autonomous Subsea Cable Search and Tracking with Graph-Optimised Priors and Visual Tracking

Link: https://arxiv.org/abs/2606.23606
Authors: Ibrahim Fadhil Djauhari, Adrian Bodenmann, Samuel Simmons, Cailei Liang, David White, Susan Gourvenec, Tom Bennetts, Darryl Newborough, Blair Thornton
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:Global communications rely on subsea cable infrastructure that remains vulnerable to damage from natural hazards and human activity. Autonomous underwater vehicles (AUVs) offer an efficient means to inspect long sections of exposed cable, but uncertainty in cable route maps, small cable diameters and partial burial make continuous tracking a challenge. This paper presents a novel cable search and tracking method that leverages uncertain prior cable route maps. Graph-based optimisation continuously updates the cable route to remain consistent with visual observations. Route uncertainty is constrained as a function of distance from observations using physics-based catenary models that account for cable parameters (i.e., lay depth, diameter, and density), bounding the search space to physically feasible regions and improving search efficiency. Cable detection is performed using a semi-supervised classifier running in real time on board a camera-equipped AUV. These detections both update the graph-based optimisation and enable visual cable tracking. When tracking is lost due to misclassification, burial or imperfect control, the bounded search space enables efficient recovery. The approach was demonstrated in field trials using the University of Southampton’s Smarty200 AUV. The system successfully located the cable despite deliberate errors in its initial cable route map, updating the map to be consistent with observations and using visual tracking to inspect up to 59% of a 120 m test cable, with successful recovery after tracking loss.

[CV-13] Polycepta: Object-Centric Appearance Estimation for Multi-Object Tracking

Link: https://arxiv.org/abs/2606.23604
Authors: Mohamed Nagy, Naoufel Werghi, Jorge Dias, Majid Khonji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The tracking-by-detection paradigm in multi-object tracking (MOT) typically relies on static appearance descriptors to complement motion estimation. However, these descriptors are frame-independent, limiting their robustness as visual cues. Since such descriptors are often obtained from computationally intensive pretrained backbones, real-time MOT systems frequently abandon appearance cues altogether and rely solely on motion prediction and geometric association. In this work, we introduce Polycepta, an object-centric appearance state estimation framework that reformulates appearance modeling as a recursive estimation problem rather than a frame-wise matching task. Polycepta constructs and continuously updates an independent appearance state for each tracked object, enabling future appearance representations to be estimated from accumulated observations. A proposed learning strategy encourages Polycepta to learn the appearance-state construction of object-specific representations rather than memorize them, enabling appearance estimation for unseen classes. A key property of Polycepta is that the quality of appearance estimation improves as object states evolve during inference. While conventional appearance descriptors remain static or degrade over time, Polycepta progressively refines appearance estimates as additional observations are accumulated. Extensive experiments on KITTI, the Waymo Open Dataset, and MOT17 demonstrate consistent reductions in identity switches and improvements in tracking performance when integrated into tracking-by-detection pipelines. Polycepta operates at 90.57 Hz and delivers state-of-the-art performance on the KITTI benchmark when integrated into the RobMOT framework, achieving a MOTA of 92.27%.

[CV-14] Real-Time Multimodal Activity-Aware Error Detection in Robot-Assisted Surgery

Link: https://arxiv.org/abs/2606.23593
Authors: Seyed Hamid Reza Roodabeh, Zongyu Li, Homa Alemzadeh
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Robot-assisted minimally invasive surgery improves surgical precision but introduces complexity, making technical error detection essential for ensuring patient safety. Current executional error detection methods using video data often overlook fine-grained contextual descriptions of activities and error types within the hierarchical structure of surgical procedures. They also under-utilize complementary multimodal information. We propose a unified framework for executional error detection that leverages multimodal input, including video, kinematics, and descriptive textual prompts. Through activity prompting, we integrate descriptive language in gesture-level activities, instrument-object interactions, and error definitions. We also introduce activity-aware visual embeddings derived from vision encoders pretrained on surgical activity labels to compare the effectiveness of contrastive language-image embeddings with traditional image-based embeddings for error detection. By seamlessly integrating kinematic data with video and textual modalities, our framework significantly improves error detection performance. Achieving up to 5% and 16.6% F1 score improvements over state-of-the-art baselines on the JIGSAWS and SAR-RARP50 datasets, respectively, we demonstrate the value of combining curated textual prompts with multimodal data for accurate error detection.

[CV-15] Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

Link: https://arxiv.org/abs/2606.23581
Authors: Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal agents repeatedly re-examine the same video frames, UI screenshots, and rendered artifacts as their context window slides and reasoning iterates, yet every look-back re-encodes from scratch, because prefix caches serve reuse only at a fixed leading position. We show this recompute is avoidable, and identify exactly what naive KV reuse loses: the cross-chunk conditioning a chunk absorbs from its neighbours. This loss is asymmetric. The direct readout of a cached chunk is recovered exactly and for free by the standard state-merge. What remains is a diffuse, low-rank residue concentrated in deep layers, invisible to single-hop retrieval but precisely what multi-hop reasoning binds on. Blind reuse therefore leaves single-hop recall intact while halving multi-hop accuracy; this is the failure mode prior position-independent caches, designed for single-context or single-image reuse, do not address. We repair it with a small, training-free low-rank conditioning patch stored alongside each position-free chunk. Reuse reduces to one operator across MLA, GQA, and MHA: exact RoPE re-rotation to any target position, plus the patch that restores cross-chunk binding. This makes three window operations cheap: reorder (one patch serves every ordering of a cached set), sliding-window survival (surviving chunks relocate via rotation only, zero re-encode), and recall (an evicted chunk is rehydrated by its patch, never re-encoded). A rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks, MM-NIAH across two attention families and two-page doc-QA, at a fraction of the KV footprint, and reconstructs re-prefill KV to within bf16 rounding in a production SGLang kernel across six backbones. The conditioning signal is strongest in redundant vision and video streams, making our solution most impactful where multimodal agents spend their recompute budget.
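
The "exact RoPE re-rotation" this abstract mentions rests on a general property of rotary embeddings: rotations compose additively, so a key cached at one position can be moved to another by rotating it through the offset alone. The sketch below checks that property with the common interleaved-pair RoPE convention. The helper `rope_rotate`, the base of 10000, and the toy vector are assumptions for illustration. It does not reproduce Kamera's kernel or its low-rank conditioning patch.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to a vector of even length d.

    The pair of dimensions (2k, 2k+1) is rotated by the angle
    position * base ** (-2k / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i = 2k
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


raw_key = [0.3, -1.2, 0.7, 0.5]                 # hypothetical un-rotated key
cached = rope_rotate(raw_key, position=5)       # key as stored at position 5
moved = rope_rotate(cached, position=12 - 5)    # rotate by the offset only
fresh = rope_rotate(raw_key, position=12)       # re-encoding from scratch
max_err = max(abs(a - b) for a, b in zip(moved, fresh))
```

Here `moved` and `fresh` agree up to floating-point rounding. Relocating a cached key therefore needs no access to the original token, only the position offset.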

[CV-16] HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory

Link: https://arxiv.org/abs/2606.23565
Authors: Xiaolin Zhou, Liu Liu, Tingyang Xiao, Wei Feng, Fa Fu, Xinrui Meng, Xinjie Wang, Jialiang Han, Boyang Yu, Yun Du, Wei Sui, Zhizhong Su
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:LLM agents follow a practical execution loop in digital environments: they reason over structured states, invoke tools, inspect feedback, and revise actions. Extending this loop to physical robots is difficult because physical execution is continuous, embodiment-dependent, uncertain, and constrained by safety. Existing embodied-AI systems have advanced manipulation, spatial understanding, navigation, and humanoid control, but these capabilities often remain specialized modules or loosely coupled decision loops. In this work, we introduce HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. We deploy HoloAgent-0 on real hardware and evaluate its spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.

[CV-17] Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views ECCV2026

Link: https://arxiv.org/abs/2606.23557
Authors: Jiho Choi, Seonho Lee, Seojeong Park, Hyunjung Shim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
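
For readers unfamiliar with GRPO, its core step is to score each sampled trajectory relative to the other samples drawn for the same question. A minimal sketch of that group normalization follows. The function name, the use of the population standard deviation, and the reward values are assumptions, and the way DR-MV3D combines its answer, global consistency, and local trajectory rewards is not specified here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total rewards for four trajectories sampled for one question.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

Trajectories that beat the group average receive positive advantages and the rest receive negative ones. A denser reward therefore changes which samples are reinforced even when their final answers agree.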

[CV-18] AwakeForest: An Interactive Geospatial Platform for Large-Scale Forest Imagery

Link: https://arxiv.org/abs/2606.23542
Authors: Suraj Prasai, Kangning Cui, Rongkun Zhu, Sarra Alqahtani, Ying Zhang, Victor Paul Pauca, Miles R. Silman, Fan Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments:

Abstract:Forest imagery analysis often involves multiple tightly coupled vision tasks, which must be performed under substantial variation in geographic regions, sensors, and acquisition conditions. However, practitioners often lack a unified tool that is geospatial-native, cloud-optimized, and ML-integrated for end-to-end workflows spanning annotation, prediction, visualization, and downstream analysis at scale. We present AwakeForest, an interactive end-to-end platform designed for large-scale forest imagery that integrates model-assisted inference, automatic annotation, and human-in-the-loop refinement within a single workflow. Our platform supports plug-and-play integration of pretrained models and enables scalable interaction with forest imagery ranging from standard aerial scenes to large orthomosaics that can span several gigabytes to hundreds of gigabytes. AwakeForest produces analysis-ready outputs that can be directly used for downstream analysis and to support iterative model and annotation updates on new scenes. We demonstrate the system on the PALMS dataset and illustrate how AwakeForest supports an end-to-end workflow for practical forest management and analysis.

[CV-19] LightSTAR: Efficient Visual Document Retrieval via Lightweight Selection with Vision-Adaptive Refinement ECCV2026

Link: https://arxiv.org/abs/2606.23539
Authors: Tongkun Guan, Haocheng Wang, Wei Shen, Xiaokang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Abstract:Visual document retrieval requires rapidly locating relevant pages from large multi-modal corpora in response to user queries. While recent methods powered by Multi-modal Large Language Models (MLLMs) show competitive accuracy, they suffer from prohibitive computational costs by applying intensive MLLM encoding to every single page. Meanwhile, we observe that user queries are typically keyword-anchored, containing semantically rich words that are expected to appear directly in the visible text of relevant pages, offering an efficient cue for quickly narrowing down candidate pages. Building on this insight, we propose LightSTAR, an efficient framework that decomposes visual document retrieval into: 1) LLM-free Visual Selection, which utilizes content-grounded query encoding to focus on informative words and employs LLM-free visual embeddings to produce a high-recall candidate set; and 2) Vision-adaptive Semantic Refinement, which further performs fine-grained semantic matching exclusively on these top candidates via adaptive region-wise feature fusion to effectively combine textual and layout cues, optimized through a hardness-aware contrastive objective. Experimental results demonstrate that LightSTAR achieves state-of-the-art retrieval accuracy while reducing end-to-end latency by several-fold, offering a highly practical solution to the accuracy-efficiency trade-off in visual document retrieval. Code is available at this https URL.

[CV-20] Scaling State-Space Models from Lines to Paragraphs: An Ablation of Mamba-based OCR ICDAR2026

Link: https://arxiv.org/abs/2606.23524
Authors: Merveilles Agbeti-Messan, Pierrick Tranouez, Stéphane Nicolas, Clément Chatelain, Thierry Paquet
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICDAR 2026 Workshop on Machine Learning (WML)

Abstract:End-to-end OCR increasingly relies on autoregressive sequence models, where the quadratic cost of Transformer attention limits efficient transcription of long, paragraph-level text. State-Space Models (SSMs) such as Mamba offer linear-time decoding and have recently been shown to match Transformer accuracy on printed historical lines, but their behavior as sequences grow from short lines to full paragraphs, and their generalization to handwriting, remain poorly understood. We study how a Mamba-based OCR recognizer scales from lines to paragraphs. We first conduct a systematic exploration of its four core hyperparameters (decoder depth, state dimension, expansion factor, and connector depth) on synthetic paragraphs from 100 to 1,000 characters, identifying the recurrent state dimension and the expansion factor as the dominant levers for long-sequence accuracy. We then compare the recognizer against a Transformer baseline trained under an identical protocol. On clean synthetic paragraphs, both models stay below 1% CER at every length while the SSM runs 1.4 to 4.5 times faster, the speedup growing with sequence length. On real handwriting, however, the SSM lags clearly behind: it reaches 8.2% CER on IAM lines and 10.0% on IAM paragraphs, against 4.2% and 3.5% for the Transformer baseline. Through controlled experiments we show that a substantial part of this gap stems from data scarcity rather than from an intrinsic architectural limit: the autoregressive SSM decoder is markedly data-hungry on long sequences. Our study clarifies when SSMs are a practical choice for large-scale document transcription and when they are not.
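
The CER figures quoted in this abstract are character error rates. As a reminder of how that metric is usually computed, here is a minimal sketch that assumes the common definition: Levenshtein distance divided by the reference length. The paper's exact normalization (for example, its handling of whitespace or casing) is not stated here, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical transcription with one substitution and one deletion.
score = cer("paragraph", "paragraf")
```

Because the denominator is the reference length, the same number of character errors weighs less on a long paragraph than on a short line. This matters when comparing line-level and paragraph-level results.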

[CV-21] Arbor: Explicit Geometric Conditioning for Controllable 3D Asset Generation

Link: https://arxiv.org/abs/2606.23514
Authors: Jan-Niklas Dihlmann, Andreas Engelhardt, Simon Donne, Hendrik P.A. Lensch, Mark Boss
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:Text and image conditioned 3D models now generate convincing assets, but they still offer little direct control over the space an object should occupy or avoid. In authoring, this spatial intent is often known before generation starts. A chair should fit a seating envelope, a prop should leave clearance for motion, or a part should expose a contact surface. Prompts and image views are poor carriers for such constraints, creating the need for an explicit control interface. We present Arbor, a trainable attachment for text conditioned latent 3D generation. Arbor introduces constraint meshes as a native 3D control interface. The interface uses hull regions where geometry should exist, avoidance regions that should remain empty, and touch regions the object should contact. Unlike completion or whole object scaffold control, these meshes are not target evidence. They are local typed requirements and can include regions where no surface should appear. Arbor keeps this signal as geometry by converting constraint meshes into tokens and learning a routed attachment inside a frozen denoiser. Each latent region can therefore receive the part of the constraint that matters for its spatial location. We evaluate Arbor on automatic and artist curated control benchmarks with hull, avoidance, and touch constraints, and compare the metric trends to a user preference study. Even without dedicated compliance losses, Arbor improves constraint obedience while preserving object quality and variation under fixed constraints.

[CV-22] UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

Link: https://arxiv.org/abs/2606.23503
Authors: Yohann Perron, Guillaume Astruc, Nicolas Gonthier, Clement Mallet, Loic Landrieu
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision Transformers (ViT) dominate computer vision. However, their reliance on rigid patch projectors hinders transfer to Earth Observation (EO), where input modalities, scales, and resolutions vary widely. We introduce UniverSat, a ViT-style backbone built around a Universal Patch Encoder that maps patches from arbitrary spatial, spectral, and temporal resolutions, and from both optical and non-optical sensors, into a shared embedding space with a shared set of weights. This enables training a single model on heterogeneous multimodal corpora via self-supervision, yielding robust, sensor-agnostic spatial features. We validate this approach with strong results across classification and segmentation on standard EO benchmarks from GeoBench, PANGEABench, and SpectralEarth. Our code and models are available at this https URL.

[CV-23] Brain-Adapter: A Dual-Stream Vision-Language MIL Framework for Comprehensive 3D CT Diagnosis of Acute Intracranial Pathologies MICCAI2026

Link: https://arxiv.org/abs/2606.23494
Authors: Zhenyu Yi, Zhiyun Song, Yusong Sun, Zelin Liu, Manman Fei, Zhenhao Li, Jiaxuan Zhao, Xu Han, Lichi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2026

Abstract:Automated diagnosis of 3D brain CT scans is essential for critical care, yet it remains challenging due to the heavy reliance on manual annotations and the limited semantic understanding of conventional models. While 2D foundation vision-language models (VLMs) have shown remarkable generalization, effectively transferring their representational power to 3D volumes remains an open problem. In this paper, we propose Brain-Adapter, a novel dual-stream multiple instance learning (MIL) framework that leverages pre-trained 2D biomedical VLMs and raw diagnostic reports for robust scan-level multi-label classification. Specifically, we introduce a Text-Conditioned Attention (TCA) mechanism, utilizing raw diagnostic sentences as semantic queries to dynamically align visual cues with specific disease concepts. Concurrently, a parallel visual MIL stream captures global scan characteristics, supervised by structured labels extracted via a Large Language Model (LLM). To ensure representation coherence, a consistency constraint enforces synergy between the two streams. During inference, an Uncertainty-Aware Refinement (UAR) module dynamically calibrates and fuses these dual-stream predictions to resolve ambiguous cases. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art 3D models and standard MIL approaches. By eliminating the reliance on dense annotations, Brain-Adapter provides a highly scalable and clinically viable solution for 3D acute intracranial pathology analysis.

[CV-24] MeshFlow: Mesh Generation with Equivariant Flow Matching SIGGRAPH2026

Link: https://arxiv.org/abs/2606.23489
Authors: Qi Sun, Kiyohiro Nakayama, Jing Nathan Yan, Qixing Huang, Alexander Rush, Leonidas Guibas, Gordon Wetzstein, Jing Liao, Guandao Yang
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2026

Abstract:Meshes are among the most common 3D scene representations, but directly generating meshes is challenging because the representation contains important symmetries, including permutation invariance of faces and vertices. MeshFlow learns to generate triangle meshes directly as triangle soups, avoiding the need to serialize meshes into long autoregressive sequences. We adopt equivariant optimal-transport flow matching models that respect the key symmetries of triangle soups: arbitrary permutations of faces and permutations of the vertices within each face. Toward this goal, we propose a simple yet effective modification to the Diffusion Transformer architecture, resulting in a scalable network capable of modeling a velocity field while maintaining the desired equivariance. We further introduce an optimal-transport-based training objective that improves convergence by eliminating supervision signals that violate these symmetries. MeshFlow achieves mesh quality comparable to state-of-the-art autoregressive mesh generators while providing about an 18× speedup during inference. Project page is at this https URL.
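
The two symmetries named in the abstract (reordering faces, and reordering vertices within a face) can be illustrated with a small canonicalization routine. This is only a sketch of the symmetry itself, not of MeshFlow's equivariant network or training objective; the function name `canonical_soup` and the sample triangles are assumptions made for illustration.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[Vertex, Vertex, Vertex]


def canonical_soup(faces: List[Face]) -> List[Face]:
    """Return an order-independent form of a triangle soup.

    Vertices are sorted inside each face, then the faces are sorted,
    so any reordering of faces or of vertices within a face maps to
    the same canonical list.
    """
    return sorted(tuple(sorted(face)) for face in faces)


# Two triangles forming a unit square in the z = 0 plane (hypothetical data).
soup_a = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
    ((1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)),
]
# The same triangles with the faces swapped and the vertices shuffled.
soup_b = [
    ((0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)),
    ((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
]

same_mesh = canonical_soup(soup_a) == canonical_soup(soup_b)
```

A generator that serializes faces into a sequence has to pick one of these orderings arbitrarily; a model that is equivariant to them treats all orderings of the same soup alike.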

[CV-25] From Reconstruction to Decision: A Post-Encoder Plug-in Adapter for Curvilinear Segmentation ECCV2026

Link: https://arxiv.org/abs/2606.23486
Authors: Qin Lei, Jiang Zhong, Xin Xiao, Yuming Yang, Hao Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ECCV 2026

Abstract:Curvilinear object segmentation, including vessels and cracks, is challenging due to extreme spatial sparsity and topological fragility, where small local errors can cause severe structural disconnections. Meanwhile, modern segmentation pipelines increasingly rely on strong but hard-to-modify foundation encoders whose heavy downsampling limits fine structural recovery. Motivated by this, we focus on the post-encoder stage and study two recurring and actionable failure modes: a reconstruction bottleneck in high-resolution feature restoration and a decision bottleneck in binarization. We present PEPA, a lightweight Post-Encoder Plug-in Adapter for 2D curvilinear segmentation pipelines with accessible decoder/head features and target, query, or class descriptors. PEPA couples (i) Target-Conditioned Snake Upsampling (TCSU), which uses target-conditioned continuous snake-like sampling to better recover thin and tortuous structures during upsampling, and (ii) Target-Adaptive Differentiable Thresholding (TADT), which predicts target-specific thresholds and optimizes a soft-threshold surrogate with explicit safeguards against trivial bias shifting. Under this post-encoder interface, PEPA can be attached to both prompt-based decoders and conventional dense predictors. Experiments on five medical and industrial benchmarks show that adding PEPA to frozen-encoder baselines yields consistent improvements, with gains in topological connectivity (clDice) typically exceeding those in region overlap (IoU), indicating improved structural continuity. With only ~0.26M additional parameters, PEPA offers a practical post-encoder enhancement for structure-centric segmentation.
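
To make the "decision bottleneck in binarization" concrete, the sketch below contrasts a hard threshold with one common differentiable relaxation, a temperature-scaled sigmoid. The abstract does not specify TADT's surrogate or its safeguards, so the sigmoid form, the `temperature` parameter, and both function names are assumptions rather than the paper's method.

```python
import math


def hard_threshold(prob: float, threshold: float) -> float:
    """Binarize a probability; the gradient w.r.t. threshold is zero almost everywhere."""
    return 1.0 if prob > threshold else 0.0


def soft_threshold(prob: float, threshold: float, temperature: float = 0.05) -> float:
    """Smooth stand-in for hard_threshold (assumed form, not the paper's).

    A sigmoid of (prob - threshold) / temperature approaches the hard step
    as temperature shrinks, while staying differentiable in the threshold.
    """
    return 1.0 / (1.0 + math.exp(-(prob - threshold) / temperature))


# Hypothetical pixel probabilities scored against a per-target threshold of 0.5.
probs = [0.1, 0.5, 0.9]
soft = [soft_threshold(p, 0.5) for p in probs]
hard = [hard_threshold(p, 0.5) for p in probs]
```

Because the soft version varies smoothly with `threshold`, a predicted per-target threshold can receive gradient from the segmentation loss, which a hard step would block.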

[CV-26] C2GR: Coupled Comprehensive Generative Replay for a Continually Learnable Universal Segmentation Model

Link: https://arxiv.org/abs/2606.23473
Authors: Wei Li, Jingyang Zhang, Guoan Wang, Junzhi Ning, Yang Chen, Guang Yang, Lixu Gu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been submitted to a relevant journal

Abstract:Universal segmentation models exhibit significant potential for diverse tasks involving different imaging modalities and segmentation objectives. Task-Incremental Learning provides a privacy-preserving approach to continually evolve a universal model on tasks from sequentially-arriving medical departments. However, training the model solely on the incoming task induces forgetting on past tasks, since consecutive tasks exhibit concurrent shifts in image appearance and segmentation objective. To address this problem, we propose a novel Coupled Comprehensive Generative Replay (C^2GR) framework that simultaneously synthesizes image-mask pairs of previous tasks to mitigate forgetting under concurrent appearance and objective shifts. This requires preserving image-mask correspondence for structure-realistic generation and bridging asynchronous optimization of the generator and segmentor for segmentation-oriented generation. Specifically, we propose a Bayesian Joint Diffusion (BJD) method that formulates the correspondence as conditional distributions optimized via conditional denoising. Furthermore, we develop a Relation-aware Unified Prompt Synchronization (RUPS) scheme to simultaneously modulate the generator and segmentor via a shared task-relation-aware prompt for synchronizing their optimization. Experiments on 20 tasks spanning diverse modalities and objectives demonstrate that C^2GR exhibits only a 2.44% drop in overall performance compared to joint training with all task data, effectively alleviating forgetting from the concurrent shifts. Our code will be made publicly available at this https URL.

[CV-27] MeGAS: Thermomechanical Dynamic Gaussian Splatting for Thermophysical Scene Editing ECCV2026

Link: https://arxiv.org/abs/2606.23455
Authors: Zesong Yang, Yuanhang Lei, Liyuan Cui, Yihang Chen, Jiaer Huang, Boming Zhao, Peter Yichen Chen, Hujun Bao, Zhaopeng Cui
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. Project page: this http URL

Abstract:Recent advances integrate physically grounded Newtonian dynamics with neural rendering frameworks, narrowing the gap between photorealistic scene reconstruction and physics-based animation. However, existing approaches focus on mechanically driven dynamics while neglecting temperature, a fundamental yet invisible physical factor underlying phenomena such as melting, solidification, and other thermomechanical processes. In this paper, we propose MeGAS, a novel framework that incorporates thermomechanical phase-change dynamics into 3D Gaussian Splatting (3DGS). Specifically, we propose a new thermomechanical dynamic Gaussian Splatting representation that augments 3DGS with temperature attributes and employs a heat advection-diffusion solver with MPM dynamics incorporating phase transitions, enabling physically plausible and visually realistic synthesis of thermophysical phenomena. Furthermore, a new topology-adaptive Gaussian rendering strategy is proposed to mitigate cracking and floaters under extreme deformation. Extensive experiments demonstrate that MeGAS produces physically consistent thermomechanical behavior while maintaining high-fidelity photorealistic rendering, advancing toward physics-integrated world models.

[CV-28] Rethinking Object-Centric Representations for Video Dynamics Modeling

Link: https://arxiv.org/abs/2606.23436
Authors: Amaury Wei, Ismail Nejjar, Olga Fink
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures

Abstract:Unsupervised video object tracking aims to decompose dynamic scenes into persistent, object-centric entities without manual annotations. Many recent approaches rely on slot-based representations, where a fixed set of latent variables (“slots”) represent individual objects across frames. To preserve object identity, these models enforce temporal consistency on slot embeddings. However, when appearance and pose are entangled, this consistency objective conflicts with object motion and viewpoint changes. As a result, slots tend to lock onto static regions (e.g., background) to satisfy the consistency objective, while foreground objects become fragmented across multiple slots or frequently swap identities. To address these limitations, we propose STAITUS, a unified framework that explicitly disentangles each slot into appearance and geometric pose (position/scale). Leveraging this disentanglement, STAITUS enforces within-frame spatial separation and applies temporal alignment only in appearance space, yielding sharper masks and more persistent identities under motion, occlusion, and object entry/exit. Furthermore, to mitigate over-segmentation, we introduce an adaptive gating mechanism that dynamically adjusts the number of active slots to match scene complexity. Extensive experiments on synthetic and real-world benchmarks demonstrate that STAITUS substantially outperforms state-of-the-art baselines in segmentation quality and tracking stability.

[CV-29] Polynomial Dice Loss for Medical Image Segmentation ICANN2026

Link: https://arxiv.org/abs/2606.23373
Authors: Hiroaki Aizawa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICANN2026

Abstract:Medical image segmentation is a fundamental task for medical image processing and computer-assisted intervention, yet data imbalance and small lesion detection pose significant challenges. Dice Loss, which measures the overlap between predicted and ground truth regions, is widely used to mitigate these issues. To further emphasize its properties, we propose Polynomial Dice Loss, a polynomial extension of Dice Loss. Specifically, by leveraging the geometric characteristics of Dice Loss and formulating the loss function as a polynomial representation via Taylor expansion, we enable the adjustment of the contribution of higher-order components to the loss function. In our experiments, we evaluate the proposed method against loss functions derived from conventional Dice and Tversky coefficients. Experimental results and further analysis show that the polynomial formulation provides a simple way to control the loss shape and achieves competitive performance across multiple segmentation settings.
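
For readers unfamiliar with Dice Loss, the sketch below computes the standard soft Dice loss and then shows one way a polynomial in that loss could reweight higher-order terms. The abstract does not give the paper's Taylor-expansion formula, so `polynomial_dice_loss_sketch` and its `coeffs` are a hypothetical illustration of the general idea, not the proposed loss.

```python
from typing import Sequence


def soft_dice(pred: Sequence[float], target: Sequence[float], eps: float = 1e-6) -> float:
    """Soft Dice coefficient: 2 * |P ∩ G| / (|P| + |G|), smoothed by eps."""
    intersection = sum(p * g for p, g in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Standard Dice loss: 1 - Dice coefficient."""
    return 1.0 - soft_dice(pred, target)


def polynomial_dice_loss_sketch(
    pred: Sequence[float],
    target: Sequence[float],
    coeffs: Sequence[float] = (1.0, 0.5),
) -> float:
    """Hypothetical polynomial in the Dice loss: sum_k coeffs[k] * loss**(k + 1).

    With coeffs == (1.0,) this reduces to the plain Dice loss; later
    coefficients change how strongly large errors are weighted.
    """
    base = dice_loss(pred, target)
    return sum(c * base ** (k + 1) for k, c in enumerate(coeffs))


# Hypothetical 4-pixel example: two predicted foreground pixels, one true one.
pred = [1.0, 1.0, 0.0, 0.0]
target = [1.0, 0.0, 0.0, 0.0]
plain = dice_loss(pred, target)
poly = polynomial_dice_loss_sketch(pred, target)
```

Here the Dice coefficient is about 2/3, so the plain loss is about 1/3; the extra quadratic term adds a contribution that grows faster as the overlap gets worse.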

[CV-30] TooBad: Backdoor Diffusion Models with Ultra-Low Poison Rate and Imperceptible Trigger

Link: https://arxiv.org/abs/2606.23362
Authors: Vu Tuan Truong, Long Bao Le
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models (DMs), despite their impressive capabilities across a wide range of generative tasks, have been shown to be vulnerable to backdoor attacks. However, existing backdoor methods face critical trade-offs among key factors: attack performance, stealthiness, time complexity, and required poison rates. For example, achieving high attack performance typically demands a high poison rate and prolonged training, which undermines stealthiness, making the attack more detectable by backdoor defenses. This paper proposes TooBad (trigger optimization for backdoor diffusion models), a backdoor framework which introduces a novel DM-tailored trigger optimization technique to dramatically enhance the performance of backdoor attacks on DMs. Experiments on representative benchmarks such as CIFAR-10 show that TooBad can achieve high ASRs (85%) at only 0.5% poison rate, significantly lower than the 10% typically required by prior work on the same datasets. At 5% poison rate, TooBad reaches nearly 100% ASR within just 3-5 backdoor injection epochs, whereas existing methods need at least 30-50 epochs at double the poison rate for comparable results. Despite its potency, TooBad easily evades SOTA defenses and maintains high utility. These results reveal a critical threat to DMs and highlight the need for more robust defenses against such stealthy yet efficient attacks.

[CV-31] Changing Modalities: Adapting Remote Sensing Models to New Satellites and Sensors

Link: https://arxiv.org/abs/2606.23356
Authors: Tim G. Zhou, Anthony Fuller, Geoff Pleiss, Evan Shelhamer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures, 9 tables

Abstract:Machine learning models for remote sensing are trained and deployed on a static set of modalities. However, as we equip newer satellites with novel sensors and retire old ones, practitioners may wish to deploy an existing model on a substitution, superset, or subset of modalities with minimal retraining given data availability or practical computational constraints. We study the setting of updating existing models to changing modalities and identify three main scenarios: Modality Transfer (substitution), Addition (superset), and Peeking (subset). We propose DeluluNet, an architecture with modular components for all three changing modality scenarios. DeluluNet is trained end-to-end, learning a multi-modal model from a unimodal teacher and unlabeled multimodal data via modality hallucination–predicting missing modality representations from those that are present. As a result, DeluluNet can keep predicting even when input modalities change, providing a practical alternative to re-labeling and re-training in a changing world.

[CV-32] Faithful Grounded Visual Reasoning via Learned Proxy-Tokens ICIP2026

Link: https://arxiv.org/abs/2606.23354
Authors: Tom Hodemon, Mohamed Chaouch, Aboubacar Tuo, Angelique Loesch
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICIP 2026. Code, model and data available at: this https URL

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success in Visual Question Answering (VQA), yet their “black-box” nature hinders deployment in critical domains. Grounded Visual Reasoning (GVR) approaches attempt to improve interpretability by explicitly coupling textual rationales with visual grounding information, which is typically expressed as textual coordinates. This mechanism lacks a learnable semantic link to the visual features, often resulting in a semantic-spatial gap where the model hallucinates coordinates that do not correspond to image evidence. In this work, we introduce Composer, an MLLM that leverages a novel visual grounding mechanism based on learned proxy-tokens to promote faithful interpretability. These discrete symbolic pointers explicitly index the image latent space, allowing the model to manipulate visual regions as addressable, semantically manipulable sets. To rigorously validate our novel grounding mechanism, we constructed ComposerGCoT, a dataset synthesized to enable holistic assessment of reasoning consistency and grounding accuracy. Experimental results indicate that Composer achieves performance parity with its coordinate-based counterpart in final answer accuracy, while improving visual grounding accuracy by +9.0 points. By demonstrating that discrete proxy-tokens capture spatial semantics more effectively than typical textual coordinates, we establish that visual grounding mechanisms with learnable semantic links represent a promising path toward trustworthy and reliable MLLMs.

[CV-33] RT-DocLayout: Real-Time End-to-End Document Layout Analysis with Reading Order in the Wild

Link: https://arxiv.org/abs/2606.23344
Authors: Cheng Cui, Tingquan Gao, Xueqing Wang, Changda Zhou, Hongen Liu, Ting Sun, Yubo Zhang, Zelun Zhang, Jiaxuan Liu, Manhui Lin, Yue Zhang, Suyin Liang, Yiqing Xiang, Yi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate document layout analysis remains a critical bottleneck for document parsing systems, due to the intricate coupling among heterogeneous document layout elements, geometric distortions (e.g., paper warping and bending, perspective variations), and reading order within diverse layout structures. Existing approaches typically rely on fragmented multi-stage pipelines or computationally heavy generative Transformer architectures, leading to error propagation and limited efficiency. In this paper, we present RT-DocLayout, a highly efficient end-to-end framework for document layout analysis, designed as a front-end for document parsing tasks. The proposed model unifies classification, detection, pixel-level segmentation, and reading order prediction for layout elements within a single 33M-parameter architecture. Building on RT-DETR, our key contribution is a unified multi-task formulation within a single query-based decoder that simultaneously classifies elements, regresses bounding boxes, generates masks, and constructs relationships to infer reading order. By jointly learning geometric and structural representations, RT-DocLayout introduces multi-task optimization that substantially improves robustness under real-world document distortions. Extensive experiments on public benchmarks demonstrate state-of-the-art performance in document layout analysis while maintaining real-time inference speed (132.1 FPS). When coupled with downstream OCR engines, RT-DocLayout significantly improves full-document reconstruction quality, providing a scalable and practical foundation for real-world document intelligence systems.

[CV-34] VideoAgent: All-in-One Framework for Video Understanding and Editing

Link: https://arxiv.org/abs/2606.23327
Authors: Hengji Zhou, Lingxuan Huang, Jian Wang, Bing Zhou, Si Wu, Lianghao Xia, Chao Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. Code available at this https URL

Abstract:Video editing has become essential in digital media creation, yet existing automated systems are restricted to short segment processing and domain-specific tasks. They face two critical limitations: i) inability to handle diverse video comprehension and editing operations, and ii) lack of long-video understanding for coherent narrative creation. We propose VideoAgent, an all-in-one agentic framework addressing these challenges through two key innovations. First, we develop automated video shot creation with shot planning agents for coherent narratives and cross-modal retrieval for aligned visual content. Second, we design a multi-agent orchestration framework integrating over thirty specialized editing agents. Intent parsing filters relevant tools while textual-gradient graph optimization assembles complex editing pipelines. Extensive experiments on our newly-proposed VideoEdit benchmark and public datasets demonstrate VideoAgent’s superiority over existing multimodal LLMs and agentic systems. VideoAgent achieves 87-95% orchestration success rates while reducing API costs by 60%. Human evaluation across six video categories shows VideoAgent produces professional-quality content approaching human-level performance, with ratings only 4% below human-created videos. We release our code at this https URL.

[CV-35] Ocean4D: Generative Underwater 4D Reconstruction via Medium-Aware Video Diffusion

Link: https://arxiv.org/abs/2606.23298
Authors: Yuqiang Huang, Yuxi Wang, Junyu Dong, Zhaoxiang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Underwater 4D reconstruction remains challenging due to the coupling between degraded light transport in participating media and dynamic water variations. Most existing methods are developed under in-air assumptions and do not explicitly account for underwater absorption and backscatter. Additionally, near-static assumptions make these approaches sensitive to drifting particles and dynamic distractors, leading to unstable geometry and inconsistent cross-view results. To address these issues, we propose a generative framework for underwater 4D reconstruction, named Ocean4D, which is built on two complementary components. Specifically, 4D-GCC constructs 4D geometrically consistent conditioning with improved cross-frame coverage, while the Medium-Aware Block performs implicit medium-aware denoising in the latent diffusion process to stabilize underwater appearance under absorption and scattering. Given a monocular video and target cameras, our method generates videos along the target trajectories while preserving global structure and cross-view consistency. Extensive experiments on both dynamic and static underwater benchmarks demonstrate state-of-the-art performance on underwater reconstruction.

[CV-36] Flow6D: Discrete-to-Continuous Flow Matching for Efficient and Accurate Category-Level 6D Pose Estimation

Link: https://arxiv.org/abs/2606.23293
Authors: Mingyu Mei, Li Zhang, Zibo Dai, Han Sun, Xinyue Zhao, Huiliang Shen, Zaixing He
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L), 2026

Abstract:6D pose estimation is a key task in computer vision and embodied AI, widely used in robotic manipulation, augmented reality, etc. Existing methods directly regress in a high-dimensional continuous space, facing two key challenges in category-level pose estimation: limited accuracy due to noise and local optima, and inefficient search over an infinite space that hinders real-time performance. This paper proposes Flow6D, a hierarchical flow matching framework with a two-stage discrete latent space localization-continuous pose regression strategy. Rotation and translation parameters are first discretized into bins, with a discrete flow matching model locking the latent space around the true pose to reduce search complexity. Then, by sampling in the latent space, a continuous flow matching model predicts local pose residuals to optimize the estimate and regress to an accurate pose. The framework also naturally extends to articulated objects, outperforming state-of-the-art methods on synthetic and real datasets with real-time inference at 70 FPS. Project website: this https URL.
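
The coarse-then-fine decomposition described above (pick a bin first, then regress a local residual) can be sketched on a single rotation angle. In the paper both stages are learned flow matching models over rotation and translation; here the bin and residual are simply computed from a known value, and the 36-bin layout and function names are assumptions for illustration.

```python
from typing import Tuple


def to_bin_and_residual(angle_deg: float, num_bins: int = 36) -> Tuple[int, float]:
    """Split an angle into a discrete bin index and an offset from the bin center."""
    width = 360.0 / num_bins
    wrapped = angle_deg % 360.0
    # min() guards the rare floating-point case where wrapped rounds up to 360.0.
    index = min(int(wrapped // width), num_bins - 1)
    center = (index + 0.5) * width
    return index, wrapped - center


def from_bin_and_residual(index: int, residual: float, num_bins: int = 36) -> float:
    """Recombine the coarse bin and the fine residual into an angle in [0, 360)."""
    width = 360.0 / num_bins
    return ((index + 0.5) * width + residual) % 360.0


# A 47 degree rotation falls in bin 4 (40 to 50 degrees), 2 degrees past its center.
index, residual = to_bin_and_residual(47.0)
recovered = from_bin_and_residual(index, residual)
```

The discrete stage only has to choose among `num_bins` options, and the continuous stage only has to cover a residual of at most half a bin width, which is the search-space reduction the abstract refers to.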

[CV-37] Transfer learning-based method for automated ewaste recycling in smart cities

Link: https://arxiv.org/abs/2606.23286
Authors: Nermeen Abou Baker, Paul Szabo-Müller, Uwe Handmann
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published by the EAI Endorsed Transactions on Smart Cities, 2021 journal

Abstract:Sorting a huge stream of waste accurately within a short period can be done with the support of digitalization, particularly Artificial Intelligence, instead of traditional methods. The overlap of Artificial Intelligence and Circular Economy can foster many services in the environmental technology domain, in particular smart ewaste recycling, thereby enabling circular smart cities. We analyse the growing need for automated ewaste recycling as an essential requirement to cope with the fast-growing ewaste stream, and we shed light on the impact of Artificial Intelligence in supporting the recycling process through smart classification of devices, where the smartphone is our case study. Our study applies transfer learning as a special technique of Artificial Intelligence by finetuning the output layers of AlexNet as a pretrained model and performing the implementation on a small dataset that contains 12 classes from 6 smartphone brands. We evaluate the performance of our model by tuning the learning rate, choosing the best optimizer, and augmenting the original dataset to avoid overfitting. We found that Stochastic Gradient Descent with Momentum at a learning rate of 3e-4 yields almost 98% model accuracy with good generalization. Our study supports automated ewaste recycling in decreasing the error rate of ewaste sorting and investigates the advantages of applying transfer learning as the best scenario to overcome the rising challenges.
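
As a reminder of what the reported optimizer does, here is a minimal sketch of one SGD-with-momentum update using the learning rate from the abstract (3e-4). The momentum coefficient of 0.9 and the toy objective f(w) = w² are assumptions; the abstract does not state the momentum value, and the paper fine-tunes AlexNet rather than a scalar.

```python
from typing import List, Tuple


def sgd_momentum_step(
    weights: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float = 3e-4,       # learning rate reported in the abstract
    momentum: float = 0.9,  # assumed value, not given in the abstract
) -> Tuple[List[float], List[float]]:
    """One update: v <- momentum * v + g, then w <- w - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy objective f(w) = w**2 with gradient 2 * w, starting from w = 1.
w, v = [1.0], [0.0]
for _ in range(2):
    grads = [2.0 * wi for wi in w]
    w, v = sgd_momentum_step(w, grads, v)
```

The velocity accumulates past gradients, so the second step moves farther than the first even though the gradient is slightly smaller.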

[CV-38] BoxCtrl: 3D-Aware Visual Prompting for Geometric Image Editing SIGGRAPH2026

Link: https://arxiv.org/abs/2606.23270
Authors: Feifei Wang, Shiyuan Yang, Xiaoyu Li, Jing Liao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH 2026

Abstract:As instruction-based editing models and multimodal large language models advance, diverse image editing tasks have become feasible. However, achieving precise and consistent geometric image editing, such as translating, scaling, and rotating in 3D space, remains a major challenge. In this work, we introduce BoxCtrl, a 3D-aware visual prompting framework. Unlike text-only or coarse 2D-guided approaches, our method introduces informative RGB 3D bounding boxes projected onto 2D images as visual prompts. The three orthogonal faces of each box are painted with distinct RGB colors, simultaneously encoding position, size, and orientation to provide a compact, intuitive in-context visual example. The key to BoxCtrl’s success lies in these well-designed bounding boxes, which decouple geometric control from appearance control. This enables the model to learn consistent correspondences between faces of the same color in the latent space, leading to a precise understanding of geometric intentions and accurate editing results. We introduce a two-stage training paradigm: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). To address paired data scarcity, we construct a large-scale synthetic dataset for SFT, equipping the model with fundamental editing capabilities. To bridge the synthetic-to-real domain gap, we incorporate an online RL stage leveraging unpaired real-world data. Guided by a reward function evaluating geometric accuracy and visual fidelity, our SFT-RL strategy significantly enhances geometric precision while maintaining photorealistic quality. Extensive experiments demonstrate that BoxCtrl achieves state-of-the-art performance across translation, rotation, scaling, and composite editing tasks.

[CV-39] Safe Few-Step Generation via Velocity Editing

Link: https://arxiv.org/abs/2606.23267
Authors: Yujin Choi, Jaehong Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Project Page: this https URL

Abstract:Flow matching has recently emerged as a strong paradigm for state-of-the-art text-to-image (T2I) generation, enabling high-quality generation with a small number of sampling steps. As these models are increasingly integrated into real-world applications, ensuring safe and non-sensitive content generation has become a critical requirement. However, adapting safety and concept removal methods to this new generation framework remains an open challenge. Specifically, prior methods largely rely on iterative trajectory steering across a number of denoising steps or on CLIP-centric prompt embedding manipulation. These design assumptions pose fundamental bottlenecks for safety in flow matching-based T2I generation, where limited sampling steps constrain iterative correction and modern context-aware text encoders diminish the effectiveness of embedding-level interventions. In this paper, we propose VESFlow, a training-free safety method tailored to flow matching with extremely few sampling steps. Leveraging the fact that flow matching models learn the marginal velocity, we directly edit the velocity field via a safe-conditional posterior. VESFlow steers the trajectory toward safe outputs while leaving the conditioning prompt unchanged. Building on the observation that VESFlow leaves outputs unchanged under benign prompts, we further introduce a risk score-based filtering that bypasses velocity editing to reduce computational cost while preserving benign prompt generation. Based on this filtering, we propose VESFlow+, a stronger variant of VESFlow that not only edits the velocity toward the safe direction, but also pushes it away from the unsafe direction. Experimental results show that VESFlow+ removes the target concept, reducing the attack success rate by NudeNet to 6.3% on Ring-A-Bell and 6.8% on MMA-Diffusion on the 4-step MeanFlow model, while preserving fidelity on benign prompts.

[CV-40] P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture

Link: https://arxiv.org/abs/2606.23256
Authors: Felix Tristram, Stefano Gasperini, Benjamin Killeen, Marcel Walch, Christian Benz, Nassir Navab, Ghazal Ghazaei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The increasing maturity of embodied AI platforms has driven a growing interest in procedural video representation learning to support intelligent assistance systems for complex, multi-step tasks. Leveraging large-scale latent predictive training, video foundation models capture video dynamics, enabling downstream tasks such as activity understanding, spatiotemporal localization, and predictive control. However, procedural videos include actions with long-range dependencies that these models do not support, due to the quadratic complexity of self-attention. Distinct actions, for example, may be visually similar despite appearing at different points in the procedure, such as turning the stove on versus off. Here, we propose a backbone-agnostic approach that learns long-duration video representations by reducing the problem to a dense, frame-aligned action space and predicting pooled masked latent vectors. This approach allows our Procedural Joint Embedding Predictive Architecture (P-JEPA) to ingest videos over 30 minutes long, enabling effective long-form understanding of procedural steps. We evaluate P-JEPA using features extracted with VJEPA2.1, TSM, and I3D over the EgoExo4D, EgoProceL, and Assembly101 datasets, finding that it consistently improves linear separability, streaming inference, and temporal action segmentation performance, achieving state-of-the-art results on EgoExo4D fine-grained action classification while using an order of magnitude fewer parameters than LLM-based methods and running in real time.

[CV-41] SteerVTE: Seamless Video Text Editing with Style and Glyph Control

Link: https://arxiv.org/abs/2606.23254
Authors: Kai Zeng,Moran Li,Zhengwei Wang,Yingchen Yu,Yiheng Lin,Ruichuan An,Ming Lu,Qi She,Wentao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual text editing aims to precisely modify text in images and videos while preserving stylistic consistency and visual realism. Despite significant advances in the image domain, video text editing remains largely unexplored: it is a localized task demanding stroke-level precision within small text regions, which compounds the challenges of cross-frame accuracy, temporal coherence, and stylistic fidelity. We introduce SteerVTE, a unified framework that steers a frozen video diffusion model to perform precise Video Text Editing through style and glyph control. Built on a frozen diffusion transformer, SteerVTE attaches a lightweight text context adapter with two complementary modules: a style encoder capturing the original text’s visual attributes, and dual-granularity glyph encoders encoding the target text at both the line and character levels. To overcome the inherently weak text rendering priors of video foundation models, we further propose a glyph-aware spatial-focal loss and a three-stage progressive training curriculum that scales from image to video data. To support large-scale training, we also develop an automatic synthesis pipeline and construct SteerVTE-1M, a dataset of one million triplets spanning diverse scenes, fonts, and stylistic effects. Extensive experiments demonstrate that SteerVTE substantially outperforms existing video editing baselines across text accuracy, style consistency, and temporal coherence.

[CV-42] Privacy-Preserving Person Re-Identification from Temporal Sequences with Transformer and Hungarian Optimization

Link: https://arxiv.org/abs/2606.23230
Authors: Raphaël Delécluse,Hazem Wannous,Laurent Guimas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at 2025 19th International Conference on Automatic Face and Gesture Recognition (FG)

Abstract:Person re-identification (Re-ID) is a crucial task in surveillance and human behavior analysis, often used in public spaces such as transport hubs. Traditional RGB-based Re-ID methods raise privacy concerns and are highly sensitive to lighting variations and occlusion. In this paper, we propose a novel Re-ID approach that leverages depth images, which inherently obscure facial and other identifiable features, making it a privacy-preserving solution. Our method addresses the association problem between multiple views of individuals by applying the Hungarian algorithm, optimizing the matching process through minimization of the global cost across the distance matrix. We further enhance the approach by introducing temporal sequences of frames as input to a Transformer encoder architecture, which exploits both RGB and depth modalities. This architecture captures dynamic movement patterns, improving feature extraction and re-identification accuracy. Additionally, we employ batch hard triplet loss to enhance discriminative feature learning by focusing on the hardest samples. We evaluate both depth-only and RGB-D models on several top-view datasets, including TVPR2, GODPR, and BIWI RGBD-ID. Our results demonstrate that depth-only re-identification can achieve competitive performance compared to state-of-the-art methods, as measured by standard metrics such as Cumulative Matching Characteristics (CMC) and Mean Average Precision (mAP), while prioritizing privacy preservation.
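
The association step described in this abstract (matching identities across views by minimizing the global cost over a distance matrix) can be illustrated with a small sketch. This is not the authors' code: the 3x3 distance matrix is made up, and a brute-force search over permutations stands in for the Hungarian algorithm, which reaches the same optimum in polynomial time. A greedy row-by-row matcher is included only for comparison.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Return (total_cost, assignment) minimising the summed cost.

    cost[i][j] is the distance between identity i in view A and identity j
    in view B. Brute force over permutations is only viable for a handful
    of identities; it stands in for the Hungarian algorithm here.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


def greedy_assignment(cost):
    """Match each row to its nearest still-free column (for comparison)."""
    taken, result = set(), []
    for row in cost:
        j = min((j for j in range(len(row)) if j not in taken), key=lambda j: row[j])
        taken.add(j)
        result.append(j)
    return sum(cost[i][j] for i, j in enumerate(result)), result


# Hypothetical distance matrix (toy numbers, not from the paper).
distances = [
    [1.0, 2.0, 9.0],
    [1.5, 9.0, 9.0],
    [9.0, 9.0, 3.0],
]
optimal_total, optimal_match = min_cost_assignment(distances)
greedy_total, greedy_match = greedy_assignment(distances)
print(optimal_total, optimal_match)
print(greedy_total, greedy_match)
```

On this toy matrix the greedy matcher locks identity 0 onto its nearest neighbour and pays for it later, while the global optimum accepts a slightly worse first match for a lower total. For realistic problem sizes, `scipy.optimize.linear_sum_assignment` solves the same assignment problem efficiently.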

[CV-43] PhysFlow: Frequency Decoupled with Dual-Field Rectified Flow for Remote Photoplethysmography

Link: https://arxiv.org/abs/2606.23226
Authors: Zixu Li,Jianjun Qian,Hang Shao,Lei Luo,Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote Photoplethysmography (rPPG) enables contactless pulse estimation from facial videos, serving as a vital tool for health monitoring. However, current deep learning methods often struggle under complex disturbances, particularly varying illumination, facial expressions, and unconstrained head movements. In such scenarios, subtle physiological signals are easily dominated by external interference, making the recovered rPPG waveform unstable and unreliable. One important reason is that most existing methods directly model the rPPG signal in a unified manner, where different signal components are coupled during reconstruction. This makes it difficult to preserve weak pulse-related variations when strong disturbance-induced changes are present. To address this challenge, we propose PhysFlow, a frequency-decoupled dual-field rectified flow framework tailored for robust rPPG estimation. Specifically, the ground-truth rPPG signal is decomposed into trend and amplitude components, which are used as separate supervisory targets. Based on the extracted facial features, PhysFlow learns two component-specific conditional velocity fields to model the two components separately. This design reduces mutual interference between different components and improves the robustness of rPPG reconstruction under complex disturbances. Moreover, the rectified flow formulation enables efficient waveform reconstruction with only a few ordinary differential equation (ODE) integration steps. Extensive experiments on multiple benchmark datasets demonstrate that PhysFlow outperforms state-of-the-art methods in both heart-rate estimation and rPPG waveform reconstruction across diverse challenging scenarios.
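
To make the "few ODE integration steps" claim concrete, here is a minimal sketch of fixed-step Euler sampling along a velocity field, which is the general sampling scheme of rectified flow. Everything below is an illustrative assumption: `straight_velocity` is a hand-written toy field that moves samples on straight lines toward a fixed `target` waveform, standing in for the learned, feature-conditioned velocity fields that the abstract describes.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target waveform samples (toy values, not real rPPG data).
target = [0.0, 0.8, 1.0, 0.3, -0.2]


def straight_velocity(x, t):
    """Toy field whose trajectories are straight lines ending at `target`."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


noise = [0.5, -1.0, 0.25, 2.0, -0.75]
one_step = euler_sample(straight_velocity, noise, 1)
four_steps = euler_sample(straight_velocity, noise, 4)
print(one_step)
print(four_steps)
```

Because the toy trajectories are perfectly straight, one Euler step already lands on the target and extra steps change nothing beyond rounding error. A learned field is only approximately straight, which is why a few steps, rather than exactly one, are typically used.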

[CV-44] RS-Gen: A Multi-Stage Agentic Framework for Reasoning and Search-Augmented Image Generation

Link: https://arxiv.org/abs/2606.23221
Authors: Feifei Bian,Zhimin Zheng,Wei Deng,Daiguo Zhou,Jian Luan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent years have witnessed remarkable progress in image generation and editing, particularly regarding instruction following and visual fidelity. However, when handling ambiguous intentions, logical reasoning, and Out-of-Distribution (OOD) knowledge, existing image models often yield sub-optimal results due to a lack of deep reasoning capabilities and real-time external information. Although emerging unified understanding-and-generation models attempt to bridge this gap, they remain constrained by their intrinsic parameter scales and static knowledge gaps. Inspired by agentic paradigms, we propose RS-Gen: a plug-and-play, training-free, multi-stage image agentic framework. RS-Gen innovatively introduces a “Questioning-and-Solving” closed-loop mechanism to accurately identify logical issues and knowledge gaps, autonomously planning actions to bridge information deficits and execute deep logical reasoning. Extensive experiments demonstrate that RS-Gen significantly expands the capability boundaries of foundational image generation and editing models. Specifically, on the WISE Verified and RISEBench benchmarks, RS-Gen yields substantial absolute performance gains of 0.313 for Qwen-Image and 19.70 for Qwen-Image-Edit-2511, respectively, successfully elevating both to the state-of-the-art (SOTA) level among open-source models.

[CV-45] Temporally Aware Densification for Dynamic 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.23212
Authors: Vikram Sandu,Mayurdeep Pathak,Rajiv Soundararajan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite modeling temporal motion, dynamic 3D Gaussian Splatting (3DGS) methods still inherit a static densification strategy that is ill-suited for dynamic scenes. This neglect of temporal behavior leads to under-reconstructed and blurry dynamic regions, as short-lived Gaussians receive sparse supervision and fail to densify effectively. We propose a Visibility-Aware Densification (VAD) framework that integrates temporal visibility into the densification process, ensuring that Gaussians are refined based on their actual temporal presence. A Temporally-Adaptive Thresholding (TAT) mechanism further adjusts each Gaussian’s densification threshold according to its temporal lifespan, promoting balanced refinement of both static and dynamic regions. Finally, a Temporal Offset Warping (TOW) design enhances deformation capacity around temporal centers, extending the lifespan of highly dynamic Gaussians and facilitating more effective densification. Our approach achieves substantial improvements in the visual quality of dynamic regions, outperforming existing methods across three dynamic multi-view benchmark datasets. Moreover, the proposed VAD module generalizes across diverse dynamic 3DGS methods, consistently improving dynamic reconstruction as a plug-and-play component.

[CV-46] Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets ICLR2026

Link: https://arxiv.org/abs/2606.23204
Authors: Iris Dominguez-Catena,Daniel Paternain,Mikel Galar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil

Abstract:Large-scale image-text datasets, such as LAION-5B, are foundational to modern AI systems, yet their vast scale and uncurated nature raise significant concerns about demographic and stereotypical biases. This study presents a comprehensive analysis of the demographic composition and representational, stereotypical, and intersectional biases in LAION-2B-en and LAION-2B-multi, the two main components of the LAION-5B dataset. Using state-of-the-art models – FairFace, DeepFace, and Emo-AffectNet – we analyze faces detected in the dataset to identify biases across age, gender, race, and expressed emotion. Our findings reveal substantial overrepresentation of young adults (20–39), White individuals, and males, alongside consistent underrepresentation of minority racial groups and middle-aged or older women across both dataset components. We also observe stereotypical associations between demographic attributes and emotions, such as “Anger” being predominantly linked to males and “Happiness” to females, pointing to systemic imbalances in the data. The consistency of these patterns across two demographic models and both components of LAION-5B demonstrates that these biases are deeply embedded in one of the most widely used training datasets. Given the scale at which LAION-5B is used to train generative models, these demographic imbalances could shape the behavior and outputs of numerous downstream AI systems.

[CV-47] StreamPPG: Low-Latency rPPG Estimation via Consistent Privileged Learning

Link: https://arxiv.org/abs/2606.23186
Authors: Yiming Li,Yihan Yang,Yuguang Chu,Yuanhui Hu,Si-Yuan Cao,Xiaohan Zhang,Xiaokai Bai,Zhe Wu,Hui-Liang Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote photoplethysmography (rPPG) estimates the blood volume pulse (BVP) signal from facial videos, enabling contact-free health monitoring. Conventional clip-wise approaches, which use video clips as input, require capturing over one hundred frames before inference, thus introducing several seconds of delay and hindering real-time use. Meanwhile, frame-wise approaches struggle to capture long-range temporal and periodic features of physiological rhythms, and therefore lead to reduced estimation accuracy. To overcome these issues, we propose StreamPPG, a unified architecture that enables low-latency frame-wise physiological signal estimation while achieving competitive accuracy compared with clip-wise approaches. StreamPPG is trained under a consistent privileged learning (CPL) strategy, which leverages ground-truth rPPG signals as privileged information to enhance the model’s representation capability. Extensive experiments demonstrate that StreamPPG achieves state-of-the-art accuracy across multiple datasets while maintaining real-time throughput on edge devices.

[CV-48] Interpretable Probabilistic Medical Image Segmentation via Gaussian Process with Explicit Modelling of Annotation Bias and Variability MICCAI2026

Link: https://arxiv.org/abs/2606.23177
Authors: Qi Li,Yuliang Huang,Shaheer U. Saeed,Qianye Yang,Vasilis Stavrinides,Zachary M. C. Baum,Dean C. Barratt,J. Alison Noble,Tom Vercauteren,Yipeng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at MICCAI 2026

Abstract:Deep learning-based medical image segmentation models are trained using annotations that exhibit systematic bias and variability across raters. While probabilistic multi-rater approaches can emulate annotator-specific delineations, annotator characteristics are typically encoded implicitly in deep latent feature space, making direct analysis of their influence on predictive distributions less straightforward. We propose a logit-space probabilistic segmentation framework based on a stochastic variational Gaussian process that explicitly decomposes predictions into an image-dependent reference logit distribution and annotator-specific perturbations parameterised by bias and variance. This formulation enables more explicit analysis of how intra- and inter-rater variability propagate to predictive distributions. We evaluate the method on a multi-annotator medical image dataset, which shows that explicitly modelling annotator-specific perturbations improves uncertainty calibration while maintaining comparable segmentation accuracy, compared with a state-of-the-art multi-rater probabilistic segmentation method. The learned bias and variance parameters quantitatively reflect annotator-specific behaviour. Furthermore, controlled perturbation experiments over bias and variance demonstrate how changes in annotator parameters systematically influence predictive performance. The code used in this paper is made publicly available at this https URL.

[CV-49] T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models

Link: https://arxiv.org/abs/2606.23132
Authors: Jaehyuk Jang,Minseok Seo,Seungju Cho,Kangwook Ko,Changick Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) achieve strong zero-shot recognition, but they remain highly vulnerable to adversarial perturbations. Recent test-time adaptations improve robustness without retraining, but they do not directly adapt the corrupted visual representation itself. Prompt-based methods adapt the learnable text prompts, while input-space methods optimize pixels or padding at test time. These approaches can improve predictions, but they do so through an indirect and expensive optimization path. We propose Test-time Visual Subspace Steering (T-VSS), a lightweight defense that performs test-time adaptation directly in the visual feature space. T-VSS first builds a sample-specific low-rank subspace from multi-view feature residuals anchored at the attacked image. It then learns a shared feature correction within this subspace using reliability-weighted entropy minimization. By constraining adaptation to a compact visual geometry, T-VSS steers attacked features toward more stable and discriminative predictions while avoiding noisy full-space updates. Experiments on fine-grained, ImageNet, and ImageNet-OOD benchmarks show that T-VSS improves adversarial robustness while maintaining competitive clean accuracy and better efficiency than prior test-time adaptations.

[CV-50] Expert Consensus on Criteria for the Automated Assessment of Laparoscopic Camera Navigation

Link: https://arxiv.org/abs/2606.23131
Authors: Amir Ebrahimzadeh,Nazila Esmaeili,Michael Ghadimi,Jannis Hagenah
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Background: Laparoscopic camera navigation (LCN) is a critical skill, yet its current assessment typically relies on manual rating systems which are time-consuming and difficult to scale. Automated feedback could significantly enhance surgical training by providing immediate, standardized metrics. This study aims to define a set of approaches for LCN assessment, evaluate their clinical relevance, and establish their technical readiness. Methods: We developed a detailed taxonomy of 14 key aspects of camera navigation, categorized into Framing Composition, Visibility Clarity, Orientation Stability, Motion Dynamics, and Safety Awareness. For each aspect, we assessed the technological readiness of automated measurement based on the current state of the art (SoTA) in computer vision (CV). To establish clinical relevance, we designed a survey for practicing laparoscopic surgeons to rate the importance of each aspect on a 5-point Likert scale and to select the five most critical skills. Results: 23 surgeons participated in the survey. Foundational aspects like Field of View, Focus and Centering were rated as most important by surgeons. We present a “Clinical Importance vs. CV Technological Readiness” matrix, identifying high-priority targets for development: aspects that are both clinically crucial and technologically ready to measure. Conclusion: This work establishes a foundational framework for quantifying LCN skills. By aligning surgeon priorities with CV capabilities, we provide a clear roadmap for automatic skill assessment. This foundation enables the development of AI-driven assistance tools that can accelerate the learning curve for surgical assistants and potentially improve surgical safety and efficiency.

[CV-51] Spectral Gating via Damped Oscillations for Adaptive Implicit Neural Representations ECCV2026

Link: https://arxiv.org/abs/2606.23129
Authors: Alex Costanzino,Pierluigi Zama Ramirez,Giuseppe Lisanti,Luigi Di Stefano
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ECCV 2026. Project Page: this https URL

Abstract:Implicit Neural Representations (INRs) have proven successful in encoding continuous signals through coordinate-based networks, yet they face a spectral dilemma: periodic activations capture fine details but act as all-pass filters that memorise noise, while spatially compact activations regularise effectively but suffer from low-frequency bias. Existing attempts to resolve this trade-off introduce computational overhead or tuning frailty. We propose to model each neuron’s activation as the steady-state response of a sinusoidally-forced damped harmonic oscillator, whose amplitude naturally governs the network’s spectral selectivity during training. By jointly optimising the oscillator parameters alongside the network weights, our method adapts to the target signal’s spectral content without explicit regularisation. Initialised in the stopband, the network exhibits a coarse-to-fine learning curriculum that progressively expands its spectral gate, capturing low-frequency structures first and high-frequency details only when justified by the reconstruction objective. Comprehensive experiments show that our approach consistently achieves state-of-the-art or competitive results against established INRs, while requiring no task-specific tuning of any hyperparameters.
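
For readers unfamiliar with the physics this abstract builds on, the sketch below evaluates the textbook steady-state amplitude of a sinusoidally forced damped harmonic oscillator. It only illustrates why such an amplitude can act as a frequency-selective gate. How the paper parameterises its activations is not stated in the abstract, so the names `omega0`, `zeta` and `force` are assumptions of this example.

```python
import math


def steady_state_amplitude(omega, omega0, zeta, force=1.0):
    """Steady-state amplitude of x'' + 2*zeta*omega0*x' + omega0**2 * x = force*cos(omega*t).

    omega  : driving frequency
    omega0 : natural frequency of the oscillator
    zeta   : damping ratio (must be > 0 when omega == omega0)
    """
    detuning = omega0**2 - omega**2
    damping = 2.0 * zeta * omega0 * omega
    return force / math.sqrt(detuning**2 + damping**2)


# Response of one oscillator (natural frequency 2.0) to several driving frequencies.
for omega in (0.0, 1.0, 2.0, 4.0, 8.0):
    light = steady_state_amplitude(omega, omega0=2.0, zeta=0.05)
    heavy = steady_state_amplitude(omega, omega0=2.0, zeta=0.5)
    print(f"omega={omega:4.1f}  light damping={light:7.4f}  heavy damping={heavy:7.4f}")
```

With light damping the response peaks sharply near the natural frequency and falls off on either side. With heavier damping the peak flattens, and frequencies far above `omega0` are attenuated in both cases. Making these parameters trainable is one way a network could widen or narrow its passband, which is the behaviour the abstract describes in general terms.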

[CV-52] MambaADv2: Evolving Duality-enhanced State Space Model for Unsupervised Anomaly Detection

Link: https://arxiv.org/abs/2606.23126
Authors: Xiaobin Hu,Haoyang He,Bo Yin,Yu He,Lei Xie,Jiangning Zhang,Yu-Gang Jiang,Shuicheng Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While recent advancements in anomaly detection have demonstrated the efficacy of CNN- and Transformer-based approaches, these architectures face inherent limitations: CNNs struggle to capture long-range dependencies, whereas Transformers suffer from quadratic computational complexity. Consequently, Mamba-based architectures have attracted considerable attention, as they successfully combine superior long-range dependency modeling with linear computational complexity. By critically rethinking the structural evolution across the Mamba lineage 1-3 series, this paper proposes MambaADv2, a framework tailored for multi-class unsupervised anomaly detection. MambaADv2 comprises a pre-trained encoder and a Mamba-inspired decoder, equipped with Duality-enhanced State Space (DSS) modules across multiple scales. The proposed DSS module effectively models both global dependencies and local representations by integrating parallel-cascaded Hybrid State Space (HSS) blocks and frequency-enhanced convolution operations. The structure of the Hybrid State Space (HSS) block is tailored by following the SSD-based Mamba lineage and incorporating Mamba3-style position-aware state-space modeling, leveraging the dual computational paths of linear recurrence and parallel matrix formulation to model local continuity and global contextual comparison, thereby better serving the core anomaly detection objective of precisely reconstructing normal representations while magnifying anomalous deviations. Additionally, we propose a semantics-adaptive progressive scanning strategy that decays scanning complexity along the feature pyramid.

[CV-53] LUMINA-26: Low-Light Understanding for Modeling and Interpreting Night-time Actions

Link: https://arxiv.org/abs/2606.23118
Authors: Aman Kumar Pandey,Anil Singh Parihar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 7 figures. Preprint

Abstract:Low-light human action recognition remains a challenging problem due to poor illumination, amplified noise, motion ambiguity, and diverse real-world scenes. Existing low-light datasets often lack sufficient action diversity, capture realism, or balanced class distribution, limiting the development of robust models. To address this, we introduce LUMINA-26: Low-Light Understanding for Modeling and Interpreting Night-time Actions, comprising 6,784 clips across 26 action classes, recorded from 22 subjects across 20 indoor and outdoor locations under naturally occurring low-light conditions. We also propose Illumi-Net: An Illumination-Adaptive Mixture-of-Experts Network, which leverages video-level illumination cues to guide adaptive enhancement and transformer-based spatio-temporal feature extraction, with expert-conditioned decision fusion. Our method surpasses previous state-of-the-art performance on ELLAR (Top-1: 55.13%, Top-5: 78.87%) and establishes a strong baseline on LUMINA-26 (Top-1: 75.95%, Top-5: 93.58%), offering a practical benchmark for future low-light action recognition research.

[CV-54] Technical Report for the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Pretraining-Diverse Ensemble of Foundation Vision Encoders for Robust Outdoor Scene Understanding

Link: https://arxiv.org/abs/2606.23113
Authors: Boyan Wang,Yongxi Huang,Wenjing Li,Tianrui Hui,Shaofei Huang,Nan Pu,Zhun Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This report presents our solution for the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which requires parsing unstructured outdoor scenes from four camera platforms into 56 fine-grained categories. Our approach pairs foundation vision encoders (including DINOv3, SigLIP2, and InternImage) with a Mask2Former decoder, and trains them with a strong recipe including long training schedules, exponential moving average, a larger crop size, and multi-scale plus flip test-time augmentation. The three encoders, chosen for their complementary pretraining objectives, are combined into a pretraining-diverse ensemble through per-class validation-IoU weighting. Evaluated on the official GOOSE test set, our submission achieves 75.40% composite mIoU and takes second place in the challenge. Our study further shows that the encoder’s pretraining recipe, rather than its parameter count or the decoder design, is the dominant factor for accuracy on this benchmark.
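
One plausible reading of "per-class validation-IoU weighting" is sketched below for a single pixel: each class score is a weighted mean of the models' probabilities, with weights proportional to each model's validation IoU on that class. The report's exact weighting rule is not given in the abstract, so the normalisation, the function name `ensemble_pixel`, and all numbers are assumptions made for illustration.

```python
def ensemble_pixel(probs_per_model, val_iou_per_model):
    """Fuse class probabilities for one pixel.

    probs_per_model[m][c]   : probability model m assigns to class c
    val_iou_per_model[m][c] : validation IoU of model m on class c
    Returns (predicted_class, fused_scores). Fused scores are weighted means
    per class and are not renormalised to sum to one.
    """
    num_models = len(probs_per_model)
    num_classes = len(probs_per_model[0])
    fused = []
    for c in range(num_classes):
        ious = [iou[c] for iou in val_iou_per_model]
        total = sum(ious)
        # Fall back to a plain average if no model has any IoU on this class.
        weights = [v / total for v in ious] if total > 0 else [1.0 / num_models] * num_models
        fused.append(sum(w * p[c] for w, p in zip(weights, probs_per_model)))
    return max(range(num_classes), key=lambda c: fused[c]), fused


# Two hypothetical models, two classes (toy numbers).
probs = [[0.7, 0.3], [0.2, 0.8]]
val_iou = [[0.9, 0.5], [0.1, 0.5]]  # model 0 is far better on class 0
weighted_label, weighted_scores = ensemble_pixel(probs, val_iou)
plain_label, plain_scores = ensemble_pixel(probs, [[1.0, 1.0], [1.0, 1.0]])
print(weighted_label, weighted_scores)
print(plain_label, plain_scores)
```

In this toy case a plain average picks class 1, while the IoU-weighted fusion picks class 0 because it trusts the model that validated better on that class. This is the intuition behind letting encoders with complementary strengths vote with class-specific weights.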

[CV-55] Compression and Retrieval: Implicit Memory Retrieval for Video World Models

Link: https://arxiv.org/abs/2606.23105
Authors: Zhan Peng,Jie Ma,Huiqiang Sun,Chong Gao,Zhijie Xue,Zhiyu Pan,Zhiguo Cao,Jun Liang,Jing Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Video world models hold promise for simulating interactive environments, yet maintaining consistent long-term memory across complex camera trajectories remains a critical challenge. Existing methods typically rely on computationally expensive context scaling or rigid heuristic retrieval mechanisms, which lack generalization to varying camera trajectories and environments. In this paper, we propose Compression and Retrieval (CaR), an attention-driven implicit memory retrieval mechanism to overcome these limitations. By injecting viewpoint information via positional encoding, our method performs flexible memory retrieval through attention computation. To efficiently process extended contexts with minimal computational overhead, we further introduce a lightweight context compression network. Furthermore, we construct SceneFly, a large-scale synthetic dataset featuring realistic camera trajectories and frame-level annotations to train and evaluate long-horizon video world models. Extensive experiments demonstrate that our approach achieves state-of-the-art results on established benchmarks and exhibits strong generalization to open-domain scenes.

[CV-56] Scene-agnostic ALS boresight self-calibration

Link: https://arxiv.org/abs/2606.23101
Authors: Aurélien Brun,Jan Skaloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:ALS boresight calibration has relied for two decades on dedicated flight patterns over structured scenes containing planar surfaces of varied aspect and slope. While reliable, this approach imposes constraints on the scene content and operations, which limits its applicability to boresight recovery within routine mapping missions. We present a practical approach that substantially relaxes these requirements by replacing plane-based constraints with scene-agnostic point-to-point correspondences extracted automatically from overlapping ALS strips. Two complementary formulations are proposed to estimate boresight with laser vector observations: (i) a simpler parametric adjustment utilizing INS/GNSS trajectory; (ii) a rigorous formulation treating GNSS and raw inertial data within an existing factor-graph, i.e. a dynamic network, where boresight is added as an additional parameter. Both formulations are evaluated across four operational ALS flights equipped with five inertial systems, covering a wide range of flight altitudes, overlap geometries, terrain types and inertial sensor classes. The analysis draws a clear boundary between the legacy plane-based conditioning that falls short outside the calibration scenario and the proposed formulations, which either recover or absorb boresight effects under conventional mapping geometry. Among them, the lightweight formulation is sufficient for boresight recovery using tactical and navigation grade inertial sensors, while the general factor-graph approach is clearly superior when the inertial sensor errors are less observable within an optimal smoother. This supports the hypothesis that, for INS/GNSS trajectory of sufficient quality, the boresight calibration can be performed without particular scene prerequisites during routine mapping operations using a minimum of 3-4 overlapping strips, with either proposed formulation…

[CV-57] Poisson2Gaussian: Noise Gaussianization to Enhance Image Denoising

Link: https://arxiv.org/abs/2606.23098
Authors: Xirou Zhou,Zijing Xu,Yibo Qu,Qi Zhang,Xiaowan Hu,Xinyang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The quantum nature of light determines the inherent Poisson stochasticity of photon detection, which is ubiquitous in photography, microscopy, and astronomy. However, our controlled numerical studies reveal that the signal-dependency, heteroscedasticity, and statistical asymmetry of Poisson-mixed noise make it challenging for existing denoisers to learn. In contrast, i.i.d. Gaussian noise, with its statistical independence and symmetric distribution, is easier to model for networks. To address this gap, we propose Poisson2Gaussian (P2G), a noise Gaussianization method that explicitly converts complex real-world noise to i.i.d. Gaussian noise via probability density matching beyond low-order moments. We also design an unbiased denoising framework that synergizes P2G with downstream denoisers, ensuring convergence to the underlying signal without requiring paired clean data or explicit noise parameters. Extensive experiments demonstrate that P2G consistently achieves state-of-the-art performance across diverse datasets. In challenging scenarios where noise strongly deviates from Gaussian statistics, our method improves the PSNR by up to 0.75 dB. Notably, P2G is architecture-agnostic and can provide universal improvements for various denoisers. The source code will be publicly available.

[CV-58] Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection ECCV2026

Link: https://arxiv.org/abs/2606.23069
Authors: KunHo Heo,Seungjae Kim,Wongyu Lee,SuYeon Kim,MyeongAh Cho
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. Code: this https URL

Abstract:Few-shot object detection aims to detect novel object categories from only a few labeled examples, avoiding costly large-scale annotation. Recent prototype-based similarity learning approaches enable training-free adaptation by matching query features with class prototypes. However, they suffer from two fundamental limitations: (i) class confusion arising from inter-class similarity margin collapse, and (ii) insufficient visual cues for precise localization, as similarity scores capture only class-level semantic affinity while providing limited spatial information. To address these issues, we introduce two complementary components. Text-Anchored Semantic Mask (TSMa) leverages class-level text features as semantic anchors to identify semantically aligned channels through channel-wise interaction between visual and text features. By suppressing style-induced spurious responses and emphasizing class-intrinsic signals, TSMa enlarges inter-class similarity margins and mitigates class confusion. We further propose Stage-Aligned Hierarchical Autoregressive Regression (SHARe), which reformulates localization as a hierarchical autoregressive process that progressively refines bounding boxes across multiple stages. SHARe leverages the layer-wise characteristics of ViT representations by aligning feature abstraction levels with regression stages: deeper layers guide early coarse localization, while shallower layers rich in edge and texture cues refine spatial details in later stages. Experiments on COCO demonstrate a new state of the art, outperforming the previous best by +10.1 nAP, with extensive analysis validating each component. The code is available at this https URL.

[CV-59] Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs

Link: https://arxiv.org/abs/2606.23063
Authors: Chuangxin Zhao,Canran Xiao,Siyuan Ma,Mengyao Lyu,Yanbiao Ma,Jun Xia,Guiguang Ding,Yang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal large language models (MLLMs) are increasingly required to adapt to non-stationary streams of visual domains, question types, and user instructions, yet continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing continual vision-language methods mainly preserve outputs, replay data or pseudo-data, regularize embedding geometry, or allocate task-specific parameters, but they provide limited control over how internal cross-modal attention patterns supporting old skills drift during adaptation. We propose Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that preserves skill-conditioned structures of cross-modal attention. ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions instead of replaying past image-question pairs, generated pseudo-examples, or old-stage teacher snapshots. In later stages, a phase-invariant spectral regularizer constrains harmful drift of these prototypes while allowing instance-level attention to adapt to new tasks. We provide theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, and that Fourier power spectra are stable to spatial translations and bounded perturbations. Experiments on continual VQA and multimodal instruction-tuning benchmarks, including VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, show that ASR consistently improves final performance and reduces forgetting over strong replay-, regularization-, and adapter-based baselines. Preserving skill-level attention structure is an effective and lightweight mechanism for continual MLLMs. Code is available at this https URL
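
The claim that Fourier power spectra are stable under spatial translation can be checked numerically in a simplified setting. The sketch below uses a one-dimensional toy "attention row" and a circular shift, for which the power spectrum is exactly invariant up to rounding. The paper works with two-dimensional cross-attention maps and its own spectral statistics, so this is a simplified stand-in rather than the authors' regularizer, and the helper names are hypothetical.

```python
import cmath


def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def circular_shift(signal, shift):
    """Translate the signal to the right with wrap-around."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)


# Toy attention weights over 8 positions (made-up values).
attention_row = [0.05, 0.10, 0.40, 0.30, 0.10, 0.03, 0.01, 0.01]
shifted_row = circular_shift(attention_row, 3)

original_power = power_spectrum(attention_row)
shifted_power = power_spectrum(shifted_row)
max_gap = max(abs(a - b) for a, b in zip(original_power, shifted_power))
print(max_gap)
```

A shift only multiplies each Fourier coefficient by a unit-magnitude phase factor, so the squared magnitudes agree. This is why a statistic built on the power spectrum can tolerate an attention peak moving to a different location while still responding to changes in its scale or shape.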

[CV-60] VolHuMe: a High-Resolution Large Scale Dataset of Volumetric Human Meshes

Link: https://arxiv.org/abs/2606.23062
Authors: Giulia Martinelli,Niccolò Bisagno,Nicola Garau,Esa Rahtu,Nicola Conci
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce VolHuMe, a dataset of high-quality 4D human scans captured with a state-of-the-art volumetric studio using 64 RGB and 32 depth cameras. VolHuMe contains individual captures of 104 subjects and provides extensive ground truth, including SMPL-X, high-resolution meshes, multi-view RGB/depth images, rigged meshes, point clouds, garment segmentation, and detailed hand and facial geometry. Unlike prior datasets that primarily rely on full-body imagery, VolHuMe uses a close-range, high-resolution capture setup that preserves fine-grained body-part details, improving geometric fidelity and texture resolution. We benchmark VolHuMe on state-of-the-art methods across 3D and 4D human reconstruction tasks, showcasing the dataset’s quality and exposing the limitations of current evaluation testbeds.

[CV-61] MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning

Link: https://arxiv.org/abs/2606.23061
Authors: Weile Guo,Shenghong He,Danying Mo,Chengdong Xu,Xuexun Liu,Chao Yu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Motion instruction generation in cross-video comparison aims to produce corrective feedback that describes the differences between a query and a reference motion. However, existing models often generate instructions that exhibit motion hallucinations, failing to reflect actual kinematic differences between paired videos. To systematically investigate these hallucinations, we introduce MotionHalluc, a dedicated benchmark for evaluating motion hallucinations in paired-video comparison. MotionHalluc comprises 1540 fine-grained questions over 553 video pairs, evaluating hallucinations along three core dimensions: (1) directional hallucination, (2) attributional hallucination, and (3) temporal hallucination. Extensive evaluations of state-of-the-art large multimodal models demonstrate high susceptibility to these hallucinations. Furthermore, we provide Perceive-Parse-Verify (PPV) as a training-free measurement extraction and verification baseline that converts candidate instructions into executable measurement queries and supplies kinematic measurements at inference time. Our results show that this simple measurement injection yields an average 10.6% performance gain across models, suggesting that motion reasoning with explicit quantitative measurements is a key factor in reducing hallucinations in cross-video comparison. Our code and dataset will be made publicly available upon acceptance.

[CV-62] Three-Step Hierarchical Transformer for Multi-Pedestrian Trajectory Prediction

Link: https://arxiv.org/abs/2606.23058
Authors: Raphaël Delécluse,Hazem Wannous,Laurent Grisoni,Laurent Guimas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Pedestrian trajectory prediction requires modeling temporal dynamics, multimodal cues, and social interactions in crowded environments. Existing methods often address these factors separately or entangle them in costly attention blocks, limiting scalability, flexibility, and interpretability. We propose a three-step hierarchical Transformer that explicitly separates temporal encoding, multimodal fusion, and scene-level interaction reasoning. Lightweight GRU summaries enable efficient cross-modal attention, while social attention over time–agent tokens captures inter-pedestrian influences at manageable cost. Experiments on JTA, JRDB, and the Pedestrians and Cyclists in Road Traffic dataset show state-of-the-art performance on real-world datasets (JRDB, Urban) and competitive results on JTA. Ablation and qualitative analyses confirm the contribution of each stage and the model’s ability to anticipate complex behaviors such as early turning.

[CV-63] UECP: Uncertainty-Enhanced Collaborative Perception

Link: https://arxiv.org/abs/2606.23046
Authors: Kang Yang,Tianci Bu,Peng Wang,Deying Li,Wen Jie,Yongcai Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 10 figures

Abstract:Collaborative perception serves as a pivotal solution to enhance the perception capability of individual agents in autonomous driving, where a core challenge lies in seeking reliable evidence to quantify and weight the contribution of each participating agent. Existing methods typically rely on a confidence map, which is co-trained with the detection head, but it is inherently correlated with the detection results and thus fails to provide unbiased physical evidence. Furthermore, how to deeply integrate evidence into the cooperative fusion process remains an open question. To address these issues, this paper first proposes an uncertainty map, a physically grounded and unambiguous metric for evaluating perception quality. This map is directly supervised by real-time sensor signals, i.e., LiDAR point density, ensuring decoupling from detection noise and thereby providing physical scenario-aware evidence for weighting agent contribution. Based on this map, we develop the Uncertainty-Enhanced Collaborative Perception (UECP) framework, centered on the Uncertainty-Aware Pyramid Fusion (UAPF) module. UAPF uses a coarse-to-fine strategy, with two key components: Uncertainty-Weighted Downsampling (UWD) for high-fidelity feature preservation, and Uncertainty-Guided Residual Fusion (UGRF) to reinforce ego features, suppressing noise and ensuring robust fusion. Extensive experiments on real-world datasets show UECP outperforms state-of-the-art methods in effectiveness and robustness by embedding the uncertainty map into fusion. Code will be publicly available.

[CV-64] SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models ECCV2026

Link: https://arxiv.org/abs/2606.23041
Authors: Hongxiang Li,Hongxu Chen,Chenyang Zhu,Xiaoshuang Huang,Jiayin Cai,Xiaolong Jiang,Yao Hu,Long Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV2026

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring Semantic-Pixel self-alignment and Adaptive Routing (SPAR). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.

[CV-65] DrivingVoxels: Compositional Sparse Voxel Rasterization for Dynamic Driving Scene Reconstruction

Link: https://arxiv.org/abs/2606.23031
Authors: Tania Aguirre,Luis Roldão,Moussab Bennehar,Nathan Piasco,Dzmitry Tsishkou,Simone Rossi,Pietro Michiardi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing dynamic urban scenes remains challenging due to the unbounded nature of driving environments and the presence of multiple dynamic objects. Currently, potentially faster sparse voxel methods are mainly designed for static scenarios. On the other hand, dynamic approaches based on 3D Gaussian Splatting, despite their high fidelity, are often time-consuming for driving scenarios and exhibit uncontrollable memory growth in large scenes. To address these limitations, we present DrivingVoxels, a compositional sparse voxel rendering framework for dynamic driving scenes. Our method jointly rasterizes sparse voxels from multiple independent octrees within a single rendering pass. Each rigid dynamic object is represented by an octree defined in its local coordinate frame, while a separate static octree models the stationary background. DrivingVoxels adopts a fully explicit, neural-free representation together with a LiDAR-guided structural initialization that efficiently captures scene geometry. We evaluate our framework on the PandaSet benchmark, demonstrating that DrivingVoxels performs on par on perceptual metrics and better on structural metrics for NVS and reconstruction while requiring shorter training times than previous 3DGS-based methods, thanks to an efficient optimization workflow anchored by a strong LiDAR prior.

[CV-66] Physics-Guided Spatiotemporal State Space Modeling for Lookahead Molten Pool Segmentation in Laser Wire-Feed Welding

Link: https://arxiv.org/abs/2606.23028
Authors: Sen Li,Haichao Cui,Changhao Yin,Chendong Shao,Yaqi Wang,Xinhua Tang,Fenggui Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-time weld-pool perception is critical for closed-loop control in laser wire-feed welding, where sensing, computation, and actuator response introduce unavoidable delay. This paper presents a physics-guided spatiotemporal state space network for lookahead weld-pool segmentation. The model uses historical coaxial grayscale images, welding process parameters, and aligned wire-state electrical signals to predict the future semantic layout of three physically meaningful regions: keyhole, wire, and molten pool. It combines a visual encoder, process- and sensor-conditioned feature normalization, patch-level temporal state space modeling, horizon-conditioned latent prediction, dense future feature prediction, and a motion-aware mask decoder. Auxiliary signed-distance-function supervision, temporal consistency, feature distillation, and fine-grained keyhole losses further constrain the predicted geometry and local motion. Experiments on a 43-sequence laser welding dataset show that the proposed WeldMamba reaches 74.63% mIoU at a 500 ms lookahead. Ablation studies further show that temporal history, patch-level state space modeling, and keyhole motion awareness are the main contributors to robust future segmentation.

[CV-67] Learning Stable Canonical Worlds for Novel View Synthesis and Beyond

Link: https://arxiv.org/abs/2606.23027
Authors: Xiaoyu Xu,Jian Zou,Sheyang Tang,Zhihua Wang,Jing Liao,Kede Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Feed-forward Gaussian splatting (FFGS) facilitates real-time novel view synthesis, yet current methods often remain tied to view-dependent predictions. As more input views are added, they may accumulate noisy or redundant evidence instead of converging to a stable scene representation. In this paper, we introduce CanonicalGS, a feed-forward pipeline that maps cluttered multi-view observations into a stable, scene-centric representation. CanonicalGS first extracts view-centric evidence from depth, semantic features, and uncertainty estimates, and then aggregates this evidence in a canonical latent world using uncertainty-aware fusion. By emphasizing reliable observations while suppressing uncertain or redundant ones, CanonicalGS produces representations that scale more effectively for novel view synthesis and transfer to downstream visual perception tasks. Experiments show up to a 2.5 dB improvement in peak signal-to-noise ratio for synthesizing novel views and an 11% gain in semantic segmentation accuracy.
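One common way to emphasize reliable observations while suppressing uncertain ones is inverse-variance weighting. The abstract does not state which fusion rule CanonicalGS uses, so the sketch below is a generic illustration under that assumption; `fuse` and the per-view depth values are hypothetical.

```python
def fuse(estimates, variances):
    """Inverse-variance weighted average of per-view estimates.

    Views with small variance (high confidence) dominate the result;
    the fused variance is never larger than the smallest input variance.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * e for w, e in zip(weights, estimates)) / total
    fused_var = 1.0 / total
    return fused, fused_var

# Three views estimate the depth of the same point (made-up numbers).
# The third view is an outlier but also reports a large uncertainty.
depth, var = fuse([2.0, 2.2, 5.0], [0.01, 0.04, 4.0])
print(depth, var)
```

With this rule, adding a noisy view cannot increase the fused variance, which is one way additional views could help rather than hurt.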

[CV-68] Boosting Neural Video Codec via Scale-Driven Online Flow Refinement ICME2026

Link: https://arxiv.org/abs/2606.23023
Authors: Tiange Zhang,Rongqun Lin,Haocheng Tang,Xiandong Meng,Weijia Jiang,Zhimeng Huang,Siwei Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2026 as an oral paper

Abstract:Although state-of-the-art neural video codecs (NVCs) have achieved remarkable performance, they suffer from limited generalization when encountering complex motion patterns unseen during training. To bridge this domain gap without the expensive cost of online fine-tuning, we propose a Training-Free Scale-Driven Online Flow Refinement (SOFR) method. Serving as a plug-and-play module, SOFR integrates motion information from coarse and fine scales and dynamically fuses them according to warping accuracy, effectively rectifying motion estimation errors with negligible computational overhead. Furthermore, we design a rate-aware strategy that selects different dynamic fusion strategies according to bitrate modes, and employs a reliability check based on warping error to ensure robustness. Extensive experiments on the USTC-TD dataset verify the effectiveness and generalization of SOFR across various NVC frameworks, including DCVC-SDD, DCVC-FM, and EHVC. Notably, it brings an average of 2.84% and 4.05% bitrate savings in terms of PSNR and MS-SSIM, respectively, to DCVC-FM with negligible coding time increase. Our code is available at this https URL.

[CV-69] ScalingAttention: Discovering Intrinsic Sparse Attention Topology for Video Diffusion Transformers

Link: https://arxiv.org/abs/2606.23019
Authors: Ruiliang Zhou,Xuecheng Wu,Kang He,Guangyun Han,Bin Liu,Qinqin Chen,Wende Xu,Qingjie Zhao,Chengru Song
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures

Abstract:While Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, their reliance on 3D full attention creates a quadratic computational bottleneck. Existing sparse methods face a dilemma: dynamic pruning suffers from prohibitive runtime overhead and memory fragmentation, while static heuristics fail to capture fine-grained dependencies. In this work, we propose ScalingAttention, a training-free framework grounded in a key inductive bias: while individual activations are input-dependent, the high-mass attention regions for each head rapidly converge to a stable, prompt-agnostic Intrinsic Sparse Topology. This topology is weight-encoded, scale-invariant, and efficient to extract. ScalingAttention decouples topology discovery from sparsity control via: (1) WEST (Weight-Encoded Sparse Topology), which extracts a robust block-sparse prior mask offline to eliminate runtime search; (2) FAST (Fidelity-Aware Sensitivity Tuning), which adaptively tunes head-wise sparsity based on diffusion fidelity requirements. To ensure practical acceleration, we co-design a hardware-aligned bit-wise block-sparse kernel. Experiments on Wan2.1 show up to 1.90X end-to-end speedup with superior fidelity, establishing a new Pareto frontier over state-of-the-art baselines.

[CV-70] From Point Estimates to Distributions: GMM Pooling for MIL in Preterm Birth Prediction MICCAI2026

Link: https://arxiv.org/abs/2606.23005
Authors: Hussain Alasmawi,Numan Saeed,Soha Said,Mohammad Yaqub
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: MICCAI 2026

Abstract:Preterm birth (PTB) prediction can enable targeted surveillance and timely intervention, yet most ultrasound-based models use a single selected transvaginal ultrasound (TVUS) frame per patient despite routine exams acquiring multiple cervical images. We formulate PTB prediction as a multiple instance learning (MIL) problem, representing each patient as a variable-sized bag of TVUS images with a single outcome label. To move beyond standard MIL aggregators that collapse a bag into a point estimate, we propose Gaussian Mixture Model (GMM) pooling, which summarizes all images in a bag into a fixed-length representation by modeling their feature distribution. This design captures intra-patient variability. We evaluate the method on a private clinical cohort and on a public lymph node metastasis benchmark. For PTB prediction, GMM pooling improves PR-AUC over the instance-based model from 0.44 to 0.56. On the lymph node benchmark, it achieves state-of-the-art performance with 0.91 F1-score and 0.89 ROC-AUC for classification and 0.18 MAE for regression. The code is publicly available at this https URL.
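The idea of turning a variable-sized bag into a fixed-length vector by fitting a mixture can be sketched in a few lines. This is a simplified illustration, not the paper's method: each instance is assumed to be a single scalar feature, the mixture is fitted with plain EM, and `gmm_pool` is a hypothetical name.

```python
import math

def gmm_pool(bag, k=2, iters=20):
    """Fit a 1-D k-component GMM to a bag with EM (k >= 2).

    Returns weights + means + variances, a vector of length 3 * k
    regardless of how many instances the bag contains.
    """
    lo, hi = min(bag), max(bag)
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each instance.
        resp = []
        for x in bag:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / len(bag)
            means[j] = sum(r[j] * x for r, x in zip(resp, bag)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, bag)) / nj + 1e-6
    return weights + means + variances

small_bag = [0.1, 0.2, 0.9]                          # 3 instances
large_bag = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9, 0.95]   # 7 instances
rep_small = gmm_pool(small_bag)
rep_large = gmm_pool(large_bag)
print(len(rep_small), len(rep_large))
```

Both bags map to a vector of the same length, so a downstream classifier can consume them without padding; the variance terms retain the intra-bag spread that mean or max pooling would discard.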

[CV-71] MotionMAR: Multi-scale Auto-Regressive Human Motion Reconstruction from Sparse Observations ICML2026

Link: https://arxiv.org/abs/2606.23000
Authors: Yuhua Luo,Junsheng Zhang,Mengyin Liu,Xincheng Lin,Ming Yan,Zhudi Chen,Chenglu Wen,Lan Xu,Siqi Shen,Cheng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2026

Abstract:Human motion follows a temporal hierarchical structure, transitioning from low-frequency global trajectories to high-frequency details. Inspired by the success of multi-level autoregressive models in computer vision, we propose MotionMAR, a coarse-to-fine framework for motion reconstruction from sparse observations. It first estimates the global trajectory of human motion and then gradually refines the temporal details. This architecture consists of four integrated components. The Temporal Multi-scale Tokenization (TMT) VQ-VAE encodes the data at multiple temporal resolutions, separating semantic motion from minor jitters. The Motion Autoregressive Network (MAN) operates in this latent space, predicting motion across scales. It first establishes the global structure through coarse indices and then generates finer indices to recover specific details. Meanwhile, the Scale-Aware Control (SAC) module integrates sparse tracking data to ensure the generated output aligns with actual observations. The Motion Refinement Network (MRN) subsequently smooths consecutive poses and eliminates quantization artifacts. Experiments show that MotionMAR achieves state-of-the-art accuracy on the AMASS dataset, providing a reliable and structure-aware approach for motion reconstruction. The source code is publicly available at this http URL.

[CV-72] Black-Box Continual Learning for Vision-Language Models

Link: https://arxiv.org/abs/2606.22999
Authors: Yuting Li,Weihang Fang,Haoyuan Gao,Linghe Kong,Yexin Li,Lichao Sun,Weiran Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid deployment of Vision-Language Models (VLMs) in dynamic environments necessitates the ability to learn continuously without forgetting. However, traditional continual learning (CL) settings often rely on white-box paradigms, which are increasingly invalidated by the shift toward cloud-hosted models. In this paper, we introduce Black-CL, a more realistic benchmark for VLMs that enforces three primary real-world challenges: weight and architecture inaccessibility, constrained computation, and task-agnostic inference. The learner can query only output embeddings or logits, with no gradient flow through or structural modification of the backbone. Current CL methodologies, which rely on backbone backpropagation or complex parameter expansion, are fundamentally incompatible with these constraints. Under this setting, we propose BETA, a simple yet effective baseline built on the key insight that solely optimizing textual prototypes can navigate the complexities of CL. BETA integrates three core components: Semantic Projection Accumulation (SPA) for incremental knowledge acquisition, Latent Distribution Replay (LDR) for anchoring the embedding space against catastrophic forgetting, and Test-Time Prototype Adaptation (TTPA) for dynamic, instance-aware boundary refinement. Extensive experiments across ten diverse datasets and various backbones demonstrate that BETA significantly outperforms existing black-box tuners. Remarkably, with only 0.05 M trainable parameters, a 180–3000× reduction compared to competitive methods, BETA achieves performance on par with or even exceeding white-box CL methods. We believe Black-CL and BETA provide a foundational framework for future advancements in continual learning and accelerate the transition of continual learning from academia to real-world systems.

[CV-73] Can Single-View Mesh Reconstruction Generalize to Robot Camera Rotation?

Link: https://arxiv.org/abs/2606.22987
Authors: Yu Zhan,Guangcheng Chen,Hanjing Ye,Zhiqin Cheng,Zanjia Tong,Wenjun Xu,Hong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Single-view mesh reconstruction predicts object meshes and spatial layouts from a single observation, making it attractive for fast robot spatial reasoning and real-to-sim digital twins. However, robot-mounted cameras naturally rotate during manipulation and navigation, while learned single-view reconstruction models often rely on view-dependent priors and may generalize poorly to out-of-distribution camera rotations. Such rotations can introduce 3D inconsistencies, incorrect layouts, and violations of physical constraints, but this failure mode remains under-evaluated. We introduce an evaluation protocol with controlled axis-wise roll, pitch, and yaw sweeps to trace errors in monocular depth estimation (MDE), canonical object meshes, camera-space layout, and physical plausibility within a representative SAM3D-style pipeline. On the Aria Digital Twin dataset and a real Franka wrist-camera sequence, camera rotations induce MDE distortion, layout drift, and collision penetration, while canonical mesh predictions remain relatively stable. A two-stage SAM3D+FoundationPose pipeline is more robust than one-stage feed-forward layout prediction, and our Gravity-Aware Refinement reduces one-stage pairwise ICP-based layout-orientation error by 47.1%. Our evaluation reveals that current single-view mesh reconstruction methods generalize poorly to robot camera rotation, and suggests that explicit gravity cues are important for reliable robotic single-view mesh reconstruction.

[CV-74] Subject-Level Unknown-Identity Identification from Leap Motion Controller 2 Hand Landmarks

Link: https://arxiv.org/abs/2606.22986
Authors: Bahar Moharrer,Susanna Cifani,Marco Raoul Marini,Luigi Cinque,Maria De Marsico
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS 2026)

Abstract:This work studies subject recognition from Leap Motion Controller 2 (LMC2) hand landmark data under a subject-level unknown-identity identification protocol on the Multi View Leap2 Hand Pose (ML2HP) dataset. Using only the landmark modality, we retain the original geometric representation and enrich it with fingertip-to-palm distances and palm-normalized inter-finger angular descriptors. Evaluation is performed under a Leave-One-Subject-Out (LOSO) protocol in which, for each outer fold, one subject is excluded from the enrolled set and treated as unknown at test time. To avoid tuning on the true outer unknown subject, the unknown-rejection threshold is selected in an inner validation step by temporarily withholding one enrolled subject from the inner gallery and using it only for threshold estimation. We compare a tree ensemble baseline with two neural alternatives: a learned embedding baseline based on centroid matching and cosine-similarity-based rejection, and an MLP+OpenMax model, which represents a more established open-set recognition approach. Under this evaluation setup, Extra Trees remains the strongest overall method, indicating that the main challenge on this benchmark is not enrolled-subject discrimination alone, but robust score separation between known and unknown probes. The results support the feasibility of compact, interpretable landmark-based descriptors for contactless hand-based unknown-subject rejection and identification on a small-cohort dataset.
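The learned-embedding baseline described above, centroid matching with cosine-similarity-based rejection, follows a simple decision rule that can be sketched as follows. The embeddings, subject IDs, and threshold here are made up for illustration; in the paper's protocol the threshold is chosen in an inner validation step, which this sketch does not reproduce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def identify(probe, gallery, threshold):
    """Return the best-matching enrolled subject, or "unknown" if the
    best cosine similarity falls below the rejection threshold."""
    best_id, best_sim = None, -1.0
    for subject_id, center in gallery.items():
        sim = cosine(probe, center)
        if sim > best_sim:
            best_id, best_sim = subject_id, sim
    return best_id if best_sim >= threshold else "unknown"

# Hypothetical 2-D embeddings for two enrolled subjects.
gallery = {
    "A": centroid([[1.0, 0.1], [0.9, 0.0]]),
    "B": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
known = identify([1.0, 0.05], gallery, threshold=0.9)
unknown = identify([0.7, 0.7], gallery, threshold=0.9)
print(known, unknown)
```

The second probe lies between the two centroids, so its best similarity stays below the threshold and it is rejected; this known-versus-unknown score separation is what the abstract identifies as the main difficulty.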

[CV-75] Humanoid-OmniOcc: Stereo-Based Full-View Occupancy Dataset for Embodied AI

Link: https://arxiv.org/abs/2606.22971
Authors: Xianda Guo,Bohao Zhang,Chenwei Huang,Shiyuan Chen,Ruilin Wang,Yiqun Duan,Cong Yang,Qin Zou,Wei Sui
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Occupancy prediction at voxel-level granularity is essential for safe robotic navigation and interaction in complex environments. Existing occupancy datasets, however, are predominantly designed for autonomous driving with vehicle-centric biases – forward-facing cameras, far-field geometry, and static road priors – limiting their applicability to embodied humanoid perception. We present Humanoid-OmniOcc, a large-scale panoramic stereo-based occupancy dataset tailored for humanoid robots. The dataset encompasses 15 diverse simulated indoor scenes and 5 real-world environments, yielding over 155K samples with broad scene and style diversity. Importantly, the dataset is designed around a Real2Sim2Real closed-loop paradigm: real sensor specifications drive physically accurate simulation, simulation produces large-scale annotated training data, and models trained in simulation are directly evaluated on real-world captures – enabling iterative refinement of the sim-to-real pipeline. We further propose a Humanoid Surround Stereo-guided Occupancy model (Humanoid-OmniOcc) that exploits robust depth priors for accurate 2D-to-3D lifting. Extensive experiments show that Humanoid-OmniOcc consistently outperforms monocular baselines and generalizes well to both unseen simulated test scenes and real-world environments, validating the effectiveness of the Real2Sim2Real design. Code and data will be available upon acceptance at this https URL.

[CV-76] Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation of SAM3 in Medical Image Segmentation

Link: https://arxiv.org/abs/2606.22963
Authors: Yubo Zhou,Jianghao Wu,Ping Ye,Shaoting Zhang,Guotai Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Concept segmentation models like Segment Anything Model 3 (SAM3) show strong generalization on natural images, yet their performance degrades in medical imaging due to the domain gap caused by different imaging principles and styles. Test-Time Adaptation (TTA) is essential for improving the testing performance by updating the model on the fly without annotations. However, existing vision-language TTA methods are mainly driven by image-level uncertainty minimization, which does not necessarily reflect region-level semantic correctness in medical segmentation. Moreover, they often lack mechanisms to maintain stability in continual one-pass adaptation, leading to limited performance when reliable dense supervision is missing for segmentation. To address these issues, we propose Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation (CM-TTA) of SAM3 for medical images. First, for a test sample with multiple augmentations, we introduce a novel Concept Alignment Contrast (CAC) metric, which leverages textual-visual semantic consistency to robustly evaluate prediction quality to select the best augmented view as the supervision. Second, to balance rapid and stable adaptation, we design a Long-Short Prompt Memory (LSPM) module. The short memory dynamically fuses recent prompts based on CAC scores for agile local adaptation, while the long memory maintains a stable global prompt to generate enhanced pseudo-labels. Finally, a Densely Supervised Prompt Update (DSPU) strategy is proposed to optimize the prompt embeddings with enhanced pseudo labels as dense supervision. Extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods for TTA of SAM3.

[CV-77] The Impact of VAE Design on Latent Pose Representations for Diffusion-based Sign Language Production

Link: https://arxiv.org/abs/2606.22959
Authors: Guilhem Fauré(MULTISPEECH),Mostafa Sadeghi(MULTISPEECH),Sam Bigeard(MULTISPEECH),Slim Ouni(LORIA)
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Latent diffusion approaches to sign language production (SLP) rely on an initial stage that learns an encoding of sign pose sequences, enabling generative modeling in the resulting latent space. The autoencoder used in this stage is typically evaluated in terms of reconstruction quality using geometric metrics common in SLP. While informative, these metrics do not fully capture latent space properties that may influence the training and performance of the downstream generative model. In this work, we investigate how architectural and training objective design choices in a variational autoencoder (VAE) for sign pose encoding affect latent space structure, and how these differences translate into the performance of a latent diffusion model for text-to-sign generation. Our experiments on the Phoenix14T dataset show that variations in generative performance, measured through back-translation BLEU scores, can sometimes be better explained by differences in latent space properties than by VAE reconstruction accuracy alone.

[CV-78] PG-MAP: Joint MAP Optimization for Inference-Time Alignment of Diffusion and Flow-Matching Models

Link: https://arxiv.org/abs/2606.22958
Authors: Ruolan Sun,Pawel Polak
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Inference-time alignment of pretrained text-to-image models is typically performed along a single control axis, such as classifier-free guidance, attention editing, or reward-based latent perturbations. This limitation prevents modeling joint dependencies between conditioning and latent variables and hinders transfer across generative transports. We propose PG-MAP, a training-free framework that formulates inference-time alignment as a trajectory-level Gibbs-MAP / proximal energy optimization over the conditioning c and latent state z_t via a forward-consistency coupling, optionally guided by a frozen preference reward. This joint formulation enables coordinated updates across modalities while remaining compatible with both diffusion and flow-matching models through transport-specific adaptations. Across diffusion backbones (SD 1.5, SDXL), PG-MAP consistently improves alignment metrics such as PickScore and Aesthetic, and can be effectively combined with tuned classifier-free guidance to achieve the strongest overall performance. On flow-matching models (SD3.5-medium), the framework reduces to a latent-only variant, achieving 91.9% PickScore and 75.7% HPS win rates against a static baseline, with controlled experiments ruling out noise-related artifacts. Human evaluations further confirm consistent preference over strong baselines, including tuned CFG and compute-matched universal guidance. Finally, an oracle-routing analysis shows that the relative importance of conditioning and latent optimization depends on prompt types, surfacing further headroom that a per-prompt selector could exploit.

[CV-79] Evo-RAD: Navigating Rare Retinal Disease Diagnosis via Self-Evolving Agentic Retrieval MICCAI2026

Link: https://arxiv.org/abs/2606.22955
Authors: Wangding Xia,Ye Du,Jiashi Lin,Meng Wang,Danli Shi,Shujun Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026. 10 pages, 2 figures, 3 tables

Abstract:Large-scale pretrained foundation models have revolutionized general medical screening, but often falter on rare diseases because such conditions are underrepresented in real-world clinical datasets. While retrieval-augmented diagnosis attempts to mitigate this, conventional static methods frequently succumb to the hubness problem, retrieving visually similar but semantically incorrect common diseases. To address this, we propose Evo-RAD, a self-evolving agentic framework that transforms evidence acquisition into a dynamic decision-making task. We formulate retrieval as a Markov Decision Process (MDP) where a graph-based agent observes the reference set state and executes actions to purge discordant evidence (DELETE), acquire pathologically consistent samples (INSERT), or conclude the evolution (TERMINATE). Optimized via Group Relative Policy Optimization (GRPO) with a homogeneity-aware reward, the agent learns to maximize the diagnostic homogeneity of the support reference set. Experiments on retinal disease benchmarks show that Evo-RAD substantially improves rare-disease diagnosis, outperforming retinal foundation models by +21.04%, while also surpassing retrieval-based and parameter-efficient fine-tuning methods by +3.56%. Code is available at this https URL.
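
The abstract names GRPO with a homogeneity-aware reward but gives no formulas. As a rough illustration only, the sketch below shows the standard group-relative advantage used in GRPO (each rollout's reward normalized against its own sampling group), paired with a hypothetical homogeneity proxy: the share of the reference set carrying the majority label. The function names, the proxy reward, and the sample labels are assumptions for illustration, not the paper's definitions.

```python
from collections import Counter
from statistics import mean, pstdev


def homogeneity(labels):
    """Hypothetical proxy reward: fraction of the reference set sharing the majority label."""
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one query, each ending with a different reference set.
final_sets = [
    ["RP", "RP", "DR", "AMD"],  # mixed evidence
    ["RP", "RP", "RP", "DR"],   # one discordant sample left
    ["RP", "RP", "RP", "RP"],   # fully homogeneous
    ["RP", "DR", "DR", "AMD"],  # mixed evidence
]
rewards = [homogeneity(s) for s in final_sets]
advantages = group_relative_advantages(rewards)
```

Rollouts that end with a more homogeneous reference set receive a positive advantage relative to their group, which is the signal a policy-gradient update would reinforce.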

[CV-80] ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents

Link: https://arxiv.org/abs/2606.22948
Authors: Yincheng Zhou,Athena Zhuoming Zhong,Shijie Zhang,Kevin Zhang,Teresa Xiaotao Shang,Shanghang Zhang
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As multimodal agents move from interface understanding to real software control, successful trajectory discovery in live desktop environments becomes a key challenge. GUI tasks require long-horizon sequences of precise mouse and keyboard actions, while feedback is sparse, delayed, and costly to obtain through VM rollouts. We propose Environment-Native Verified Search (ENVS), a training-time search-and-filter pipeline that uses the environment to construct verified supervision before policy optimization: it branches over behaviorally distinct GUI actions in live OSWorld VMs, verifies successful leaves, and trains from globally balanced step-level supervision. To evaluate robustness under realistic desktop interruptions, we also introduce OSWorld-Noisy, a dynamic benchmark for recoverable desktop interruptions that preserves the original tasks while testing whether agents can refocus, dismiss, wait, or recover under live perturbations. On the 300-task OSWorld pool, ENVS reaches 30.3 pass@8 on original evaluations and 29.0 on OSWorld-Noisy, outperforming matched ARPO-style online RL while reducing compute from 184-192 to 138-153 GPU-hours; even with only 30% of its search data, ENVS reaches 27.0 pass@8, exceeding ARPO from the base model. Training from noisy environments also better preserves visual-reasoning abilities on auxiliary benchmarks, including OSWorld-G Refusal (16.7 vs. 1.9) and BLINK Functional Correspondence (26.2 vs. 23.1).

[CV-81] Controllable Texture Tiling with Transformed RoPE-Enhanced Diffusion Models

Link: https://arxiv.org/abs/2606.22945
Authors: Junrong Huang,Zhiyuan Zhang,Rui Tang,Hongbo Fu,Jing Liao
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code and dataset are publicly accessible at this https URL

Abstract:Realistic integration of user-specified textures into scene images is a fundamental task in computer graphics and image editing. While existing material transfer and reference-guided inpainting methods can edit surface appearances, they often fail to address the specific requirements of texture tiling. This task necessitates precisely repeating a reference pattern according to user-defined parameters such as frequency, orientation, and scale. Furthermore, current generative approaches often struggle to maintain the structural fidelity of the reference texture, limited by either destructive pixel-level resampling or the lack of fine-grained spatial information in semantic image encoders, and they frequently fail to preserve the coherent lighting and geometry of the original scene. In this paper, we propose a novel framework for controllable and high-fidelity texture tiling based on Diffusion Transformers. Our approach introduces two key technical innovations to decouple spatial manipulation from content generation. First, we propose a Coordinate-Transformed Rotary Embedding mechanism. By applying 2D affine transformations directly to the relative positional embeddings between the target latent and the image condition, we achieve precise control over tiling patterns without explicit pixel warping, thereby utilizing the full information of the reference condition without degradation. Second, a Disjoint Attention Mask is employed to shield reference features from semantic leakage. This preserves structural integrity while seamlessly blending the synthesized texture with the scene’s original lighting and geometry. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in both control accuracy and texture fidelity.

[CV-82] Evaluating self-supervised echocardiographic representations across downstream extraction strategies for left-ventricular segmentation and ejection fraction estimation

Link: https://arxiv.org/abs/2606.22943
Authors: Sylwia Majchrowska,Philip Teare
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised learning (SSL) is increasingly used in medical imaging to reduce annotation requirements, but representation quality is often judged using a single downstream evaluation setting. For dense clinical tasks, this can confound representation quality with the capacity of the downstream model used to recover task-relevant information. We present a systematic evaluation of self-supervised representations for left-ventricular segmentation and ejection fraction (EF) estimation from apical four-chamber echocardiography on EchoNet-Dynamic. Rather than relying on a single downstream probe, we compare a hierarchy of extraction strategies with increasing expressivity: heuristic extraction without mask-supervised training, frozen linear probes, frozen lightweight decoder probes, and partial fine-tuning. We apply this framework to two complementary representation families: generic frozen self-DIstillation with NO labels (DINOv3) features and a task-adapted dense self-supervised representation, Bootstrap Your Own Segmentation (BYOS). In both families, heuristic extraction substantially understated what was recoverable from the frozen representation. For DINOv3, performance improved from Dice 0.684 and EF mean absolute error (MAE) 13.01 under heuristic extraction to Dice 0.906 and EF MAE 9.65 with a frozen lightweight decoder, approaching a supervised U-Net baseline (Dice 0.915, EF MAE 9.72). For BYOS, performance improved from Dice 0.687 and EF MAE 17.83 under heuristic extraction to Dice 0.902 and EF MAE 8.74 with a frozen lightweight decoder. These results show that conclusions about self-supervised representation quality in dense echocardiographic analysis depend strongly on the downstream extraction strategy used for evaluation. We therefore argue that multi-strategy evaluation is an important methodological consideration for SSL in dense medical image analysis.
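
The abstract reports Dice, ejection fraction (EF), and EF mean absolute error (MAE). For readers unfamiliar with these metrics, here is a minimal sketch using the textbook definitions. It is not the authors' evaluation code, and the abstract does not say how ventricular volumes are derived from masks, so the volume inputs below are simply assumed to be given.

```python
def dice(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def ejection_fraction(edv, esv):
    """EF in percent from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv - esv) / edv


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Toy values chosen for illustration only.
mask_score = dice([1, 1, 0, 0], [1, 0, 0, 0])
ef = ejection_fraction(edv=120.0, esv=50.0)
ef_mae = mean_absolute_error([58.0, 40.0], [60.0, 45.0])
```

A Dice of 0.906 therefore describes mask overlap, while an EF MAE of 9.65 is an average error in EF percentage points; the two can move independently, which is why the abstract reports both.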

[CV-83] Hybrid Compression: Integrating Pruning and Quantization for Optimized Neural Networks

Link: https://arxiv.org/abs/2606.22935
Authors: Minh-Loi Nguyen,Long-Bao Nguyen,Van-Hieu Huynh,Minh-Triet Tran,Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2024

Abstract:Deep neural networks have witnessed remarkable advancements in recent years and have become integral to various applications. However, alongside these developments, training and deployment of neural network models on embedded and edge devices face significant challenges due to limited memory and computational resources. These problems can be addressed with deep neural network compression, which involves a trade-off between model size and performance. In this paper, we propose a novel method for model compression through two phases. First, we utilize model compression techniques, such as pruning and quantization, to significantly reduce the model size. Then, we use Mixture of Experts to route the previously compressed models to enhance performance while maintaining a balance in inference efficiency. MoEs consist of multiple expert models (i.e., compressed models) that are moderately sized and deliver stable performance. Experimental results on several benchmark datasets show that our method successfully compresses CNN models, achieving substantial reductions in FLOPs and parameters with a negligible accuracy drop.

[CV-84] BEV-Denoise: Learning Intrinsic Noise for Accurate Birds-Eye-View Semantic Segmentation

Link: https://arxiv.org/abs/2606.22931
Authors: Dooseop Choi,Kyounghwan An,Kyoung-Wook Min
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we present a framework dubbed BEV-Denoise that estimates and removes intrinsic noise from learned Bird’s-Eye-View (BEV) features to achieve accurate BEV semantic segmentation. Inspired by the noise estimation capability of Denoising Diffusion Probabilistic Models (DDPM), we design a UNet-based noise estimation module that learns to estimate the noise from the learned BEV features. The estimated noise is then subtracted from the BEV features and fed to BEV map decoders for the final prediction results. To facilitate supervision for the noise estimation module, we follow a sequential learning paradigm called Task Decomposition (TD) where a pre-trained BEV map autoencoder is employed to train a view transformation (VT) encoder. We share three key insights learned from our intensive experiments that are critical for improved performance. We apply our framework to four existing models, encompassing the three major VT paradigms. Experimental results on a large-scale real-world dataset, nuScenes, demonstrate the effectiveness of our framework.

[CV-85] MythraGen: Two-Stage Retrieval Augmented Art Generation Framework

Link: https://arxiv.org/abs/2606.22924
Authors: Quang-Khai Le,Cong-Long Nguyen,Minh-Triet Tran,Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2024

Abstract:Text-to-image generation has seen rapid advancements, especially with the development of generative models. However, challenges remain in achieving high-quality, contextually accurate image outputs that faithfully match the provided textual descriptions, especially in artistic generation. In this paper, we present a simple yet efficient retrieval augmented generation framework, namely MythraGen, for text-to-artistic image generation by integrating an art retrieval mechanism with LoRA-based model fine-tuning. Our method extracts features from a large-scale art dataset, optimizing the generation process by combining artist-specific styles and content. Particularly, retrieved images from an external art database that have the highest similarity to the query prompt are used to finetune Stable Diffusion using LoRA for desired art generation. Experimental results and user studies on the WikiArt dataset show that our proposed method can generate artworks that closely match the user’s input, significantly outperforming existing solutions.

[CV-86] Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

Link: https://arxiv.org/abs/2606.22918
Authors: Yu Cao,Ziquan Liu,Zhensong Zhang,Jiankang Deng,Shaogang Gong,Jifei Song
Categories: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Maintaining physical consistency in video generators and world models increasingly relies on vision-language models (VLMs) as automated judges that provide reward signals, ranking decisions, and data-filtering criteria. Yet VLMs differ substantially in training data and architecture, encoding physical phenomena through distinct internal representations. A single global evaluation schema therefore gives every VLM the same axes of competence, regardless of what each can actually perceive. We propose JudgeFit, an iterative refinement procedure that discovers a per-VLM evaluation taxonomy. An initial taxonomy is constructed by prompting the target VLM to enumerate physics errors on a small set of videos and clustering the resulting descriptions. The taxonomy is then refined through a diagnostic step: we calibrate the VLM’s per-dimension scores to human physical-commonsense ratings, diagnose which dimensions it scores unreliably or redundantly, and prompt an LLM to repair them, iterating until convergence. We further instantiate this procedure as a benchmark and apply it to 16 VLMs spanning eight model families. The refined taxonomy outperforms the global-schema baseline on held-out videos for every VLM tested, with a mean relative improvement of approximately 32%. Beyond aggregate accuracy, the per-VLM profiles expose model-specific blind spots that overall rankings cannot anticipate, with reliability patterns differing markedly across model families.

[CV-87] Intend Reflect Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving

Link: https://arxiv.org/abs/2606.22913
Authors: Zisheng Chen,Yuping Qiu,Jianhua Han,Tao Tang,Xiuwei Chen,Likui Zhang,Ying-Cong Chen,Hang Xu,Xiaodan Liang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent Vision-Language-Action (VLA) models have advanced end-to-end autonomous driving by incorporating reasoning for better interpretability and planning quality. However, most existing approaches directly generate the final trajectory without explicitly examining its future consequences, which limits their reliability in complex and dynamic environments. To address this limitation, we propose IRR-Drive (Intend, Reflect, Refine), an adaptive multimodal reflection framework for autonomous driving. Specifically, to tightly couple high-level reasoning with physical constraints, IRR-Drive first generates a preliminary textual intention and anticipates potential interactions by predicting future semantic bird’s-eye view (BEV) representations. This dual-modality (Text + BEV) reflection space explicitly models anticipated scene evolution, enabling the model to rigorously self-correct and refine its initial intent before generating the final trajectory. Furthermore, to balance planning performance and computational efficiency, we construct reflection-oriented training data and design an adaptive reflection reward, enabling the model to adaptively select its reasoning mode according to scene complexity. Instead of using reasoning primarily as an auxiliary interpretation, IRR-Drive directly integrates an adaptive reflection mechanism into the planning framework, enabling grounded, decision-aware trajectory correction that is driven by scene complexity. Our method achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS. Extensive experiments demonstrate the effectiveness of our multimodal reflection framework and validate the efficacy of the proposed adaptive reflection strategy.

[CV-88] Improving Robotic Imitation Learning via Trajectory Standardization

Link: https://arxiv.org/abs/2606.22907
Authors: Licheng Yang,Lingfeng Qian,Fei Zheng,Yonghao He,Wei Sui,Shuangshuang Li,Hu Su
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Imitation learning for robotic manipulation relies on large sets of human demonstration trajectories, which are often noisy and temporally irregular due to variable operator speed, intermittent pauses, and inconsistent action density. A common preprocessing strategy is time-uniform downsampling to shorten sequences, but it cannot effectively remove speed-induced non-uniformity or redundant pauses. This mismatch degrades data quality and hinders policy learning. To address this issue, we propose Information-Standardized Trajectory Resampling (ISR), an offline preprocessing method for effective imitation learning. ISR resamples each trajectory by enforcing approximately equal information distance between adjacent points. Specifically, we map trajectories onto an information-modulated Riemannian manifold and perform geodesic-equidistant parameterization. We construct an information-intensity field from velocity and acceleration norms: the velocity term removes small-motion redundancy, while the acceleration term preserves high-curvature and fine-manipulation phases. We evaluate ISR on three real-world manipulation tasks with mainstream imitation learning policies. Compared with the baseline time-uniform 3x downsampling, ISR improves task success rates by about 25%, remains robust across datasets collected from different operators, and reduces both dataset size and training cost. The code and videos are publicly available at this https URL.
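
To make the resampling idea concrete, the sketch below picks samples at equal steps of a cumulative "information" measure built from velocity and acceleration norms. It is a deliberately simplified, discrete stand-in: the paper describes a Riemannian-manifold formulation with geodesic-equidistant parameterization, which is not reproduced here. The weights `alpha` and `beta`, the finite-difference estimates, and the nearest-sample selection are all assumptions for illustration.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def information_resample(points, n_out, alpha=1.0, beta=1.0):
    """Pick n_out samples spaced equally in cumulative information, not in time."""
    if n_out < 2 or len(points) < 2:
        raise ValueError("need at least two input points and two output points")
    dim = len(points[0])
    # Finite-difference velocity for each segment between consecutive samples.
    vel = [[b - a for a, b in zip(p, q)] for p, q in zip(points, points[1:])]
    # Finite-difference acceleration; the first segment has no predecessor.
    acc = [[0.0] * dim] + [[b - a for a, b in zip(u, w)] for u, w in zip(vel, vel[1:])]
    # Per-segment intensity: pauses contribute nothing, sharp changes contribute more.
    intensity = [alpha * _norm(v) + beta * _norm(a) for v, a in zip(vel, acc)]
    cumulative = [0.0]
    for w in intensity:
        cumulative.append(cumulative[-1] + w)
    total = cumulative[-1]
    picked, i = [], 0
    for k in range(n_out):
        target = total * k / (n_out - 1)
        # Advance to the first sample whose cumulative information reaches the target.
        while i < len(cumulative) - 1 and cumulative[i] < target:
            i += 1
        picked.append(points[i])
    return picked


# A demonstration that pauses for ten samples, then moves at constant speed along x.
demo = [(0.0, 0.0)] * 10 + [(float(x), 0.0) for x in range(1, 11)]
resampled = information_resample(demo, 5)
time_uniform = demo[::5]
```

On this toy trajectory, time-uniform downsampling spends half of its samples on the pause, whereas the information-based selection keeps a single paused sample and spreads the rest over the actual motion.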

[CV-89] InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

Link: https://arxiv.org/abs/2606.22905
Authors: Quanyue Song,Yishan He,Yanfei Zhang,Shihao Cheng,Zhixiang He,Zhizhi Guo,Chi Zhang,Xuelong Li,Caigui Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and fail to explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, InteractiveAvatar achieves real-time streaming generation of human avatars over arbitrarily long durations. For visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars with speeches and actions aligned with user intent, we propose a Reasoning-Reaction Module (RRM), which incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results over diverse scenarios demonstrate that our method achieves state-of-the-art visual consistency in long-duration generation, while enabling complex user-avatar interaction in real time.

[CV-90] PHOEBI: An Open-World Benchmark for Bacterial Identification in Phase-Contrast Microscopy

Link: https://arxiv.org/abs/2606.22890
Authors: Aaditya Baranwal,Md Jahid Hasan,Shruti Vyas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical microscopy enables rapid, label-free imaging of live bacteria and is the standard instrument for species identification across clinical, environmental, and industrial microbiology. Yet field samples are routinely polymicrobial and may contain organisms that were never seen during system training, and no computer-vision benchmark tests multi-label species identification from phase-contrast microscopy (PCM) of such mixtures. We introduce Phase-contrast Optical bEnchmark for Bacterial Identification (PHOEBI), a wet-lab-prepared dataset of 120,000 PCM images covering 40 combinations of six rod-shaped species, paired with a leave-combinations-out (LCO) evaluation protocol that holds out entire species combinations to mirror the practical scenario of a model trained on catalogued mixtures that must generalise to unseen ones. On LCO, every gradient-trained per-image aggregator we test drops 0.39 to 0.57 F1 from the in-distribution to the held-out split, a systematic open-world recognition failure in the aggregator, not the visual representation. A linear probe of thirteen different encoders over the same features spreads only about six percentage points of F1 across general-purpose and biomedical pretraining objectives, confirming the representation is sound. We propose three lightweight anchor-based decoders that capture per-species presence geometrically over a shared frozen tile-feature pool, scoring higher on held-out combinations than on in-distribution validation.

[CV-91] Full-Body Golf Swing Kinematic Reconstruction From a Smartwatch IMU

Link: https://arxiv.org/abs/2606.22876
Authors: Yuanshuo Tan,Kezhe Zhu,Xiujie Sun,Chunping Liang,Shuoyang Zhu,Chenquan Xu,Licheng Zhong,Huiming Pan,Yinri Jin,Chang Liu,Bo Xiao,Shenglong Le,Bryndan W. Lindsey,Peter B. Shull
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Quantitative measurement of the golf swing is critical for evaluating technique and enabling individualized feedback. However, existing methods are impractical to use on the golf course: optical motion capture is laboratory-bound, camera-based methods require impractical camera placement, and multi-sensor inertial measurement unit (IMU) systems require multi-segment setup and calibration. We thus propose a single wrist-worn IMU approach for estimating full-body joint angles during golf swings. The proposed Wrist-IMU Temporal Kinematic Network (WIT-KinNet) leverages modality-specific IMU embeddings and temporal kinematic encoding to learn wrist-to-body motion dependencies and estimate full-body joint angles during golf swings. Thirty-six golfers, spanning beginner and skilled players, performed full, half, and quarter swings using seven club types: driver, 3-wood, 5-hybrid, 5-iron, 7-iron, 9-iron, and sand wedge. The proposed WIT-KinNet was evaluated under subject-wise cross-validation using synchronized smartwatch IMU data and ground-truth kinematics derived from an optical motion capture system. The proposed approach achieved a mean absolute error of 8.11 ± 1.84° across full-body joint angles. High temporal correlation was observed for pelvic rotation and upper torso rotation (r = 0.98 and 0.97, respectively), with X-factor and S-factor also showing strong correlation (r = 0.96 and 0.96). Linear mixed-effects models of the error revealed that swing amplitude, skill level, and club type all significantly affected measurement differences (p < 0.05). The results establish the first single wrist-worn IMU approach for estimating full-body golf swing kinematics, enabling practical swing analysis during real gameplay.

[CV-92] FedOT: Ownership Verification and Leakage Tracing via Watermarks for Federated LDMs ECCV2026

Link: https://arxiv.org/abs/2606.22875
Authors: Wenlong Cheng,Yuan Gan,Yunqiu Xu,Jiaxu Miao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Abstract:Training Latent Diffusion Models (LDMs) within Federated Learning (FL) has attracted increasing attention due to its ability to combine the powerful generative capacity of LDMs with the privacy-preserving properties of FL. However, FL requires sharing the global model with multiple participants, which risks unauthorized model distribution or resale by malicious clients. While an intuitive approach is to adopt existing VAE-based watermarking techniques for LDMs in FL, this strategy falls short in addressing such threats due to two fundamental challenges: (1) Existing methods support ownership verification but lack the ability to trace model leakage to a specific malicious client; (2) VAE-based watermarks are vulnerable, as they can be removed simply by replacing the decoder with a clean counterpart. In this paper, we propose FedOT, the first framework for ownership verification and leakage tracing in federated LDMs. Specifically, to address the first challenge, we design a chunked watermark, where the first part is for ownership verification, and the second part is used for client identification. Furthermore, to overcome the second challenge and secure the model against VAE replacement attack, we introduce Latent Vector Transformation (LVT), which strengthens the connection between the VAE and U-Net latent spaces by modifying the original latent distribution of the VAE. Consequently, any attempt to replace the VAE for watermark removal leads to significant image quality degradation, making the LDM model unusable. Extensive experiments demonstrate that FedOT achieves superior performance in both ownership verification and traceability. Project page: this https URL.

[CV-93] Fursee: Hybrid YOLO-DINOv3 Framework for Fursuit Identity Retrieval and Clustering

Link: https://arxiv.org/abs/2606.22872
Authors: Jundi Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Global furry conventions produce massive numbers of fursuit photographs, while manual sorting brings heavy labor costs and calls for automatic identity retrieval and clustering solutions. General multimodal models lack dedicated optimization for complex fursuit scenes, and no public benchmark dataset exists for this task. To fill this gap, we build a specialized fursuit image dataset and present a three-stage hybrid pipeline, Fursee, for fursuit identity retrieval and clustering. First, YOLO detects and crops high-resolution fursuit head patches to improve localization of small and overlapping targets. Second, ArcFace optimizes DINOv3 embeddings to enlarge angular separation between different identities on the feature hypersphere. Third, DBSCAN performs unsupervised clustering, with a silhouette-coefficient-driven search automatically selecting optimal hyperparameters rather than relying on a fixed, manually chosen radius. Retrieval and clustering experiments verify that our pipeline outperforms mainstream multimodal models including GPT5.5, Claude Opus 4.8 and Qwen3.7-Plus on all evaluation metrics, achieving competitive performance for fursuit head retrieval and grouping.

[CV-94] VideoLatent: Video-Language Learning via Latent Self-Forcing

Link: https://arxiv.org/abs/2606.22870
Authors: Zi-Yuan Hu,Zicong Tang,Shijia Huang,Yanyang Li,Michael R. Lyu,Liwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in chain-of-thought (CoT) reasoning have shown promise in enhancing video understanding and reasoning capabilities of multimodal large language models (MLLMs). However, existing CoT-based MLLMs require labor-intensive CoT annotations and incur substantial training and inference overhead. While visual latent reasoning has emerged as a more efficient alternative, existing methods primarily focus on image tasks and heavily rely on additional supervision signals for visual latent generation (e.g., CoT traces, auxiliary images, or fine-grained annotations), limiting their scalability and transferability to video tasks. To bridge this gap, we introduce VideoLatent, a novel MLLM equipped with a latent injection module tailored for video understanding and reasoning. Specifically, VideoLatent learns to perform visual latent reasoning using a new latent self-forcing training paradigm, which comprises latent alignment and latent diversity objectives, and relies solely on standard video-question-answer triplets. Extensive experiments across 14 benchmarks demonstrate that our model consistently outperforms existing standard and latent MLLMs on general video understanding and complex video reasoning. Compared with Video-R1, our VideoLatent achieves superior computational efficiency, reducing training/inference overhead by ~6× / ~68×. Moreover, experiments demonstrate that our method has strong generalizability to different MLLM backbones and different model scales.

[CV-95] Chains That See Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME SIGIR2026

Link: https://arxiv.org/abs/2606.22862
Authors: Zhichao Fan,Yanhang Li,Zexin Zhuang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. To appear at The 2nd Workshop on Evaluation for Multimodal Generation @ SIGIR 2026 (EvalMG '26)

Abstract:Forced chain-of-thought (CoT) is widely assumed to make vision-language models more reliable on video question answering. We propose a small three-probe evaluation recipe to test that assumption: paired accuracy across direct, CoT, answer-first, and no-video conditions; a counterfactual video-swap diagnostic over the CoT chains; and a four-rung visual-degradation ladder. Each probe is reported under both a strict and a permissive regex scorer, with multiplicity correction over a manuscript-declared primary family. Applied to Qwen2.5-VL on Video-MME subsets, the recipe returns a two-part finding. The CoT chains are strongly video-conditioned: swapping the input video collapses chain overlap and flips most final letters, the opposite of what a “boilerplate-chain” null would predict. Yet on the same data, forced CoT does not improve MCQ accuracy, and on the smaller 7B model it produces a small but statistically supported drop under a post-hoc primary scorer choice. We do not claim this generalizes beyond the Qwen2.5-VL / Video-MME instantiation; the raw responses and a single recomputation script will be released with the supplementary material so every number can be re-derived.

[CV-96] G-MASt3R-SfM: Graph-based View Pruning and Multi-stage Optimization for Robust SfM ICIP2026

Link: https://arxiv.org/abs/2606.22856
Authors: Toshiki Watanabe,Shintaro Ito,Natsuki Takama,Koichi Ito,Takafumi Aoki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICIP 2026

Abstract:Structure from Motion (SfM) is essential for multi-view 3D reconstruction; however, its accuracy heavily relies on the accuracy of image matching. While the recent correspondence matching method, MASt3R, enables robust matching even under challenging conditions, it tends to generate incorrect correspondences for non-overlapping image pairs. Consequently, existing SfM methods using MASt3R, such as MASt3R-SfM, suffer from significant degradation in pose estimation accuracy as they incorporate these unreliable matches directly into optimization. To address this issue, we propose G-MASt3R-SfM, a novel SfM pipeline that enhances robustness through two key modules. First, the Graph-based View Pruning (GVP) module constructs a scene graph from matching confidence and geometrically prunes outlier views. Second, the Multi-Stage Optimization (MSO) module progressively refines camera parameters by expanding the optimization scope from local consistency to global consistency. Experiments on the ETH3D dataset demonstrate that our method achieves state-of-the-art accuracy in both camera pose estimation and 3D reconstruction, effectively suppressing noise caused by outliers.

[CV-97] OrthoMotion: Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

Link: https://arxiv.org/abs/2606.22835
Authors: Zijie Meng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by SCA 2026 (poster)

Abstract:Controllable video generation demands independent command of the camera and the subject, yet 2D conditioning entangles them: camera- and object-induced optical flow share the same inverse-depth (1/Z) scaling and cannot be separated from image evidence alone. We first prove that this entanglement is representational, not architectural – the 2D camera/object split is a non-identifiable inverse problem – and therefore reframe decoupling as a question of operator design. We resolve it at the level of the attention operator. OrthoMotion routes camera motion into a geometric channel, a norm-preserving rotation of the rotary position embedding (RoPE) phase, and subject motion into a semantic channel, a gated value injection in cross-attention. Because these sub-operators are algebraically complementary – a rotation versus a translation of the affine action on tokens – a lightweight decoupling regularizer provably drives their response subspaces to orthogonality, so the two controls stop interfering. To our knowledge OrthoMotion is the first method to guarantee disentanglement by construction rather than hope for it to emerge. It attains state-of-the-art camera and subject accuracy at once while minimizing cross-talk, which we quantify with a new Cross-Talk Error (CTE) metric, cutting cross-talk by more than 2.4x with no loss in fidelity and generalizing across backbones.
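
The "norm-preserving rotation of the RoPE phase" mentioned in the abstract builds on a property of standard rotary position embeddings that is easy to check numerically. The sketch below implements plain RoPE on a small vector and is not OrthoMotion's operator: how the method derives a phase from camera motion is not stated in the abstract, so the scalar `position` here is only a stand-in.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # standard RoPE frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Rotation leaves the vector norm unchanged.
norm_before = math.sqrt(dot(q, q))
norm_after = math.sqrt(dot(rope_rotate(q, 7.0), rope_rotate(q, 7.0)))

# Attention scores depend only on the position offset (2 in both cases).
score_a = dot(rope_rotate(q, 3.0), rope_rotate(k, 1.0))
score_b = dot(rope_rotate(q, 5.0), rope_rotate(k, 3.0))
```

Because each pair is only rotated, token magnitudes are untouched, and the attention score changes solely through relative phase. That is what makes the phase a natural place to inject a purely geometric signal.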

[CV-98] Homographic Navigation: Geometry-Driven Camera Guidance for Deterministic Planar Capture

Link: https://arxiv.org/abs/2606.22834
Authors: Dominik Kroupa,Marek Vaško,Muh Yuzril Ihza Baharuddin,Adam Herout
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present homographic navigation, a geometry-centric framework for guiding camera acquisition toward precise capture of planar regions. Rather than treating homography as an output, we use it as an organizing variable that unifies learning, alignment, and evaluation. From a single annotated reference image, we generate unlimited synthetic training data via homographic augmentation and train a single-shot model for joint recognition and localization of multiple artifacts (physical objects with a rectangular planar target) through sparse keypoint prediction. To address precision under limited model input resolution, we introduce a two-pass inference scheme with global detection followed by localized refinement, and a Stable Warp training strategy that significantly improves accuracy, particularly in the high-precision regime. The model also predicts a confidence estimate for each predicted keypoint and for the sample as a whole. Experimental results demonstrate that accurate planar alignment can be achieved from minimal supervision, providing a foundation for geometry-driven camera guidance and future learning from in-the-wild video data.
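Homographic augmentation moves the annotated keypoints together with the image, so one labelled reference can yield many training samples. A minimal sketch of the keypoint part, assuming a hand-picked 3x3 matrix `H` (hypothetical values, not taken from the paper):

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to return to image coordinates

# Corners of a unit-square planar target in the reference image.
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Hypothetical perspective warp used as one augmentation sample.
H = [[1.0, 0.2, 0.1],
     [0.0, 1.1, 0.05],
     [0.1, 0.0, 1.0]]

warped = [apply_homography(H, p) for p in corners]
print(warped)
```

Sampling many such matrices and warping the image with the same `H` would give image and keypoint pairs that stay consistent by construction.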

[CV-99] DBT-Bleed: Dual-Branch Temporal Modeling with Key-Frame Selection for Surgical Bleeding Detection

Link: https://arxiv.org/abs/2606.22829
Authors: Sudhanshu Mishra,Jialang Xu,Jensen Ang,Evangelos B. Mazomenos,Beng Ti Ang,Yueming Jin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 3 tables

Abstract:Detection of Intraoperative Adverse Events (IAEs) is critical for improving surgical safety, with bleeding being among the most frequent events across many surgery types. Existing methods struggle to distinguish bleeding IAEs from visually similar residual blood due to limited temporal reasoning. Moreover, modeling long surgical videos while preserving fine-grained temporal dynamics remains computationally challenging. We propose DBT-Bleed, a dual-branch multi-scale temporal modeling framework disentangling bleeding and normal representations using layer-wise temporal adapters for short- and long-term bleeding progression. To efficiently process long surgical videos without sacrificing fine-grained temporal information, we introduce HiRED, a Hierarchical Entropy-Driven frame selection strategy that retains temporally informative segments while removing redundancy. Experiments on the MultiBypass dataset demonstrate gains of 6.53% in F1, 5.62% in Recall and 9% in MCC for bleeding IAE detection, consistently outperforming video-level baselines. Additionally, we evaluate cross-procedure generalization on a newly curated dataset from a different surgical procedure type, where DBT-Bleed demonstrates robust transferability by achieving gains of 6% in F1 and 8% in MCC in the zero-shot setting. To support this evaluation, we introduce EndoPit-IAE, an Endonasal Pituitary Surgery dataset annotated for IAEs, representing the first IAE-annotated dataset in neurosurgery. Code will be made publicly available upon acceptance.

[CV-100] Policy-as-Data: Learning Generalizable HOI Diffusion Models from Simulated Physics

Link: https://arxiv.org/abs/2606.22806
Authors: Shujia Li,Jianshu Hu,Haiyu Zhang,Yunpeng Jiang,Haoyuan Jin,Xinyuan Chen,Yaohui Wang,Yutong Ban
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthesizing realistic Human-Object Interactions (HOI) is critical for creating embodied avatars and functional virtual environments. However, current data-driven approaches primarily rely on motion capture datasets, which are expensive to scale and limited in functional diversity. Models trained with these datasets fail to generalize to unseen objects and maintain physical consistency over long horizons. In this paper, we propose a novel framework that leverages a physics simulator to overcome the data-scarcity bottleneck in HOI generation. Specifically, we propose a scalable pipeline, called Policy-as-Data, which leverages policies trained with reinforcement learning in a physics simulator for task-oriented data generation and trains a generative model on the augmented dataset for generalizable HOI generation. To seamlessly utilize the synthetic data, we introduce a coarse-to-fine retargeting process that bridges the representation gap between the simplified model used in the physics simulator and the standard parametric body models required for generative training. Validated through comprehensive experiments, our method demonstrates enhanced generalization to unseen objects and the capability of long-horizon generation, while exhibiting greater dynamic diversity and physical plausibility.

[CV-101] CoVStream: Edge-Cloud Collaboration for Understanding of Long Video Streams

Link: https://arxiv.org/abs/2606.22804
Authors: Xu Liu,Guikun Chen,Zihao Yan,Kanzhi Wu,Wenguan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:Long, continuous video streams are an increasingly critical driver of multimedia intelligence. Existing efforts often handle long videos with a sample-encode-reason approach using large models. However, they overlook a crucial deployment fact: the stream is often produced by computationally constrained devices. This forces an untenable compromise: cloud offloading unlocks strong reasoning but incurs prohibitive bandwidth overhead, while on-device processing remains limited by edge hardware capacity. Therefore, we propose CoVStream, the first edge-cloud collaborative framework for understanding long video streams. The edge node distills raw video streams into compact visual features and semantic captions for transmission to the cloud, minimizing bandwidth costs, while the cloud server integrates this data into an entity graph and global visual context, activating the heavy reasoning model only when a user query arrives. Experiments on VideoMME-Long, LVBench, and RTV-Bench show that CoVStream reduces bandwidth usage by 87.6% while retaining 99.2% of the cloud baseline accuracy on LVBench.

[CV-102] Learning Adaptive Dynamical Features via Multi-τ Liquid-Mamba for All-in-one Image Restoration

Link: https://arxiv.org/abs/2606.22801
Authors: Hu Gao,Changshuo Wang,Yulong Chen,Lizhuang Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image restoration aims to recover high-quality images from degraded observations. Recent Mamba-based image restoration models have demonstrated strong potential in modeling long-range dependencies with linear complexity. However, most existing designs still rely on a single state-evolution timescale, which limits their adaptability to spatially heterogeneous and task-dependent degradation patterns in all-in-one image restoration. In this paper, we propose Multi-τ Liquid-Mamba, an adaptive state space module that introduces input-conditioned multi-timescale liquid discretization into selective state space modeling. Instead of changing the overall selective scan pipeline, the proposed module modulates the effective discretization steps of multiple dynamical branches and adaptively fuses their responses according to degradation-aware gating weights. This design allows the model to capture both fast-varying local details and slowly evolving global structures while preserving the linear scaling property of Mamba with respect to sequence length. Importantly, Multi-τ Liquid-Mamba modulates the effective transition dynamics while preserving the original selective parameterization and hardware-efficient selective scan mechanism, making it a plug-and-play module that can be seamlessly integrated into existing Mamba-based architectures. Built upon this framework, we develop a Multi-τ Liquid-Mamba Image Restoration Network (MLMIR) for all-in-one image restoration. Extensive experiments on a wide range of restoration benchmarks demonstrate that MLMIR consistently achieves state-of-the-art performance in all-in-one image restoration while remaining highly competitive in task-aligned restoration settings.

[CV-103] Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction

Link: https://arxiv.org/abs/2606.22787
Authors: Tianbo Pan,Xingyi Yang,Shizun Wang,Xinchao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Current end-to-end multi-view 3D reconstruction methods achieve impressive results, but rely on a restrictive static assumption: the scene is entirely distractor-free with perfect cross-view geometry. This reliance on idealized inputs causes even the most advanced methods to fail in real-world settings, where transient distractors and occlusions are present. To address this, we propose Visual Geometry Transformer in the Wild (VGTW), an end-to-end framework for robust reconstruction from inconsistent views. At its core, we isolate and suppress distractor-affected regions while preserving the consistent components across views. Specifically, we introduce a Distractor-aware Training (DAT) strategy that separates clean features from distractor-contaminated ones in the attention mechanism while enforcing feature consistency across images. To enable this, we train the model with an auxiliary mask prediction head, using supervision from a new dataset we collected with pixel-level distractor masks. The resulting VGTW model is a feed-forward network that directly outputs clean, distractor-free point clouds. Remarkably, it requires no additional 3D supervision, remains computationally efficient, and is compatible with existing pipelines. Extensive experiments validate our approach, demonstrating state-of-the-art performance and robust generalization in diverse, real-world scenarios.

[CV-104] DE-FIVE: Detecting Malicious Image Prompts via Fourier Features and Image Vector Embeddings

Link: https://arxiv.org/abs/2606.22779
Authors: Xingwei Zhong,Varun Sharma,Kar Wai Fok,Vrizlynn L. L. Thing
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision language models (VLMs) employ both visual and textual modalities to enable advanced vision-language inference. However, incorporating visual modalities expands the attack surface of VLMs, making them more susceptible to security threats such as adversarial perturbations and indirect prompt injection, wherein crafted malicious image prompts can elicit unintended model outputs. Existing defense methods against malicious image prompts remain insufficient as they typically demand extensive datasets for retraining or the deployment of additional, complex classifiers. Most critically, there is a profound lack of specialized defense mechanisms specifically targeting indirect prompt injections, a gap that serves as a primary motivation for this work. To address these limitations, we introduce DE-FIVE, a novel training-free framework for detecting malicious image prompts by leveraging Fourier features and the hidden state representations of the visual encoder (image vector embeddings) across perturbations. Specifically, we develop a hybrid detection strategy consisting of a black-box detector that operates on Fourier-domain features and a white-box detector that exploits image vector embeddings derived from only a few-shot malicious set. Extensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art baselines against malicious image prompts.

[CV-105] LoCC: Detection and Localization of Lip-Syncing Deepfakes via Counterfactual Frame Consistency ICME

Link: https://arxiv.org/abs/2606.22772
Authors: Soumyya Kanti Datta,Shan Jia,Siwei Lyu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the IEEE International Conference on Multimedia and Expo (ICME) 2026

Abstract:Lip-syncing deepfakes are among the most challenging forms of manipulated media because their artifacts are localized almost exclusively to the mouth region and evolve dynamically over time. Detecting such deepfakes requires precise temporal and spatial modeling of lip motion. In this paper, we propose LoCC, a novel detection framework that performs fine-grained detection and localization of lip-syncing deepfakes at both segment and frame levels. Unlike prior approaches that analyze videos holistically, our method evaluates whether each frame aligns with a counterfactual estimate generated from its temporal neighbors. Real videos exhibit strong and stable consistency, whereas lip-sync deepfakes introduce localized inconsistencies. Following a teacher-student learning paradigm, our model effectively captures these frame-level discrepancies and achieves superior performance over state-of-the-art methods on multiple benchmark lip-syncing deepfake datasets, including LAV-DF, AVDF1M, FakeAVCeleb, and KODF, and generalizes well across compression levels and datasets.

[CV-106] READ More than What You See: Reinforcement Learning for Accurate and Coherent Audio Description Generations

Link: https://arxiv.org/abs/2606.22766
Authors: Bo Fang,Xinyao Zhang,Yuxin Song,Hui Zhang,Hang Zhou,Antoni B. Chan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Audio Description aims to generate concise narrations of essential visual content in audio-visual media for blind and low-vision audiences. Existing methods either rely on prompting off-the-shelf multimodal models, which often mismatch AD style, or partially optimize training-based systems with next-token prediction, which under-explores model capacity and biases generation toward generic expressions. We present READ, the first reinforcement-learning (RL) framework for training-based AD generation. READ formulates AD as sequence-level optimization with reference-matching, length, and format rewards, and further introduces a dedicated coherence reward under context-aware supervision to promote narratively coherent descriptions. Experiments on MAD-Eval, CMD-AD, and TV-AD show that READ substantially outperforms prior methods across diverse evaluation metrics. Our results highlight RL as a promising paradigm for accurate and coherent AD generation. Our code, models, and benchmark results will be publicly available.

[CV-107] RaysUp: Ultra-light Universal Feature Upsampling via Geometry-Aware Ray Representation ECCV2026

Link: https://arxiv.org/abs/2606.22749
Authors: Yuchuan Ding,Linfei Li,Lin Zhang,Ying Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:Pre-trained Vision Foundation Models (VFMs) have become central to modern computer vision due to their powerful semantic representations and strong generalization ability. However, their patchified or pooled outputs are inherently low-resolution, limiting their effectiveness in tasks requiring fine-grained, pixel-level reasoning. Existing feature upsampling approaches either degrade semantic fidelity or rely on VFM-specific retraining and heavy architectures, hindering efficiency and scalability. To address these challenges, we propose RaysUp, an ultra-lightweight, task-agnostic, and VFM-agnostic feature upsampling framework that reconstructs high-resolution feature maps at arbitrary resolutions. Unlike conventional 2D interpolation or attention-based schemes, RaysUp lifts feature reconstruction into a geometry-aware ray domain. Specifically, we introduce a Spatially Decoupled Guidance Encoder for direction-aware guidance encoding, an Any-Resolution Cross-Attention mechanism for resolution-flexible reconstruction, and a novel Ray Positional Encoding (RayPE) that injects implicit 3D geometric priors via 6D Plücker ray coordinates. Finally, a Geometry-Aware Neighborhood Attention module further ensures content-adaptive bilateral aggregation while preserving geometric consistency. Extensive experiments across diverse dense prediction tasks demonstrate that RaysUp achieves state-of-the-art performance while using only 16% of the parameters of AnyUp and delivering approximately 7x faster inference. These results highlight a substantially improved accuracy-efficiency trade-off and establish RaysUp as a practical and scalable solution for universal feature upsampling. Code is available at this https URL.

[CV-108] Interpretable Uncertainty Routing Separating Emotion Ambiguity from Distribution Shift in Facial Expression Recognition

Link: https://arxiv.org/abs/2606.22725
Authors: Keito Inoshita,Takato Ueno
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Facial expression recognition (FER) is inherently ambiguous: human annotators frequently disagree, and models deployed in real environments face distribution shift. Crucially, these two conditions demand different downstream actions, as ambiguous in-distribution faces should be reported with their ambiguity whereas out-of-distribution inputs should be rejected. However, a single uncertainty score conflates the two. In this study, uncertainty decomposition into aleatoric and epistemic components for FER is investigated, and Uncertainty-Aware Routing (UAR), an inference-time routing mechanism that exploits the separation, is introduced. Specifically, aleatoric and epistemic uncertainties are obtained from a Deep Ensemble of fully fine-tuned DINOv2 models and are each validated against an independent external signal: aleatoric against human annotator disagreement, and epistemic against distribution shift induced by image corruptions. The proposed dual-validation protocol reveals that aleatoric recovers annotator disagreement with Spearman correlation 0.66 (95% CI: 0.64-0.68), and epistemic detects corruption-induced shifts, achieving average AUROC of 0.699 at the highest corruption severity. UAR retains approximately 1.8 times more ambiguous in-distribution faces than single-uncertainty routing at a matched out-of-distribution rejection rate. A strong label-distribution-learning baseline achieves comparable disagreement recovery but cannot separate ambiguity from shift and therefore cannot route, establishing that the value of decomposition lies in the separation enabling interpretable and differentiated action selection.
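The abstract does not state which decomposition the authors use. A common choice for deep ensembles, assumed here, splits the entropy of the averaged prediction (total) into the mean per-member entropy (aleatoric) and the remainder (epistemic, i.e. member disagreement). The `route` function and its thresholds are hypothetical and only mirror the two actions the abstract names.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def decompose(member_probs):
    """Split ensemble uncertainty into (total, aleatoric, epistemic)."""
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / n for c in range(k)]
    total = entropy(mean)                                  # entropy of the averaged prediction
    aleatoric = sum(entropy(m) for m in member_probs) / n  # average member entropy
    epistemic = total - aleatoric                          # disagreement between members
    return total, aleatoric, epistemic

def route(member_probs, epistemic_max=0.3, aleatoric_max=0.3):
    """Hypothetical routing rule with illustrative thresholds."""
    _, aleatoric, epistemic = decompose(member_probs)
    if epistemic > epistemic_max:
        return "reject"            # likely out-of-distribution
    if aleatoric > aleatoric_max:
        return "report_ambiguity"  # in-distribution but ambiguous
    return "accept"

ambiguous = [[0.5, 0.5], [0.5, 0.5]]    # members agree that the face is ambiguous
disagreeing = [[1.0, 0.0], [0.0, 1.0]]  # members are confident but contradict each other
print(route(ambiguous), route(disagreeing))
```

Both toy inputs have the same total entropy, which is why a single uncertainty score cannot tell them apart while the two components can.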

[CV-109] Generative Relightable Avatars

Link: https://arxiv.org/abs/2606.22718
Authors: Kunwar Maheep Singh,Christian Theobalt,Rishabh Dabral
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We present Generative Relightable Avatars (GRA), a person-specific method for photorealistic free-view rendering and environment-map relighting of full-body humans. We postulate that modeling fine-grained appearance details is inherently a one-to-many problem that can benefit from a generative formulation. In contrast to fully regressive relightable avatar methods, GRA follows a hybrid approach that combines controllable, physics-grounded relighting with probabilistic refinement. Starting from a tracked animated mesh, we optimize material parameters in UV-space and render a coarse relit appearance under a target HDR environment map. Next, we refine the textures with a feed-forward model to capture pose-dependent texture dynamics and illumination effects beyond simplified reflectance assumptions. Finally, a fine-tuned video-to-video diffusion model transforms the physically grounded renderings into temporally coherent, high-detail videos while preserving 3D control, with an error-recycling strategy for generating long videos. Experimental evaluations demonstrate our method’s improved perceptual quality over prior relightable avatar baselines. Project Page: this https URL

[CV-110] Modular Diffusion Models for Structured Visual Recognition

Link: https://arxiv.org/abs/2606.22702
Authors: Siddhesh Khandelwal,Björn Ommer,Leonid Sigal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 7 figures, 4 tables

Abstract:Traditional supervised methods for structured visual recognition tasks – such as object detection, segmentation, and scene graph generation – often produce deterministic, fixed outputs, limiting their ability to capture the inherent uncertainty in complex visual scenes. As a consequence, such point estimates are unable to capture the prediction uncertainty (or multimodality) intrinsic to these problems, often arising from natural ambiguities (e.g., ambiguity in the size of partially occluded objects, local ambiguity of the exact segmentation boundary, etc.) as well as noise and sparsity of training data. To address this limitation, we present Modular Diffusion Models (MDMs), a simple and novel framework that learns a distribution over structured outputs for a given input image. MDMs decompose the diffusion process into distinct, task-specific modules, each focused on capturing a different aspect of the structured information space, such as object categories, spatial locations, and inter-object relationships. This modular design allows each component to be learned independently, with seamless integration at inference without additional training. Furthermore, the modularity of MDMs enables the diffusion process to easily operate over the heterogeneous output space common in many structured learning tasks (e.g., continuous bounding boxes and discrete class labels). Experimental results over three distinct structured tasks – object detection, instance segmentation, and scene graph generation – highlight the benefits of our proposed framework.

[CV-111] SCRUB-FL: Sanitizing and Cleansing Representations via Unlearning of Backdoors

Link: https://arxiv.org/abs/2606.22700
Authors: Osama Wehbi,Sarhad Arisdakessian,Omar Abdel Wahab,Azzam Mourad,Hadi Otrok
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 3 tables, 1 algorithm, 4 figures

Abstract:Federated Learning (FL) enables collaborative model training without sharing raw data, making it a promising paradigm for privacy-sensitive applications. However, its decentralized nature makes it inherently vulnerable to backdoor attacks, where malicious clients embed hidden triggers into local training data to manipulate model predictions. Existing defenses mainly operate before and during aggregation and cannot fully eliminate backdoor behaviors that persist in the converged global model. Moreover, the effectiveness of post-training sanitization is often limited by the server’s lack of knowledge of trigger patterns or poisoned clients after convergence, resulting in residual backdoor behaviors or accuracy degradation due to neuron entanglement. To address this limitation, we propose SCRUB-FL (Sanitizing and Cleansing Representations via Unlearning of Backdoors), a two-phase solution for post-training backdoor removal in FL. During training, clients identify suspicious samples using spectral analysis and activation clustering, then train lightweight Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) models to capture trigger-related distributions. The generator parameters are aggregated server-side to construct a global representation of suspicious patterns without exposing raw data. After convergence, the server synthesizes trigger-approximating samples and applies machine unlearning to erase the trigger-target association by redistributing predictions toward a uniform distribution. Experimental evaluations on CIFAR-10 and GTSRB across three attack types and up to 40% malicious participation demonstrate that SCRUB-FL reduces the backdoor attack success rate to as low as 3.88% while maintaining over 91% normal task accuracy, outperforming state-of-the-art defenses without requiring prior trigger knowledge or a large clean proxy dataset at the server.
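One simple way to push predictions on trigger-like samples toward a uniform distribution is to minimize the cross-entropy against a uniform target. The sketch below assumes that objective; the abstract does not give the exact loss used by SCRUB-FL, and `uniform_target_loss` is a hypothetical name.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def uniform_target_loss(logits):
    """Cross-entropy between a uniform target and the model's prediction.

    Its minimum, log(num_classes), is reached when the prediction is uniform.
    """
    probs = softmax(logits)
    k = len(probs)
    return -sum(math.log(p) for p in probs) / k

backdoored = [5.0, 0.0, 0.0]  # trigger sample strongly mapped to the target class
erased = [0.0, 0.0, 0.0]      # prediction carries no preference for any class
print(uniform_target_loss(backdoored), uniform_target_loss(erased))
```

The loss falls as the trigger sample stops favouring the attacker's target class, which is the behaviour the unlearning step is after.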

[CV-112] Catching Lies Without Sending the Video: Privacy-Preserving Multimodal Deception Detection

Link: https://arxiv.org/abs/2606.22699
Authors: Nikita Sharma,Pranav Sara,Karan Singla
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Frontier multimodal models can guess whether a person is lying from a testimony video. To do so, they stream that raw face and voice to a third-party model. We ask whether the heavy media is needed at all. On the Real-life Trial Deception dataset, the Whissle on-device speech and vision stack extracts a compact digest: transcript, emotion, age, gender, intent distributions, a deception intent filter, fluency and rhythm, per-frame facial behaviour, and prosody. Under speaker-independent evaluation, we report three findings. A small classifier on this digest reaches AUC 0.741, matching Gemini 2.5 Pro on full video. Handing the digest to a frontier LLM reaches AUC 0.755 with Claude Opus 4.8 at 7.8x fewer input tokens, with no media leaving the device. The reported 75% accuracy is a speaker-leakage artifact. We release code and experiments.

[CV-113] NullFlow: One-Step Generative Reconstruction

Link: https://arxiv.org/abs/2606.22696
Authors: Xiao Shi,Edward P. Chandler,Chicago Y. Park,Shirin Shoushtari,Ulugbek S. Kamilov
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures. Xiao Shi and Edward P. Chandler contributed equally

Abstract:We propose NullFlow, a principled framework for one-step generative image reconstruction. Our key idea is to confine the generative flow to a measurement-consistent subspace. Because the flow never leaves this subspace, NullFlow needs no separate data-fidelity corrections, unlike existing solvers. NullFlow samples in a single network evaluation by learning the flow’s average velocity, avoiding the step-by-step integration of traditional flow matching methods. We prove that the average velocity of this constrained flow yields a training objective whose global minimizer is a one-step posterior sampler. We show on image inpainting that NullFlow matches state-of-the-art diffusion solvers while cutting inference from hundreds of network evaluations to one.
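For inpainting, "measurement-consistent" has a concrete meaning: the observed pixels are fixed by the measurement, and only the masked pixels are free. The sketch below assumes the standard range/null-space split for a masking operator A, x = Aᵀy + (I − AᵀA)z. It illustrates the constraint only and is not the authors' flow; `z` stands in for whatever a generator proposes.

```python
def measure(image, mask):
    """A x: keep only the observed pixels."""
    return [v for v, keep in zip(image, mask) if keep]

def back_project(y, mask):
    """A^T y: put each measurement back at its observed position, zeros elsewhere."""
    values = iter(y)
    return [next(values) if keep else 0.0 for keep in mask]

def null_project(z, mask):
    """(I - A^T A) z: keep only the unobserved components of z."""
    return [0.0 if keep else v for v, keep in zip(z, mask)]

def reconstruct(y, mask, z):
    """Combine the measured part with the null-space part of a candidate z."""
    return [a + b for a, b in zip(back_project(y, mask), null_project(z, mask))]

mask = [True, False, True, False]  # True marks an observed pixel
x_true = [1.0, 2.0, 3.0, 4.0]
y = measure(x_true, mask)

z = [9.0, 8.0, 7.0, 6.0]           # arbitrary generator output (hypothetical)
x_hat = reconstruct(y, mask, z)
print(x_hat, measure(x_hat, mask) == y)
```

Whatever `z` is, re-measuring the reconstruction returns `y`, so no separate data-fidelity correction is needed in this toy setting.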

[CV-114] SATURN: Symbolic Spatial Reasoning for Multi-Perspective Grounding

Link: https://arxiv.org/abs/2606.22694
Authors: Danial Kamali,Tanawan Premsri,Shreya Rajpal,Amir Zadeh,Chuan Li,Parisa Kordjamshidi
Categories: Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC)
Comments:

Abstract:Vision-Language Models (VLMs) remain unreliable when spatial reasoning requires composing relations whose meanings depend on frames of reference. Existing neuro-symbolic methods make reasoning more explicit, but often depend on brittle geometric procedures and hard decisions over noisy perception. We propose SATURN, a neuro-symbolic framework for perspective-aware compositional spatial reasoning. SATURN reconstructs an approximate 3D scene, derives soft perspective-aware spatial predicates, and composes them with a training-free Pythonic symbolic executor, separating perception from reasoning while preserving uncertainty through multi-hop inference. We also introduce 3D FORCE, a diagnostic benchmark that controls reasoning depth, view, and perspective composition across spatial arrangement grounding (SAG) and referring expression grounding (REF). On 3D FORCE, VLMs and spatially trained models degrade sharply as depth and perspective complexity increase, whereas SATURN remains stable and outperforms strong baselines. On the real-world MindCube benchmark, SATURN achieves 78.57% overall accuracy, outperforming the strongest baseline by 14 pp.

[CV-115] Prompting Diffusion Models for Zero-Shot Instance Segmentation

Link: https://arxiv.org/abs/2606.22660
Authors: Irem Zeynep Alagöz,Nils Morbitzer,Andrea Ramazzina,Nassir Navab,Federico Tombari,Stefano Gasperini
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Several disruptive research directions have recently emerged in computer vision, including foundation models achieving previously unseen zero-shot performance in scene understanding, even interactively, and generative models that synthesize extremely realistic images. The latter have also been shown to be highly effective in scene understanding tasks thanks to their rich priors. However, for promptable segmentation, foundation models struggle with accurately segmenting an object’s region, leading to false positives and over-segmentation. Notably, early attempts that leverage generative priors use prompts only during post-processing, yielding suboptimal segments because the process is agnostic to the user input. In this paper, we target these limitations with Prompt2Seg, a spatial conditioning framework for diffusion-based segmentation. Prompt2Seg augments a frozen diffusion segmentation model with a conditioning branch. Our approach takes spatial prompts, represented as 2D Gaussians or confidence maps, as explicit input signals, training the model to respond directly to user intent. Fine-tuned on a deliberately constrained set of object categories drawn from Hypersim and Virtual KITTI 2, Prompt2Seg generalizes zero-shot to a wide range of unseen object types and visual domains. We evaluate on seven datasets ranging from standard benchmarks to more challenging domains, including paintings, egocentric views, and X-ray data. Furthermore, we demonstrate that Prompt2Seg consistently outperforms the underlying diffusion segmentation backbone across all benchmarks. Our results suggest that the rich priors encoded in generative pretraining, combined with principled spatial conditioning, offer a compelling path toward broadly generalizing interactive segmentation without large-scale mask supervision.

[CV-116] MaRS: Robust Out-of-Distribution Detection via Mahalanobis Residual Scoring MICCAI2026

Link: https://arxiv.org/abs/2606.22649
Authors: Francesco Di Salvo,Sebastian Doerrich,Christian Ledig
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted to MICCAI 2026

Abstract:Foundation models provide highly descriptive representations for medical images, yet their reliability degrades under distribution shifts arising from changes in patients, devices, or acquisition conditions. Reliable out-of-distribution (OOD) detection is therefore essential for safe deployment. Recent post-hoc detectors efficiently exploit frozen embeddings (e.g., kNN), whereas reconstruction-based OOD detection in latent feature space has seen limited adoption due to inconsistent performance. In this work, we show that the limitation of reconstruction-based methods in latent space does not stem from poor reconstruction quality, but from how reconstruction errors are scored. Standard $L_2$ residual norms collapse the anisotropic residual structure, thereby suppressing informative deviations. To address this limitation, we introduce MaRS (Mahalanobis Residual Scoring), a label-free OOD detector that learns an in-distribution manifold using a lightweight autoencoder and measures deviation via a Mahalanobis distance on reconstruction residuals, yielding variance-aware OOD scores. Across three imaging modalities, multiple types of distribution shift, and different model families and scales, MaRS outperforms established confidence-, distance-, and reconstruction-based baselines, while remaining fully post-hoc and lightweight. The code is available at this https URL.
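A toy 2D case shows why an $L_2$ norm can hide what a Mahalanobis score reveals. The residuals below are hand-made stand-ins for autoencoder reconstruction errors (the autoencoder itself is omitted), and the closed-form 2x2 inverse keeps the sketch dependency-free. All names and values are hypothetical.

```python
import math

def fit_residual_stats(residuals):
    """Mean and sample covariance (sxx, sxy, syy) of 2D in-distribution residuals."""
    n = len(residuals)
    mx = sum(r[0] for r in residuals) / n
    my = sum(r[1] for r in residuals) / n
    sxx = sum((r[0] - mx) ** 2 for r in residuals) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in residuals) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in residuals) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(residual, mean, cov):
    """sqrt(d^T S^-1 d), using the closed-form inverse of a 2x2 symmetric matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = residual[0] - mean[0], residual[1] - mean[1]
    return math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)

def l2(residual):
    return math.hypot(residual[0], residual[1])

# In-distribution residuals: large spread along x, almost none along y.
train = [(-2.0, 0.1), (2.0, -0.1), (-1.0, -0.1), (1.0, 0.1)]
mean, cov = fit_residual_stats(train)

along_usual_axis = (2.0, 0.0)    # deviation in a direction that is normally noisy
along_unusual_axis = (0.0, 2.0)  # same length, in a direction that is normally quiet
for r in (along_usual_axis, along_unusual_axis):
    print(l2(r), mahalanobis(r, mean, cov))
```

Both test residuals have the same length, so an $L_2$ score ranks them equally, while the variance-aware score flags the deviation along the normally quiet axis.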

[CV-117] Learning Entropy Signature for Image Representation and Classification

Link: https://arxiv.org/abs/2606.22634
Authors: Jan Glaser,Ivo Bukovsky,Noriyasu Homma,Marcel Jirina
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 2026 13th IEEE International Conference on Intelligent Systems, IS’26 submission 65

Abstract:Learning Entropy (LE) has recently been extended to image analysis through Spatial Learning Entropy Maps (SLEMs), which are two-dimensional LE distributions that highlight unusually high learning activity across an image. Unlike conventional image descriptors, SLEMs are generated by incremental, sample-wise learning of a pretrained feedforward MLP network, where local pixel neighborhoods are presented sequentially in a fixed spatial order to predict the corresponding central pixels. Consequently, the learning activity at each image location depends not only on its local structure but also on the knowledge acquired from previously processed locations. This paper introduces Learning Entropy Signatures (LES), an image descriptor derived from SLEM using the K largest LE locations. LES captures the spatial organization of learning-relevant image structures and provides a compact representation of image content based on learning weight behavior. Experimental evaluation on image classification tasks shows that a relatively small number of K largest LE locations preserve substantial discriminative information. The results indicate a close relationship between the learning of neural weights and information relevance, extending the role of Learning Entropy from time series to images and, within images, from structural point extraction to compact image representation and classification.
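The abstract says the signature is derived from the K largest LE locations but not how they are encoded. As a minimal sketch, assuming the descriptor starts from the coordinates of those locations, the selection step could look like this (the map values are made up):

```python
import heapq

def top_k_locations(le_map, k):
    """Return (row, col) of the k largest values in a 2D map, largest first."""
    cells = [(value, row, col)
             for row, line in enumerate(le_map)
             for col, value in enumerate(line)]
    return [(row, col) for value, row, col in heapq.nlargest(k, cells)]

# Hypothetical Spatial Learning Entropy Map for a 3x3 image.
slem = [[0.10, 0.90, 0.20],
        [0.30, 0.05, 0.80],
        [0.70, 0.00, 0.40]]

signature = top_k_locations(slem, k=3)
print(signature)
```

A fixed K gives every image a descriptor of the same length, which is convenient for a downstream classifier.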

[CV-118] 4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

链接: https://arxiv.org/abs/2606.22631
作者: Chaoyue Li,Boxue Yang,Shengyao Zhou,Haoyang Wu,Rui Qian,Linfeng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:4D dynamic scene understanding requires grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections. Existing paradigms capture only part of this structure: large multimodal models reason over rich visual evidence but rarely preserve metric topology, while vision-language tracking remains tied to fragmented 2D or 3D outputs and local continuation. We therefore introduce 4DVLT, a worldline-centered task for instruction-conditioned 4D dynamic scene understanding in fully observed multi-view video, and Instruct-4D, a benchmark with 129.4K question-answer pairs, 64.7K target entities, 851 scenes, and 9 reasoning-oriented query types. To address this setting, we present 4DTrack, which casts instruction-conditioned tracking as graph-conditioned worldline inference through an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration. On Instruct-4D, 4DTrack-Qwen3.5-9B reaches 62.68 TGA_Top1 and surpasses the best adapted VLT baseline by 19.62 points. These results show that worldline-centered modeling improves both target grounding and recovered worldline quality. The project page is available at this https URL.

[CV-119] DR-Mamba: Automatic Inference-Time Domain Adaptation for Document Image Binarization via Sample-Conditioned Detail-Background Suppression ICDAR2026

链接: https://arxiv.org/abs/2606.22625
作者: Sheng-Wei Chan,Jen-Shiun Chiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ADAPDA 2026 (3rd Workshop on Automatically Domain-Adapted and Personalized Document Analysis), ICDAR 2026 Workshop. 17 pages, 2 figures, 9 tables. Code will be released soon

点击查看摘要

Abstract:Degraded document image binarization is sensitive to domain shifts caused by paper aging, bleed-through, stains, shadows, and uneven illumination, and the foreground-background separation of recent learning-based methods can become unstable on unseen degradation domains. We propose DR-Mamba, a sample-conditioned detail-background suppression framework that performs automatic inference-time domain adaptation for document image binarization. Unlike test-time adaptation methods that require gradient updates or auxiliary data at inference, DR-Mamba adapts to each input document through input-dependent gates within a single forward pass, requiring no target-domain labels, no fine-tuning, and no test-time parameter updates. Instead of using Mamba-style selective scanning as a single generic feature path, DR-Mamba reinterprets it as fast-slow route modeling: a fast detail route captures local stroke structures, while a slow background route accumulates spatially persistent degradation responses. The two routes are integrated through an input-dependent subtractive gate that explicitly suppresses background interference rather than fusing features by addition or concatenation. We further add full-resolution detail-guided reconstruction and thin-stroke-aware supervision to recover fine strokes lost during downsampling. Evaluated under a leave-one-year-out protocol on DIBCO-style benchmarks, where each held-out year is treated as an unseen degradation domain, DR-Mamba shows that per-document, per-location subtractive suppression improves cross-domain robustness, with particularly strong performance on the most severely degraded held-out fold.

[CV-120] OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs

链接: https://arxiv.org/abs/2606.22617
作者: Hao Vo,Phu Loc Nguyen,Khoa Vo,Sieu Tran,Duc Minh Nguyen,Ngo Xuan Cuong,Nghi D. Q. Bui,Anh Nguyen,Duy Minh Ho Nguyen,Ngan Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable performance on 2D visual tasks, yet enhancing their spatial intelligence for real-world applications such as Autonomous Vehicles (AV) remains an open challenge. Existing geometry-aware MLLMs typically rely on auxiliary 3D models at inference time, introducing pipeline complexity and the risk of cascading failures. In this paper, we present OmniSpace, a simple yet effective plug-and-play paradigm for geometry-aware spatial reasoning from purely 2D observations. Motivated by our finding that current MLLMs are bottlenecked by weak cross-view correspondence and depth estimation, OmniSpace introduces a Camera Pose Injector, a Multi-view Epipolar Attention module, and a 3D Geometric Distillation objective that jointly address these two limitations by transferring geometric knowledge into the model. Extensive experiments show that OmniSpace surpasses existing methods on planning benchmarks (nuScenes, Bench2Drive), risk detection (nuInstruct), language (Omnidrive), and generalization (DriveBench).

[CV-121] MapReason-OSM: Can Vision-Language Models Make Graph-Verifiable Mobility Decisions from Street Maps?

链接: https://arxiv.org/abs/2606.22597
作者: Srinivas Venkatanarayanan(1 and 2),Clement Pakkam Isaac(1 and 3) ((1) NVIDIA, (2) University of Central Florida, (3) University of South Florida)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures. Submitted to ACM SIGSPATIAL 2026 (Industrial Track). Code and data: this https URL

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly used to read maps for logistics, delivery, and accessible navigation, where the output is an actionable decision (a route, a pin, a parking choice) that must respect the road network. Yet most map benchmarks grade free-text or multiple-choice answers that cannot be verified against the underlying graph. We present MapReason-OSM, a benchmark and evaluation harness for graph-verifiable mobility decisions on self-rendered OpenStreetMap panels. We render fixed-style maps for ten U.S. downtowns at two aligned zoom scales, overlay a consistent marker grammar, and pair each panel with a hidden street graph and exact oracles, yielding 6,000 instances (12,000 panels across the two zooms) over 12 routing, facility-location, and visual-disambiguation tasks. Models return structured decisions that we snap back to the graph and score for validity, legality, optimality, and constraint satisfaction, plus cross-zoom consistency. Across seven VLMs, models read maps and route simply but fail at graph-cost reasoning (single-facility pin placement is near chance even for frontier reasoning models), and are frequently scale-inconsistent. We release the benchmark, harness, and deterministic generator.
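
To make "graph-verifiable" concrete, the following sketch scores a proposed route against a small directed street graph: it checks that every hop follows an existing edge and compares the route cost with the Dijkstra optimum. It is not the benchmark's harness; the graph encoding, the function names, and the optimality ratio are assumptions for illustration only.

```python
import heapq

def shortest_cost(graph, source, target):
    """Dijkstra over a dict-of-dicts graph with non-negative edge costs."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return float("inf")

def score_route(graph, route):
    """Validity: every hop is a directed edge. Optimality: optimum / route cost."""
    if len(route) < 2:
        return {"valid": False, "optimality": 0.0}
    cost = 0.0
    for a, b in zip(route, route[1:]):
        if b not in graph.get(a, {}):
            return {"valid": False, "optimality": 0.0}
        cost += graph[a][b]
    optimum = shortest_cost(graph, route[0], route[-1])
    return {"valid": True, "optimality": optimum / cost if cost > 0 else 1.0}

# Hypothetical toy graph; directed edges stand in for one-way streets.
toy_graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 1.0},
    "C": {},
}
```

In this toy graph the direct route A to C is valid but costs twice the optimum, while C to A is invalid because no such directed edge exists.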

[CV-122] The Power of Light: Improving Synthetic-to-Real Domain Adaptation through Physically-Based Indirect Illumination

链接: https://arxiv.org/abs/2606.22574
作者: Hooman Tavakoli Ghinani,Tatjana Legler,Martin Ruskowski
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:While synthetic data generation resolves the manual labeling bottleneck in computer vision, minimizing the syn-to-real domain gap requires optimizing rendering variables. This paper presents a systematic study analyzing the impact of lighting configurations and background complexity on object detection performance. We introduce SmartSDG, an automated, reproducible pipeline built on NVIDIA Isaac Sim using Physically-Based Shading (PBS), alongside ILLUM_INTRUCK, a new multi-object industrial benchmark dataset. Through 18 controlled experiments utilizing a state-of-the-art YOLOv12 framework, we demonstrate that complex, indirect lighting configurations paired with domain-relevant background variability significantly increase visual cue richness. Our quantitative findings show that avoiding direct specular peaks preserves crucial surface textures, mitigates the domain gap, reduces false positives, and accelerates model convergence compared to using conventional direct-light synthetic data. Ultimately, we provide actionable virtual scene design guidelines to maximize object detection robustness in industrial automation.

[CV-123] SeFi-Image: A Text-to-Image Foundation Model with Semantic-First Diffusion

链接: https://arxiv.org/abs/2606.22568
作者: SeFi-Team
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training image generation foundation models consumes substantial resources. Previous methods have attempted to leverage semantic guidance to accelerate the training process, yet their experiments were only conducted on simple datasets such as ImageNet, at low resolutions, and with small-scale models. In this paper, we propose SeFi-Image, a text-to-image foundation model built upon semantic-first diffusion, a novel latent diffusion modeling paradigm. We instantiate SeFi-Image at three model scales, 1B, 2B, and 5B parameters, enabling systematic study of scaling behavior and flexible deployment under varying compute budgets. Notably, our largest 5B model was trained with merely 125K A800 GPU hours, corresponding to roughly 10-20% of the training compute used by Z-Image, yet it achieves results comparable to or even superior to Qwen-Image and Z-Image. Despite this modest training compute, SeFi-Image achieves strong performance on a wide range of benchmarks, including GenEval, DPG, LongTextBench, OneIG, and CVTG-2K. Moreover, we provide DMD2-distilled few-step turbo variants for each model scale to accommodate diverse hardware constraints and latency requirements. We publicly release our code and weights, and hope this work offers the community useful insights into semantic-guided diffusion modeling for T2I generation, while also providing practical and readily deployable model options.

[CV-124] HiMatch-AD: DINOv3-driven Hierarchical Matching for Training-free Medical Anomaly Detection

链接: https://arxiv.org/abs/2606.22556
作者: Jiayu Huo,Jingyuan Hong,Meng Zhou,Liyun Chen,Le Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Anomaly detection is essential for medical image analysis, where pathological regions often appear as rare deviations from normal anatomical structures. While training-based methods have achieved promising performance, they require task-specific optimization and extensive normal data, which limits scalability across modalities and institutions. Training-free approaches offer greater flexibility by leveraging pretrained visual representations, yet existing methods typically rely on simple nearest-neighbor retrieval and naive aggregation strategies, which may fail to capture hierarchical semantics and ignore the reliability of multiple anomaly responses. In this work, we propose HiMatch-AD, a DINOv3-driven hierarchical matching framework for training-free medical anomaly detection. Our method first retrieves semantically relevant normal references via dual-branch matching that jointly considers global CLS-token similarity and patch-level representations. Hierarchical anomaly maps are then generated across multiple transformer stages by comparing clustered normal features with query representations. To robustly aggregate anomaly responses, we introduce a unified uncertainty-based fusion mechanism that adaptively weights maps according to their reliability. The entire framework operates without any task-specific training. Extensive experiments on the BMAD benchmark, including brain MRI, liver CT, and retinal OCT datasets, demonstrate that HiMatch-AD consistently outperforms both training-based and DINO-based state-of-the-art methods, which highlights the effectiveness of multi-level matching and uncertainty-aware fusion for scalable medical anomaly detection.

[CV-125] Mitigating Measurement-Induced Training Instability in Hybrid Quantum Neural Networks for Protein Classification

链接: https://arxiv.org/abs/2606.22551
作者: Milton Mondal,Sushovan Chanda,Mohamad Mahdi Alawieh,Brijesh Sukhadiya,Donatus Krah,Clinton Gonsalves,Antonios Ntolkeras,Silvio O. Rizzoli,Ali H. Shaib
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hybrid Quantum Neural Network (QNN) classifiers produce logits as expectation values of quantum measurement operators. For standard Pauli measurements, these outputs are intrinsically bounded to the interval [-1,1]. When such bounded logits are used directly with the cross-entropy loss applied to softmax-normalized logits for multi-class classification, the loss function operates in a regime of weak sensitivity to logit differences. As a consequence, parameter gradients are suppressed, leading to unstable optimization in variational quantum classifiers (VQCs). In this work, we identify this effect as measurement-induced logit contraction, a previously uncharacterized source of trainability degradation in hybrid QNNs. To address this limitation, we introduce a learnable scaling parameter, termed Quantum Measurement Temperature (QMT), which rescales quantum measurement outputs prior to the loss. Unlike post-hoc calibration, QMT acts during training and compensates for the physically imposed bounds on quantum measurement outputs. This rescaling increases gradient magnitude and variance, thereby improving loss sensitivity. The proposed mechanism is architecture-agnostic and does not modify the quantum ansatz, circuit depth, or measurement operators. Experiments on fluorescence microscopy images and a six-class variant of Fashion MNIST demonstrate that QMT consistently enhances logit separation, strengthens gradients, stabilizes training across random initializations, and improves classification accuracy, relative to unscaled measurement readouts. These results demonstrate that QMT enables stable and reliable training of hybrid QNNs for practical applications.
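
One consequence of logits bounded in [-1, 1] can be worked out directly: even in the best case (correct class at +1, all others at -1), softmax cross-entropy cannot fall below log(1 + (K-1)e^{-2}) for K classes. The sketch below computes this floor and shows how a scale factor lowers it. Treating QMT as a plain scalar multiplier on the logits is an assumption made for illustration; the function names are hypothetical.

```python
import math

def softmax_cross_entropy(logits, target):
    """Numerically stable cross-entropy of softmax(logits) against one target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def best_case_loss(num_classes, scale=1.0):
    """Lowest reachable loss when raw logits lie in [-1, 1] and are multiplied by scale."""
    logits = [-1.0 * scale] * num_classes
    logits[0] = 1.0 * scale  # correct class at the upper bound
    return softmax_cross_entropy(logits, 0)

floor_unscaled = best_case_loss(6)           # six classes, no rescaling
floor_scaled = best_case_loss(6, scale=5.0)  # same bounds, logits multiplied by 5
```

For six classes the unscaled floor is about 0.52, so the loss stays far from zero even for a perfect readout, whereas the scaled version can approach zero.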

[CV-126] Venice-H1: Failure-Aware Query Re-Ranking with Multi-Scale Grid Signatures for Referring Image Segmentation

链接: https://arxiv.org/abs/2606.22546
作者: Nicolò Savioli
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 10 figures. Code: this https URL Model: this https URL

点击查看摘要

Abstract:Modern Referring Image Segmentation (RIS) systems generate multiple candidate masks per expression but rely on a simple heuristic (typically the argmax detection score) to select the final output. We identify query selection as a failure-case bottleneck: although heuristic selection succeeds on 82-93% of samples, the residual 7-18% of failures dominate the error budget, leaving a best-query selection gap of 3-11% mIoU. We introduce Venice-H1, a lightweight, backbone-decoupled post-hoc re-ranking module that encodes each candidate through multi-scale grid signatures (compact spatial descriptors pooled onto 4x4, 8x8, and 16x16 grids) and feeds them to a Transformer-based re-ranker with a Failure Gate (ROC AUC 0.78-0.82) that intervenes only when the default choice is likely suboptimal. Instantiated on DeRIS-L and DeRIS-B, Venice-H1 achieves delta_fail of +1.40 and +0.89 mIoU with strictly positive 95% CIs on all 16/16 (split, backbone) pairs and harmful-switch rates below 0.53%. Zero-shot transfer to medical referring segmentation (MS-CXR, M3D-RefSeg-2D) yields +1.16 and +0.51 mIoU without RIS-backbone fine-tuning. The module adds approximately 11.3M parameters and under 1 ms latency.
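
A multi-scale grid signature of the kind described can be sketched as average pooling of a candidate mask onto 4x4, 8x8, and 16x16 grids, followed by concatenation. The abstract does not specify which quantities are pooled, so pooling mask values, the function name, and the divisibility requirement are assumptions.

```python
import numpy as np

def grid_signature(mask, grids=(4, 8, 16)):
    """Average-pool a 2D mask onto each grid size and concatenate the results."""
    mask = np.asarray(mask, dtype=np.float64)
    h, w = mask.shape
    parts = []
    for g in grids:
        if h % g or w % g:
            raise ValueError("mask sides must be divisible by every grid size")
        # Split rows and columns into g blocks each, then average inside blocks.
        pooled = mask.reshape(g, h // g, g, w // g).mean(axis=(1, 3))
        parts.append(pooled.ravel())
    return np.concatenate(parts)
```

With the default grids the descriptor has 16 + 64 + 256 = 336 entries per candidate, regardless of mask resolution.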

[CV-127] MAPS: Multi-Anchor Projection Similarity for Joint Vision-Language Geo-Localization

链接: https://arxiv.org/abs/2606.22543
作者: Yutong Hu,Siyuan Tan,Shaocheng Yan,Pengcheng Shi,Qingwu Hu,Jiayuan Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Humans localize places by integrating perceptual cues from vision with semantic reasoning from language, forming a scene understanding that is both intuitive and structured. Although existing geo-localization models have made substantial progress in cross-view and cross-modal settings, they are largely built upon point-to-point alignment, which is insufficient for joint vision-language queries. In such queries, visual and textual cues do not simply act as independent references, but jointly define a semantic subspace for locating the target. In this paper, we formulate vision-language geo-localization (VLGL) with joint image-text queries as a multi-anchor geometric alignment problem and propose a unified framework for this setting. To realize this formulation, we propose Multi-Anchor Projection Similarity (MAPS), a new metric which constructs an anchor plane from visual and textual query features in a high-dimensional space and measures similarity by the projection length of the target feature onto this plane. Unlike cosine similarity which evaluates isolated pairwise relations, MAPS captures the geometric consistency between the target feature and the joint query subspace, providing a more discriminative ranking criterion during retrieval. To make the learned representation consistent with this geometry, we further introduce a MAPS-based contrastive loss that drives target features toward the corresponding anchor plane. The proposed framework, similarity metric, and training objective jointly yield state-of-the-art performance in VLGL.
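
The projection-based similarity can be sketched as follows: build an orthonormal basis for the plane spanned by the visual and textual query features, project the target feature onto it, and measure the projection length. Normalizing by the target norm (so the score lies in [0, 1]) and falling back to a single anchor when the two query features are collinear are assumptions, not details from the paper.

```python
import numpy as np

def maps_similarity(visual, text, target, eps=1e-12):
    """Length of the target's projection onto span{visual, text}, over the target norm."""
    v = np.asarray(visual, dtype=np.float64)
    t = np.asarray(text, dtype=np.float64)
    x = np.asarray(target, dtype=np.float64)
    u1 = v / (np.linalg.norm(v) + eps)
    w = t - (t @ u1) * u1  # Gram-Schmidt: part of text orthogonal to visual
    basis = [u1]
    if np.linalg.norm(w) > 1e-8:  # skip the second axis if anchors are collinear
        basis.append(w / np.linalg.norm(w))
    coords = np.array([x @ u for u in basis])
    return float(np.linalg.norm(coords) / (np.linalg.norm(x) + eps))
```

For example, with orthogonal anchors `[1, 0, 0]` and `[0, 1, 0]`, the target `[1, 1, 0]` has cosine similarity of only about 0.71 to each anchor individually, yet lies entirely in the anchor plane, so its projection score is close to 1.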

[CV-128] PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models ECCV2026

链接: https://arxiv.org/abs/2606.22540
作者: Xianghui Wang,Feng Chen,Wenbo Zhang,Hua Yan,Zixuan Wang,Changsheng Li,Yinjie Lei
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026. Project page: this https URL

点击查看摘要

Abstract:Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors, namely the effective executable length of predicted action chunks and the total physical steps required to complete a task. These two factors jointly determine the total number of forward inference calls during execution. We observe that current VLA policies struggle with planning unreliability and action redundancy, suffering from severe prediction degradation at the tail of action chunks and tending to generate unnecessarily redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps. For reliable chunk extension, we employ a dynamic exploration strategy that explicitly rewards the successful completion of longer executable lengths, progressively pushing the trustworthy prediction horizon to its empirical limit. For step efficiency, we design a redundancy-aware reward that directly favors successful task completions with fewer steps while penalizing unreproducible shortcuts, effectively eliminating redundant physical actions. Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3× and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83× end-to-end deployment speedup without compromising task success rates.
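
The relationship between the two factors and the number of forward inference calls can be written as a one-line calculation: each call yields one chunk, of which only the executable prefix is used. The numbers below are hypothetical and are not taken from the paper.

```python
import math

def forward_calls(total_steps, executable_len):
    """Inference calls needed when each call contributes executable_len usable steps."""
    return math.ceil(total_steps / executable_len)

# Hypothetical episode: 200 physical steps, 4 usable actions per predicted chunk.
calls_before = forward_calls(200, 4)
# Hypothetical improved policy: half the steps, three times the usable chunk length.
calls_after = forward_calls(100, 12)
```

Under these made-up numbers the call count drops from 50 to 9, which shows why longer reliable chunks and fewer physical steps compound.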

[CV-129] NegAS: Negative Label Guided Attention and Scoring for Out-of-Distribution Object Detection with Vision-Language Models

链接: https://arxiv.org/abs/2606.22537
作者: Yingjie Zhang,Shuai Li,Peng Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection is essential for ensuring the robustness and reliability of object detection systems deployed in safety-critical applications. While prior research has mainly focused on uni-modal detectors or vision-language model (VLM) based classifiers, the potential of VLM-based object detectors in OOD scenarios remains underexplored. In this work, we take the first step toward building OOD object detection methods upon VLMs. We identify two challenges specific to VLM detectors: (i) their text-guided attention enhances foreground with ID labels but treats background uniformly, leaving potential OOD regions unexploited for separating in-distribution (ID) from OOD instances; and (ii) their sigmoid-based multi-label outputs are incompatible with softmax-based OOD scores, calling for scoring functions consistent with VLM probabilistic outputs. Hence, we introduce Negative Label Guided Attention and Scoring (NegAS). To address (i), we propose a negative label guided attention module (NegA), where LLM-generated, visually-similar but semantically-different negative labels are used to guide attention toward potential OOD background regions. To address (ii), we introduce a novel sigmoid-based OOD scoring function (NegS) that leverages both ID and negative labels, producing strong responses for ID instances and suppressed responses for OOD ones. Extensive experiments demonstrate that our approach improves OOD detection performance by a large margin while maintaining ID accuracy, e.g., reducing the FPR95 by 11.4% on the COCO dataset and 25.5% on the OpenImages dataset compared to the baseline model. While initially designed for dense VLM detectors like YOLO-World, we successfully adapt NegAS to Grounding DINO, a query-based VLM transformer and achieve significant improvements, demonstrating the generalizability of our framework.

[CV-130] Trajectory Forcing: Structure-First Generation with Controllable Semantic Trajectories

链接: https://arxiv.org/abs/2606.22527
作者: Merve Kocabas,Gege Gao,Bernhard Schölkopf,Andreas Geiger
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Diffusion and flow-based generative models produce strong images, yet their controllability remains largely endpoint-centric: users specify conditions and receive final outputs, while the intermediate generative dynamics remain hidden. Recent methods have begun to exploit generation order and process decomposition to improve sample quality, but still treat intermediate states as internal computation rather than objects for interaction. We propose Trajectory Forcing (TF), a trajectory-centric framework that makes the generation path explicit, semantic, and editable. TF organizes synthesis as a sequence of semantically structured stages, progressing from global layout to object-, part-, and detail-level representations. Each stage produces a decodable latent state that can be inspected, evaluated, and locally edited before the next stage begins. To instantiate this path, we derive coarse-to-fine teacher hierarchies by clustering pretrained visual representations such as DINOv2, and train a hierarchy-conditioned one-step flow-matching model at each level. We further introduce trajectory-aware metrics that measure structural consistency and local controllability beyond endpoint quality metrics such as FID. Experiments show that TF achieves competitive sample quality while exposing coherent intermediate states and supporting localized edits across semantic levels. By shifting the focus from final images to the generative path itself, TF opens a route toward controllable, trajectory-aware image synthesis.

[CV-131] Projection-Volume Fidelity Divergence: Diagnosing and Controlling Optimization Drift in Sparse-View 3D Gaussian Tomography

链接: https://arxiv.org/abs/2606.22525
作者: Yikuang Yuluo,Ao Wang,Shen Kuan,Yujie Liu,Wang Liao,Ying Chen,Shuangyang Zhong,Yixing Huang,Fuquan Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 9 figures

点击查看摘要

Abstract:Sparse-view computed tomography is a severely ill-posed inverse problem, where recent 3D Gaussian Splatting methods offer an efficient explicit representation for tomographic reconstruction. However, we find that projection-domain optimization can be misleading in this setting: the rendered projections may continue to improve while the reconstructed volume deteriorates. We identify this failure mode as Projection-Volume Fidelity Divergence (PVFD), a representation-level optimization drift caused by anisotropic Gaussian deformation and view-specific primitive co-adaptation under sparse Radon constraints. To characterize this behavior, we introduce geometry- and volume-level diagnostics that measure needle-like Gaussian degeneration and the stability of the voxelized density field. Based on these observations, we propose LADES, a ground-truth-free optimization controller for sparse-view Gaussian tomography. LADES combines Linearly Annealed Dropout, which applies strong stochastic masking in early training to disrupt premature primitive co-adaptation and gradually restores full capacity for structural consolidation, with Structure-Aware Early Stopping, which terminates densification according to the saturation of Gaussian population growth rather than validation PSNR. Experiments on sparse-view CT reconstruction show that LADES improves volumetric fidelity, suppresses structural degeneration, and substantially reduces training time while maintaining competitive projection accuracy. These results suggest that robust Gaussian-based tomography requires monitoring and controlling volumetric structure, rather than optimizing projection fit alone.
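
A linearly annealed dropout schedule of the kind named in the abstract might look like the sketch below: the masking rate starts high and decays linearly to zero, and each Gaussian primitive is kept or dropped independently at the current rate. The initial rate, the annealing to exactly zero at the final step, and the function names are assumptions; the abstract does not give these details.

```python
import random

def dropout_rate(step, total_steps, initial_rate=0.5):
    """Linearly decay the masking rate from initial_rate at step 0 to 0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_rate * (1.0 - progress)

def sample_keep_mask(num_primitives, rate, rng):
    """Keep each primitive independently with probability 1 - rate."""
    return [rng.random() >= rate for _ in range(num_primitives)]

rng = random.Random(0)
early_mask = sample_keep_mask(8, dropout_rate(0, 1000), rng)    # heavy masking
late_mask = sample_keep_mask(8, dropout_rate(1000, 1000), rng)  # full capacity
```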

[CV-132] The Scissors Effect: When Resize-Based Input Diversity Helps or Hurts Transfer Attacks

链接: https://arxiv.org/abs/2606.22516
作者: Yuhang Jiang,Xiaojing Chen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 11 figures, 29 tables

点击查看摘要

Abstract:Input Diversity (DI), which applies random resizing and padding at each attack iteration, is a near-default ingredient of transfer-based adversarial attacks, widely assumed to improve transferability. We show this assumption is regime-dependent and, for robustly trained surrogates, often reversed. Varying only the surrogate, increasing the DI probability raises transfer success for standard surrogates but lowers it for robust ones: the two response curves separate like a pair of scissors, a pattern we call the Scissors Effect. The effect is strong and consistent on ImageNet, where blind DI costs the robust source 10.3% attack success on average across CNN, ViT, Swin, and ConvNeXt targets and across ten attacks spanning 2018-2024; it is smaller on CIFAR-10 unless DI is made aggressive. A controlled robustness-strength sweep that varies only the training budget shows the harm is graded rather than binary, crossing from beneficial to harmful already in the little-robustness regime. We trace it to gradient geometry: a resize/translation decomposition attributes roughly 67% of the harm to resize, and a direct source-target gradient-alignment measurement confirms the same resize operation improves alignment for standard surrogates but degrades it for robust ones. We summarize the regime with Local Gradient Consistency (LGC), a single input-space probe that separates the two surrogate types, and prove a bias-variance crossover theorem isolating where DI helps from where its resize bias dominates. A training-free rule (CG-DI) that disables diversity when LGC is high avoids the loss on robust surrogates while keeping DI’s benefit on standard ones, positioning the Scissors Effect as a DI-specific manifestation of the broader robustness-transferability trade-off.
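
For readers unfamiliar with the transform under study, here is a minimal sketch of a resize-and-pad input diversity step applied with probability `prob`. Nearest-neighbor resizing, zero padding, and returning the input unchanged when the transform is skipped are simplifying assumptions; the function names are hypothetical and this is not the paper's code.

```python
import numpy as np

def nearest_resize(image, size):
    """Nearest-neighbor resize of an (H, W, C) image to (size, size, C)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def input_diversity(image, out_size, prob, rng):
    """With probability prob, resize to a random size and zero-pad to out_size."""
    if rng.random() >= prob:
        return image  # transform skipped
    h = image.shape[0]
    new_size = int(rng.integers(h, out_size))  # random side in [h, out_size)
    resized = nearest_resize(image, new_size)
    pad = out_size - new_size
    top = int(rng.integers(0, pad + 1))   # random vertical offset
    left = int(rng.integers(0, pad + 1))  # random horizontal offset
    padded = np.zeros((out_size, out_size) + image.shape[2:], dtype=image.dtype)
    padded[top:top + new_size, left:left + new_size] = resized
    return padded
```

In an iterative attack this function would be called once per iteration before the surrogate's forward pass; `prob` is the DI probability whose effect the abstract reports as regime-dependent.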

[CV-133] Biological Sex Determination in Cadavers Using Deep Learning Algorithms from Computed Tomography Images of Pelvis and Skull

链接: https://arxiv.org/abs/2606.22515
作者: Giovanna Herculano Tormena,Davi Nascimento Araújo,Germano Coimbra Soares de Carvalho,Gustavo Bruno Centenaro,Rafael Janowski Pozzer,Rodrigo Akira Azevedo Kurosawa,Danilo Aires Alves,Filipe Thiago Xavier de Campos,Pedro Henrique Macedo dos Santos,Pedro Augusto Prado Mota,Ricardo V. Godoy,João Manoel Herrera Pinheiro,Marcelo Becker
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages

点击查看摘要

Abstract:Sexual identification of decomposed cadavers challenges traditional methods dependent on visual anthropological analysis. This study evaluates state-of-the-art deep learning models (including YOLO26, YOLO11, ConvNeXt-Tiny, EfficientNetV2, ViT-B16, VGG16, and ResNet50) with transfer learning to automatically determine biological sex from forensic computed tomography (CT) scans. We analyzed 141 autopsied cadavers from the Forensic Medical Institute of Goiânia-GO, including a broad age range and varying conditions of preservation. The three-dimensional reconstructions of the pelvis and skull were converted into standardized two-dimensional profile projections, contributing to the study of this new technical approach. Data augmentation techniques compensated for sample limitations. Two scenarios were validated: binary and quaternary classification (one class per sex vs. one class per anatomical region of each sex). The best-performing model achieved highly consistent results on the pelvis region and still satisfactory performance on the skull region, reaching an overall patient-level accuracy of 95.65%, recall of 92.86%, F1-score of 94.36%, and precision of 97.22%, maintaining consistent performance across the evaluated cases, including those with trauma-related artifacts. Results indicate the technical feasibility of the methodology, demonstrating that deep learning models can provide objective, high-speed skeletal analysis. Since the study was conducted using data from a single institution and a single computed tomography scanner, further validation across multiple centers and scanners is required to assess the generalizability of the proposed approach.

[CV-134] Benchmarking Vision-Language Models for Microscopic Plant Image Understanding

链接: https://arxiv.org/abs/2606.22497
作者: Tianqi Wei,Xin Yu,Zhi Chen,Scott Chapman,Zi Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Microscopic imaging provides essential visual evidence for studying plant biology and pathology at the cellular and subcellular levels. However, existing benchmarks on vision-language models primarily focus on macroscopic plant imagery, while the microscopic domain remains underexplored. To address this gap, we present PlantMicro, a comprehensive benchmark for evaluating vision-language models (VLMs) in microscopic plant imagery. PlantMicro integrates more than 5,000 images collected across diverse hosts, biological domains, and imaging modalities. Building on this diversity, we design a set of complementary tasks that capture different facets of microscopic image understanding. To support these tasks, we construct over 9,000 VQA pairs that systematically evaluate the capabilities of VLMs. Experiments on PlantMicro show that current VLMs struggle with fine-grained recognition and biologically grounded reasoning. For example, GPT-5 achieves 34.93% accuracy on the pathogen classification task, which is only modestly above the random-guessing baseline. The results highlight a significant gap in current VLMs’ ability to comprehend plant microscopic images. PlantMicro provides a standardized foundation for advancing VLMs toward reliable and comprehensive microscopy-level plant understanding.

[CV-135] FetSelect: Task-Specific Architectures and Self-Supervised Learning for Automated Fetal Ultrasound Frame Selection

链接: https://arxiv.org/abs/2606.22487
作者: Mahmood Alzubaidi,Raden Muaz,Uzair Shah,Mohammed Ammar,Khalid Alyafei,Mowafa Househ,Marco Agus
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in 30th Conference on Medical Image Understanding and Analysis

点击查看摘要

Abstract:Automated frame selection for fetal biometry remains under-addressed, with most prior work targeting generic quality assessment or downstream measurement pipelines that assume suitable frames are available. We introduce FetSelect, a task-specific framework that pairs a frozen vision foundation backbone with a hybrid multi-head design: a Task-Gated classification head and a Detection-derived quality head combined via learned fusion. We curate 6,486 expert-labeled frames across four targets: Crown-Rump Length (CRL), Nuchal Translucency (NT), Nasal Bone (NB), and Scalebar, and adapt the backbone with BYOL pretraining on 19,019 unlabeled images. On a held-out test set (974 frames), FetSelect achieves mean AUROC 0.956 and mean correlation 0.818 with expert quality annotations. Ablations confirm that hybrid fusion surpasses single-head variants, and ultrasound-specific self-supervision yields consistent gains. Evaluation on external clinical videos and 509 external CRL images demonstrates task-specific discrimination.

[CV-136] Lighting-Consistent Object Transfer Across Radiance Fields

Link: https://arxiv.org/abs/2606.22481
Authors: Nicolás Violante, George Kopanas, Linus Franke, Julien Philip, George Drettakis
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:3D Gaussian Splatting (3DGS) is widely used to capture and render real scenes. Compositing objects from one capture into another has applications in many domains, such as VFX, architecture and interior design, or marketing. However, extracting an object from a source scene and naively pasting it into a target scene will fail to produce realistic results due to the different lighting conditions between the two scenes. To address this problem, we introduce a diffusion model that harmonizes naively composited images with inconsistent lighting. The model is trained with a heterogeneous dataset of image pairs (inconsistent composite input, consistent output), combining synthetic, generated, and real data. Our complete 3D solution allows a user to extract an object from the source scene and composite it into the target scene. From this, the (inconsistent) views of the target scene with the composite object are rendered. Our diffusion model harmonizes each one of these views, which are finally consolidated in a 3DGS representation with a post-optimization step. Our method provides visually compelling results, making object transfer between 3DGS easy to use and significantly improving quality compared to previous methods.

[CV-137] Physically-guided Image Generation for Multi-Projection Mapping

Link: https://arxiv.org/abs/2606.22477
Authors: Xingyun Liu, Yuqi Li, Jinhui Xiang, Pinyan Tang, Chong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Abstract:Projection Mapping (PM) enables seamless superimposition of digital content onto real-world 3D objects, serving as a fundamental technique for immersive visualization, digital twins, and interactive art. Although text-to-image diffusion models have greatly facilitated customized content creation, directly integrating them into practical PM pipelines remains challenging due to the mismatch between idealized 2D generation and physical constraints. To bridge this gap, this paper formalizes two application-level generative paradigms: the cooperative paradigm (harmonizing generated semantics with physical attributes) and the adversarial paradigm (eliminating surface interference via radiometric compensation). Based on this, we propose ConPhyG, a unified controllable physically-guided generative multi-projection mapping framework that enables creators to interactively adjust physical constraints and flexibly switch generative paradigms. In cooperative mode, multi-dimensional physical priors (per-pixel gamut, depth, and edges) are injected into the diffusion process. In adversarial mode, the framework releases the generative potential and applies bounded numerical optimization for multi-projector radiometric compensation. It allows users to dynamically switch constraints to balance artistic freedom with physical feasibility. Furthermore, we extend ConPhyG to 360-degree multi-view consistent PM using a sequential generation strategy. Quantitative and qualitative evaluations on a real-world four-projector setup demonstrate that ConPhyG significantly outperforms state-of-the-art methods in geometric alignment, gamut utilization, and semantic fidelity.

[CV-138] CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming

Link: https://arxiv.org/abs/2606.22476
Authors: Ruixun Liu, Lingyu Zhang, Lanxuan Xue, Kaiyu Li, Bowen Fu, Xiangyong Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Humans can effortlessly reason about scenes across different viewpoints, yet it remains unclear whether Vision-Language Models (VLMs) possess similar cross-view spatial abilities. Satellite-street scene pairs, with their complex contexts and extreme viewpoint variations, provide an ideal testbed. Motivated by this, we introduce CVSBench, a large-scale benchmark for evaluating cross-view spatial reasoning through satellite-street pairs. This benchmark supports multiple tasks, including cross-view VQA, cross-view grounding, and viewpoint identification. CVSBench comprises 3,297 cross-view image groups with 9,468 object-level annotations and 40,679 question-answer (QA) pairs, enabling systematic and controlled evaluation of cross-view spatial reasoning. Extensive evaluations reveal that advanced VLMs struggle to maintain object-level and layout consistency under drastic viewpoint changes. To bridge this gap towards human-like spatial cognition, we investigate two categories of approaches: spatially grounded reasoning and the incorporation of cognitive map inputs. Our findings demonstrate that language-only reasoning yields marginal improvements, while incorporating visual spatial imagination via a 3D scene imagination pipeline substantially improves cross-view reasoning. These results highlight the necessity of explicit visual-spatial representations for robust spatial cognition in VLMs. Our data and code are released at this https URL.

[CV-139] DreamUV: Unwrap Artist-like UV by End-to-End Flow Matching

Link: https://arxiv.org/abs/2606.22445
Authors: Quanyuan Ruan, Jiabao Lei, Xingyi Du, Xifeng Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:UV parameterization is a fundamental step in 3D content creation, yet producing production-ready UV layouts remains challenging due to the gap between geometric distortion objectives and the stylistic preferences of professional artists. While classical methods optimize handcrafted energy functions, artist-authored UVs exhibit structural patterns such as straightened seams, axis-aligned islands, and flexible interior deformation, properties that are difficult to explicitly formulate. In this work, we present DreamUV, an end-to-end learning framework that formulates UV unwrapping as a generative Flow Matching problem. Rather than predicting a single optimal parameterization, DreamUV learns a mesh-conditioned transport process that maps noise samples to a distribution of artist-like UV layouts. To reflect real-world authoring practices, we introduce a boundary-aware training strategy that prioritizes seam geometry, and a Model-in-the-Loop Finetuning (MITL) scheme that explicitly accounts for discretization errors during sampling and stabilizes transport dynamics under heterogeneous supervision. We evaluate DreamUV on a large-scale dataset of professionally authored UV layouts. Experiments demonstrate that our method produces significantly straighter boundaries and tighter axis-aligned islands than both classical and learning-based baselines, while maintaining competitive distortion metrics. Qualitative results and a user study with professional artists further confirm that DreamUV generates UV layouts that are not only valid, but aligned with practical production requirements.

[CV-140] Curvature-aware 3D length estimation of greenhouse cucumbers using RGB-D imaging and cubic spline arc-length integration

Link: https://arxiv.org/abs/2606.22439
Authors: Manveen Kaur, Rajmeet Singh, Saeed Mozaffri, Shahpour Alirezaee
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Commercial greenhouse cucumber production is graded by fruit length, which drives harvest scheduling, labour allocation, and logistics. Manual measurement with thread or caliper is accurate but infeasible at commercial scale. This paper presents CucumberVision, a non-contact length estimation framework using an Intel RealSense D435 RGB-D camera. A YOLO26n instance segmentation model locates cucumbers, and SAM (ViT-B backbone) refines each detection to a pixel-precise mask. Five methods are evaluated under matched conditions: (M1) a dominant-axis skeleton scan-line baseline; (M2) PCA on the bounding-box depth point cloud; (M3) SAM mask with medial-axis skeletonisation; (M4) a hybrid keypoint-guided approach using a YOLO26-pose model predicting five anatomical landmarks (KP0–KP4) with piecewise 3D arc-length; and (M5) a novel medial arc spline method fitting a cubic spline through the 3D medial axis of the SAM mask and computing arc length by trapezoidal integration – the first such application to elongated vegetable measurement. All methods share five-frame burst depth averaging, colour-stream intrinsic alignment, and adaptive method selection with cascading fallbacks ensuring 100% coverage. A benchmark of 48 captures across seven cucumbers in three size categories (small ~8 cm, medium ~13 cm, large ~25 cm) with thread-based ground truth yields the following MAPE values: M1 (9.68%), M2 (5.31%), M4 (5.51%), M3 (5.82%), and M5 (4.13%). M5 significantly outperforms all competitors at Bonferroni-corrected alpha=0.0125. A secondary contribution is identifying a 12–18% length underestimation caused by using depth-stream rather than colour-stream intrinsics after this http URL(this http URL) – an under-reported error source. The complete system is released open source and runs in real time on a single consumer-grade GPU.
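
The M5 idea described above (a cubic spline through 3D medial-axis points, with arc length obtained by trapezoidal integration) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the natural boundary condition, the chord-length parameterization, the `samples_per_segment` value, and every function name are hypothetical choices made for the example. Input points are assumed to be distinct 3D coordinates in metres.

```python
import math

def natural_spline_second_derivs(t, y):
    """Second derivatives M[i] of a natural cubic spline (M[0] = M[n] = 0)."""
    n = len(t) - 1
    M = [0.0] * (n + 1)
    if n < 2:
        return M  # two points: the spline degenerates to a straight segment
    h = [t[i + 1] - t[i] for i in range(n)]
    # Tridiagonal system for the interior unknowns M[1..n-1].
    a = [h[i - 1] for i in range(1, n)]
    b = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
    c = [h[i] for i in range(1, n)]
    d = [6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
         for i in range(1, n)]
    # Thomas algorithm: forward elimination, then back substitution.
    for k in range(1, n - 1):
        w = a[k] / b[k - 1]
        b[k] -= w * c[k - 1]
        d[k] -= w * d[k - 1]
    x = [0.0] * (n - 1)
    x[-1] = d[-1] / b[-1]
    for k in range(n - 3, -1, -1):
        x[k] = (d[k] - c[k] * x[k + 1]) / b[k]
    M[1:n] = x
    return M

def spline_derivative(t, y, M, i, u):
    """First derivative of the spline on segment i at parameter u."""
    h = t[i + 1] - t[i]
    return (-M[i] * (t[i + 1] - u) ** 2 / (2.0 * h)
            + M[i + 1] * (u - t[i]) ** 2 / (2.0 * h)
            + (y[i + 1] - y[i]) / h
            - h * (M[i + 1] - M[i]) / 6.0)

def arc_length(points, samples_per_segment=50):
    """Arc length of a natural cubic spline through ordered 3D points."""
    # Chord-length parameterization (an assumption of this sketch).
    t = [0.0]
    for p, q in zip(points, points[1:]):
        t.append(t[-1] + math.dist(p, q))
    coords = list(zip(*points))  # separate x, y, z sequences
    Ms = [natural_spline_second_derivs(t, c) for c in coords]
    total = 0.0
    for i in range(len(t) - 1):
        step = (t[i + 1] - t[i]) / samples_per_segment
        speeds = []
        for k in range(samples_per_segment + 1):
            u = t[i] + k * step
            speeds.append(math.sqrt(sum(
                spline_derivative(t, c, M, i, u) ** 2
                for c, M in zip(coords, Ms))))
        # Trapezoidal rule applied to the speed |r'(u)| on this segment.
        total += step * (sum(speeds) - 0.5 * (speeds[0] + speeds[-1]))
    return total
```

For a curved fruit, the integrated spline length exceeds the straight end-to-end distance, which is the kind of curvature effect a straight-line measurement would miss.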

[CV-141] MMGist: A Comprehensive Multimodal Benchmark for 2027

Link: https://arxiv.org/abs/2606.22437
Authors: Wenzhen Yuan, Jiacheng Ruan, Wutao Xiong, Chengping Zhao, Ting Liu, Yuzhuo Fu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:We conduct a systematic study of 18 widely used vision-language benchmarks and identify three major issues: 1) many items do not rely on visual cues and therefore fail to effectively measure multimodal understanding; 2) many items are already close to performance saturation for current LVLMs, which limits their discriminative power; 3) a small number of anomalous items affect the reliability of evaluation results. To this end, we propose MMGist, a curated benchmark that covers seven capability dimensions and contains 7,262 items. MMGist is constructed through a three-stage pipeline, which sequentially combines text-ablation filtering, cross-model saturation filtering, and anomaly detection filtering. We conduct extensive experiments on 27 leading LVLMs and compare MMGist with the raw pool of 23,250 items. The results show that MMGist preserves model rankings with high fidelity, with Spearman ρ = 0.98, while reducing evaluation items by 69% and improving cross-model discrimination by 78%. Further results indicate that Visual Logic remains a systematic weakness of current LVLMs, while knowledge-intensive dimensions such as Expert Knowledge remain important for distinguishing closed-source models from open-source models. These findings suggest that high-quality evaluation should prioritize visual dependency, discriminative power, and reliability, rather than simply pursuing benchmark scale.
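
The ranking-fidelity claim above rests on Spearman's rank correlation between model scores on the curated set and on the raw pool. As a rough illustration of that statistic (not the authors' evaluation code), the sketch below computes it as the Pearson correlation of average ranks. The function names and the example scores are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical accuracies of five models on a full pool and a curated subset.
full_pool = [61.2, 64.8, 70.1, 73.5, 78.9]
curated = [40.3, 45.0, 52.7, 60.2, 58.8]  # the top two models swap places
print(spearman_rho(full_pool, curated))
```

A value near 1 means the subset orders the models almost exactly as the full pool does, even though the absolute scores differ.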

[CV-142] FlowDec: Temporal Conditional Flow Decorruptor for Robust Continuous Vision-Language Navigation

Link: https://arxiv.org/abs/2606.22424
Authors: Yufei Zhang, Changhao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions in unseen scenes. While Large Models (LMs) have advanced VLN-CE, their performance remains severely degraded by real-world visual corruptions, a critical yet underexplored domain constraint. We introduce Temporal Conditional Flow Decorruptor (FlowDec), a novel image restoration framework tailored for LM-based VLN-CE. FlowDec integrates a hybrid temporal conditioning strategy to align the generative flow path with historical context and employs action-centroid guided filtering to dynamically assess and integrate outputs. Extensive experiments demonstrate that FlowDec outperforms state-of-the-art decorruption methods in both navigation accuracy and generation latency. Our approach establishes a robust, efficient paradigm for resilient embodied navigation in unpredictable real-world conditions.

[CV-143] Gen2Balance: Generative Balancing for Long-Tailed Video Action Recognition ECCV2026

Link: https://arxiv.org/abs/2606.22416
Authors: Prajwal Gatti, Simon Jenni, Fabian Caba Heilbron, Dima Damen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:We address the problem of training on long-tailed data for video action recognition. We propose to augment the training set using a text-to-video generative model, conditioned on diverse text prompts grounded in action profiles and training exemplars. Our approach, called Gen2Balance, converts an imbalanced training set into a balanced combination of real and generated video clips. To effectively learn from such data, we employ a two-stage training strategy that mitigates domain shift and yields significant improvements. We evaluate on long-tailed versions of standard benchmarks: UCF-101 (UCF-LT) and a 100-class subset of Kinetics (K100-LT) selected to prioritise temporally challenging actions. Gen2Balance improves accuracy over the strongest baselines for long-tailed learning by 5.1% and 7.0% on the respective datasets. On rare actions from the RareAct dataset (e.g., cut keyboard), Gen2Balance improves accuracy by 31.9%, demonstrating effectiveness for scarce actions. By varying the amount of synthetic data added, we show that partial balancing already achieves 79% of the performance gains at 27% of the compute cost on K100-LT, highlighting the practical scalability of Gen2Balance.
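
To make the balancing step concrete, the sketch below computes how many generated clips each class would need so that a long-tailed training set becomes balanced. The abstract does not state the exact balancing rule, so topping every class up to the size of the largest class is an assumption of this example, as are the `fraction` parameter for partial balancing, the function name, and the class counts.

```python
def synthetic_clips_needed(class_counts, fraction=1.0):
    """Generated clips per class to top classes up toward the largest class.

    fraction=1.0 balances fully; smaller values add only part of the gap
    (a hypothetical way to model partial balancing).
    """
    target = max(class_counts.values())
    return {name: round(fraction * (target - count))
            for name, count in class_counts.items()}

# Hypothetical long-tailed class sizes.
counts = {"walking": 100, "juggling": 40, "cut keyboard": 10}
print(synthetic_clips_needed(counts))        # full balancing
print(synthetic_clips_needed(counts, 0.5))   # half of the gap
```

Under this rule the rarest classes receive the most generated clips, while the largest class receives none.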

[CV-144] Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding

Link: https://arxiv.org/abs/2606.22409
Authors: Haodi Liu, Xinhang Yang, Kunda Yan, Sen Cui, Zeyu Zhang, Changshui Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Robots operating in everyday environments must understand fine-grained human actions, intentions, and contextual cues from broad views where people occupy only small regions, a capability unmet by current systems. While open-vocabulary action recognition methods remain limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs, neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. We propose Gold Points Sniper (GPS), a novel framework that empowers lightweight VLMs with self-guided multimodal reasoning capabilities for fine-grained human action understanding. Our approach comprises three key modules: Gold Points Extractor trains VLMs to identify critical action-relevant details, Selective Socratic Questioner validates and refines these details through selective self-questioning, and Semantic Entailment Evaluator quantitatively assesses factual consistency using semantic entailment classification. Extensive experiments on our curated instruction-tuning dataset based on the CAP benchmark demonstrate that GPS-enhanced lightweight VLMs achieve substantial performance improvements, with some models reaching performance comparable to proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at this https URL.

[CV-145] Multi-cancer detection using a computationally efficient CNN with transfer learning

Link: https://arxiv.org/abs/2606.22400
Authors: Vasileios E. Papageorgiou, Georgios Petmezas, Dimitrios-Panagiotis Papageorgiou, Leandros Stefanopoulos, Nicos Maglaveras
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:This study introduces a computationally efficient convolutional neural network (CNN) architecture enhanced with transfer learning for multi-cancer detection using biomedical images. The proposed lightweight CNN model is designed to reduce computational complexity while maintaining high classification performance, making it suitable for deployment in resource-constrained environments. We evaluate this approach on three distinct tumor datasets comprising brain magnetic resonance imaging (MRI) and lung and kidney computed tomography (CT) scans. The model achieves test accuracy of 90.85 ± 2.22%, 98.64 ± 2.43% and 99.92 ± 0.08% for brain, lung, and kidney cancer classification, respectively, using 5-fold stratified cross-validation (CV). Transfer learning is employed by pretraining the model on one cancer type and fine-tuning it on the others, requiring only 20 additional epochs to achieve performance comparable to models trained from scratch. The fine-tuning process involves updating the classification part of the CNN and requires approximately 0.014 seconds per image per epoch using an NVIDIA GeForce GTX 960. Comparative evaluations show that the proposed model outperforms several state-of-the-art pretrained architectures, such as Xception, VGG16, VGG19, MobileNetV2 and DenseNet121. Overall, the model’s effectiveness is evaluated across three types of cancer with distinct morphological characteristics, assessing its performance on both MRI and CT imaging modalities and demonstrating robust performance across diverse tasks and data types. These findings underscore the potential of streamlined deep learning (DL) frameworks in accelerating cancer diagnosis without sacrificing accuracy, especially in settings with limited computational resources.

[CV-146] Curvature-Adaptive Consistency Flow Matching: Autonomous Trajectory Optimization via Reinforcement Learning ECCV2026

Link: https://arxiv.org/abs/2606.22394
Authors: Songtao Tian, Guhan Chen, Bohan Li, Jingyi Ma, Zixiong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Abstract:Consistency distillation has significantly accelerated the inference of diffusion models. In this work, we reveal an intriguing asymmetry: while Logit-Normal sampling priors are highly efficacious for standard iterative generation, consistency distillation exhibits a distinctly different difficulty profile (e.g., U-shaped). We identify that the primary optimization bottlenecks reside at the boundary stages (initialization or final refinement) rather than the intermediate steps. To address the limitations of static sampling in accommodating evolving learning requirements, we propose Curvature-Adaptive Consistency Flow Matching (CACFM). By formulating distillation as a dynamic decision process, CACFM employs a lightweight Reinforcement Learning agent to actively probe Probability Flow ODE trajectories, automatically constructing an efficiency-oriented curriculum that prioritizes critical regions without manual scheduling. Integrated with a novel Flow Distribution Matching Distillation (DMD) objective, our approach achieves new state-of-the-art results on large-scale models such as FLUX and SDXL. It effectively mitigates structural deformities and preserves high-frequency details in extreme few-step regimes, achieving unprecedented visual fidelity.
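
The asymmetry noted above is easier to see with numbers. A Logit-Normal prior draws a timestep as the sigmoid of a Gaussian sample, so most of its mass sits in the middle of (0, 1), whereas the abstract reports that consistency distillation is hardest near the boundaries. The sketch below only illustrates the static prior, not CACFM itself. The `mean`, `std`, and `seed` defaults and the function names are assumptions.

```python
import math
import random

def sample_logit_normal(n, mean=0.0, std=1.0, seed=0):
    """Draw n timesteps t = sigmoid(mean + std * z), with z ~ N(0, 1)."""
    rng = random.Random(seed)
    return [1.0 / (1.0 + math.exp(-(mean + std * rng.gauss(0.0, 1.0))))
            for _ in range(n)]

def histogram(ts, bins=5):
    """Count samples in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for t in ts:
        counts[min(int(t * bins), bins - 1)] += 1
    return counts

ts = sample_logit_normal(10_000)
print(histogram(ts))  # the central bins receive most of the samples
```

A static prior of this shape rarely visits timesteps near 0 or 1, which is where the abstract locates the main optimization bottlenecks.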

[CV-147] Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers ECCV2026

Link: https://arxiv.org/abs/2606.22383
Authors: Edwin Kwadwo Tenagyei, Lei Wang, Ugochukwu Ejike Akpudo, Jun Zhou, Yongsheng Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 19th European Conference on Computer Vision (ECCV 2026)

Abstract:Parameter-efficient fine-tuning (PEFT) has become a practical solution for adapting large pretrained vision transformers (ViTs) to downstream tasks while updating only a small subset of parameters. However, existing adapter-based methods perform adaptation independently for each token, implicitly assuming that token refinements should be learned in isolation. This token-wise formulation overlooks the structured relationships among tokens that naturally arise in visual scenes, potentially leading to redundant updates and spatially inconsistent feature refinement. In this work, we revisit the design of parameter-efficient adapters and propose to perform adaptation in hyperedge space rather than token space. We introduce HyperAdapter, a hypergraph-based adapter architecture that enables structured, group-aware adaptation through soft token routing. HyperAdapter constructs a soft hypergraph over ViT tokens using prototype-based assignments, aggregates token features into latent hyperedge representations, applies lightweight bottleneck adaptation at the hyperedge level, and diffuses the resulting updates back to tokens via the hypergraph incidence structure. This design injects an explicit structural inductive bias into PEFT while preserving the modularity and efficiency of standard adapters. Extensive experiments across diverse visual benchmarks demonstrate that structured hyperedge adaptation consistently outperforms strong PEFT baselines under comparable parameter budgets, with particularly pronounced gains on tasks requiring structured reasoning. Our results suggest that the choice of adaptation space is a critical yet underexplored dimension in parameter-efficient transfer for ViTs.

[CV-148] Enhancing Road Safety: An IoT-Based Accident Detection and Prevention Mechanism

Link: https://arxiv.org/abs/2606.22381
Authors: Prabhu Pugalenthi, Pramod Krishnaa Dhanbalan
Categories: Emerging Technologies (cs.ET); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 4 pages, 4 figures, 1 table

Abstract:Road traffic accidents remain a critical global crisis, consistently serving as a primary driver of preventable mortality and severe injury. These incidents are frequently precipitated by human error, including overspeeding, driving under the influence of alcohol, and cognitive fatigue. To address this urgent public safety challenge, this paper presents an intelligent, Internet of Things (IoT)-based Accident Prevention and Detection System (APDS) designed to systematically mitigate driver risk and optimize post-collision emergency responses. The proposed framework features a multi-tiered architecture capable of executing continuous real-time telemetry monitoring, proactive local alarm triggering, and automated situational intervention. Furthermore, the system integrates automated emergency communication protocols that aggregate immediate spatial coordinates via GPS and dispatch targeted alerts to medical facilities in close proximity, thereby optimizing response times and reducing accident-related fatalities.

[CV-149] Following the Flow: Advection-Consistent Modeling for Event-based Small Object Detection ECCV2026

Link: https://arxiv.org/abs/2606.22378
Authors: Wen Guo, Fulong Cai, Wuzhou Quan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026. Code: this https URL

Abstract:Event cameras enable high-frequency visual perception with microsecond latency, offering advantages for dynamic scenes. However, event-based small object detection remains challenging due to sparse asynchronous measurements and weak object responses that are easily disrupted by noise. Limited spatial support causes small-object signals to lose temporal continuity, resulting in fragmented and unstable predictions. To address this issue, we propose a physics-guided advection-consistent modeling framework, termed PACT, which formulates event evolution as a motion-driven feature transport process. Instead of relying solely on local spatio-temporal aggregation, PACT propagates features along estimated velocity fields and enforces trajectory-level consistency through advection constraints. This design preserves weak event responses over time and prevents their degradation under complex background interference. Technically, PACT integrates motion-aware feature extraction with a differentiable advection-based transport operator, enabling coherent motion representation and effective noise suppression during temporal evolution. Extensive experiments on benchmark event-based datasets demonstrate that PACT consistently outperforms state-of-the-art methods, achieving improvements of 20.72% in IoU and 15.03% in accuracy while maintaining comparable computational efficiency. The code is publicly available at this https URL.

[CV-150] Towards Error-Free Long Video Generation

Link: https://arxiv.org/abs/2606.22370
Authors: Shuning Chang, Weihua Chen, Jiasheng Tang, Hao Xu, Zeyu Zhang, Hangjie Yuan, Yu Lu, Ruigang Niu, Fan Wang, Bohan Zhuang, Yi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advances in video generation have made minute-level synthesis possible; however, generating long videos remains challenging due to error accumulation, attribute drift, and the limited availability of long video data. In this paper, we introduce an infinite-length video generation framework that addresses these issues and produces high-quality, dynamic, and identity-consistent single-shot long videos. We first finetune a diffusion model as a video extension model on large-scale short video data to autoregressively generate temporally coherent clips. Inspired by the success of large language models (LLMs), we adopt causal attention computation between clips to further finetune this model on long video data. In this way, the tokens in one clip (short video) are computed by bidirectional attention while tokens among clips are computed by unidirectional attention. This design leverages the strengths of modern diffusion models while preserving long-term context information, effectively mitigating error accumulation and attribute drift. To achieve memory efficiency during inference, we adopt a key-value (KV) caching mechanism to maintain a constant KV memory. Furthermore, we introduce a truncation-rectified flow (T-RFlow) technique to further suppress error accumulation. Experimental results demonstrate the effectiveness of our method. Our framework establishes a new benchmark for realistic and coherent minute-level video synthesis.
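
The attention pattern described above (bidirectional within a clip, unidirectional across clips) corresponds to a block-causal mask. The sketch below builds such a mask as a nested list of booleans. It assumes equally sized clips and a flat token ordering by clip, and the function name is hypothetical. It is an illustration of the pattern rather than the authors' code.

```python
def block_causal_mask(num_clips, tokens_per_clip):
    """mask[i][j] is True when query token i may attend to key token j.

    Tokens attend to every token in their own clip (bidirectional) and to
    all tokens in earlier clips, but never to later clips (causal).
    """
    n = num_clips * tokens_per_clip
    clip_of = [i // tokens_per_clip for i in range(n)]
    return [[clip_of[j] <= clip_of[i] for j in range(n)] for i in range(n)]

# Two clips of two tokens each: 1 = may attend, 0 = masked.
for row in block_causal_mask(2, 2):
    print("".join("1" if allowed else "0" for allowed in row))
```

Because tokens in earlier clips never attend to later ones, their keys and values can be computed once and reused, which is what makes the KV caching mentioned in the abstract possible.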

[CV-151] Interest Entanglement: The Hidden Barrier to Blind Super-Resolution Optimization

Link: https://arxiv.org/abs/2606.22353
Authors: Junxiong Lin, Xinji Mai, Qianyu Guo, Haoran Wang, Zeng Tao, Xuan Tong, Ivy Pan, Wenqiang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Fidelity and perceptual quality are two inherently conflicting objectives in the image super-resolution (SR) task. Different loss functions focus on these objectives to varying extents. Regression losses enhance the model’s fidelity but pay insufficient attention to high-frequency content, resulting in a loss of fine details. In contrast, perception losses improve the model’s visual quality but may introduce undesirable artifacts. Balancing these two optimization goals can be viewed as a Multi-Objective Optimization problem. Existing methods are limited to cautiously adjusting weight parameters between these losses, overlooking the underlying Interest Entanglement problem. To address this problem, we explore the inherent frequency-domain conflict between the regression objective and the perceptual objective, and analyze the causes of Interest Entanglement in SR tasks. According to our findings, we propose the Shared-Feature-Representation based Super-Resolution framework (SFR), which decouples the learning process of different optimization objectives, allowing the model to explore a common optimization direction for both goals and achieve an effective balance between them. To better leverage shared features, we also propose the InfoSqueeze module, which filters redundant information through a dimensionality reduction and expansion process, effectively transforming features into a consistent space. Quantitative and qualitative experiments across five representative datasets affirm the superiority of SFR.

[CV-152] Reliability-Guided Adaptive Ensembling for Robust Test-Time Adaptation ECML2026

Link: https://arxiv.org/abs/2606.22351
Authors: Adam Koziak, Yuhong Guo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ECML 2026

Abstract:Test-time adaptation (TTA) can mitigate domain shift without source data, but it is highly brittle under adversarially contaminated test streams, where corrupted inputs also destabilize online updates. We study robust test-time adaptation (RTTA) in the adversarial-stream setting, which remains comparatively underexplored relative to standard TTA, and propose SAFER (Stochastic Augmentation Framework for Enhanced Robustness), a training-free reliability-guided augmentation wrapper for RTTA. SAFER preserves the wrapped TTA objective while replacing brittle single-view predictions with a reliability-guided pooled predictor. For each test sample, SAFER generates stochastic augmentations and aggregates their predictions through correlation-weighted pooling with outlier detection. We further study an adaptive-mixing extension that improves clean-performance retention by adjusting original-versus-augmentation weighting using feature disagreement signals. We evaluate on PACS, VLCS, and OfficeHome under PGD attacks at various attack rates. Across benchmarks, SAFER improves resilience of TTA methods to adversarial attacks while maintaining competitive clean performance.

[CV-153] Customizing Video Portraits via Identity-Action Decoupling

Link: https://arxiv.org/abs/2606.22347
Authors: Junxiong Lin, Haoran Wang, Xinji Mai, Zeng Tao, Xuan Tong, Ivy Pan, Wenqiang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Identity-Preserving Text-to-Video Generation (IPT2V) seeks to synthesize a temporally coherent video from a reference image and a textual description, while simultaneously preserving the subject’s identity and allowing fine-grained control over facial dynamics. Although recent methods such as ID-Animator and ConsisID inject identity features only at inference time, they ignore the ID-irrelevant information contained in the facial embedding, leading to monotonous or inaccurate facial movements that poorly follow the prompt. To solve this problem, we introduce the Identity-Action Decoupling (IaD) framework together with two loss functions, Identity Decoupling Loss and Text Alignment Loss. Without any subject-specific fine-tuning, IaD yields videos that (1) maintain cross-temporal identity consistency and (2) exhibit rich, controllable expressions and scene variations that closely match the input text.

[CV-154] T-IMPACT: A Severity-Aware Benchmark for Contextual Image-Text Manipulation

Link: https://arxiv.org/abs/2606.22339
Authors: Gagandeep Singh, Aaditya Yadav, Priyanka Singh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures

Abstract:Recent advances in vision-language models and generative editing systems have made it increasingly easy to produce persuasive multimodal misinformation by altering images, text, or both jointly. However, existing datasets focus mainly on authenticity, out-of-context mismatch, or manipulation type, and rarely capture how strongly an edit changes the likely interpretation of a post. We introduce T-IMPACT, a first-release severity-aware benchmark for manipulated news-style image-text pairs. T-IMPACT contains 98,786 examples spanning pristine, image-only, text-only, and joint manipulations, with a calibrated continuous severity signal, coarse low/medium/high labels, and supporting grounding metadata. Starting from a news image-text pair, the pipeline extracts semantic anchors, grounds them spatially, performs localized image edits and constrained caption rewrites, and calibrates contextual-impact scores using limited human ratings. In this release, the calibrated continuous score is the primary severity target, while the low/medium/high bands should be interpreted as coarse operating buckets rather than balanced classes. Experiments show that current models recover some authenticity signal, but severity prediction remains substantially harder and only weakly aligned with human judgment. T-IMPACT provides an initial benchmark for studying multimodal manipulation beyond binary real/fake classification toward graded contextual impact.

[CV-155] EmbodiedUS-FS: Fast Slow Intelligence for Ultrasound Robotics

Link: https://arxiv.org/abs/2606.22319
Authors: Fangzhuo Zhang,Xinyu Wang,Xiao Yang,Jinchang Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Robotic ultrasound scanning in real clinical environments requires both high-level clinical workflow reasoning and low-level closed-loop execution. Physicians' natural-language instructions often contain implicit anatomical targets, procedural logic, image-quality requirements, and safety constraints, while execution is affected by patient motion, contact variations, and target drift. We propose a fast-and-slow hierarchical embodied ultrasound system for safe and interpretable robotic ultrasound assistance. The Slow Brain performs intent parsing and stage-wise task planning with knowledge augmentation from an API and handbook corpus, and generates executable plans through task-graph construction and structured plan verification. The Fast Brain fuses multimodal feedback, including ultrasound images, robot pose and force states, and patient-motion information, to refine local actions and perform image-quality-guided recovery behaviors. The system further integrates a Safety Shield and a hierarchical escalation policy to constrain risky actions and trigger replanning or human confirmation under persistent failures or safety-bound violations. Experiments on planning evaluation, closed-loop execution under dynamic perturbations, and safety-mechanism validation demonstrate that the proposed hierarchical design improves task success rates while reducing safety violations.

[CV-156] Diffusion Integrated Gradients: Controllable Path Generation for Flexible Feature Attribution ECCV2026

Link: https://arxiv.org/abs/2606.22314
Authors: Soyeon Kim,Kyowoon Lee,Jaesik Choi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 44 pages, 22 figures, 10 tables. Accepted to ECCV 2026; includes appendix

Click to view abstract

Abstract:Path-based attribution methods such as Integrated Gradients (IG) are widely adopted for their strong axiomatic properties and effectiveness in attributing model predictions to input features by integrating gradients along a path from a baseline to the input. However, the choice of the attribution path largely affects the quality of explanations, and existing approaches rely on fixed or hand-crafted paths that often produce noisy or distorted attributions. To address this limitation, we propose Diffusion Integrated Gradients (DiffIG), a novel method that reformulates path generation as a conditional generative modeling problem. DiffIG first trains a diffusion model to learn a distribution over paths generated from a Stick-Breaking Process, then employs guided sampling to embed user guidance during the sampling procedure. We demonstrate that DiffIG quantitatively matches or outperforms existing path-based methods, achieving perceptually aligned explanations. This work introduces a new generative perspective for flexible, inference-time controllable Explainable Artificial Intelligence (XAI) methods.
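
For readers unfamiliar with the baseline that DiffIG builds on, the following is a minimal sketch of standard Integrated Gradients along a fixed straight-line path, not the learned paths proposed in the paper. The toy model `f`, its analytic gradient `grad_f`, and the midpoint Riemann-sum approximation are illustrative assumptions.

```python
import math

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Approximate IG along the straight line from baseline to x (midpoint rule)."""
    n = len(x)
    grad_sum = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            grad_sum[i] += g[i]
    # IG_i = (x_i - baseline_i) * average gradient along the path
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]

# Hypothetical toy model: f(x) = x0 * x1 + sin(x2)
def f(x):
    return x[0] * x[1] + math.sin(x[2])

def grad_f(x):
    return [x[1], x[0], math.cos(x[2])]

x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)

# Completeness axiom: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attributions) - (f(x) - f(baseline)))
```

The completeness check at the end is one of the axiomatic properties the abstract refers to; swapping the straight line for another path changes the individual attributions but not their sum.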

[CV-157] Towards Accurate and Robust Surveillance Roadside IVD via Trackletized Audio-Visual Reasoning

Link: https://arxiv.org/abs/2606.22299
Authors: Xiwen Li,Xiaoya Tang,Bodong Zhang,Tolga Tasdizen
Categories: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Idling Vehicle Detection (IVD) seeks to determine, at the final frame of a video clip, whether any vehicle is idling, meaning the vehicle is stationary with its engine running, using synchronized video from a remote surveillance camera and multichannel audio captured by spatially distributed wireless microphones along the roadside. Prior full-image, clip-level fusion approaches tend to overfit scene background and full-frame context, produce unstable temporal decisions, and lack an explicit spatial prior to align vehicles with microphones, which makes them brittle under domain shift and data inefficient. Instead, we introduce TAVR-IVD, an audio-visual framework guided by multi-object tracking. Our method detects vehicles, links detections into tracklets, and classifies each vehicle by operating on its tracklet. This design raises the effective signal-to-noise ratio, stabilizes temporal decisions through tracklets, enforces an explicit spatial prior to align vehicles with microphones, and adapts across domains with limited calibration annotations while remaining detector agnostic and efficient. To evaluate deployment robustness, we further curate two evaluation extensions, AVIVD-LT and AVIVD-M, covering inter-day and cross-site shifts.

[CV-158] Efficient Document Tampering Localization with Multi-Level Discrepancy Features and Unified DCT-Quantization Embedding ECCV2026

Link: https://arxiv.org/abs/2606.22285
Authors: Mohamed Dhouib,Ye Zhu,Sonia Vanier,Aymen Shabou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Click to view abstract

Abstract:Localizing document tampering is extremely challenging, as manipulations are crafted to appear visually consistent and often leave only subtle traces that are nearly invisible to the human eye. In prior work, evaluation has been largely dominated by synthetic benchmarks that closely match the training distribution, and methods have shown steady progress under this setting. However, these gains often translate poorly to human-made forgeries and to cross-domain evaluation, where both the source documents and the tampering pipeline can change, leading to a distribution shift. In addition, since the introduction of the Frequency Perception Head for the discrete cosine transform (DCT) modality, it has become a standard choice, and subsequent work has largely focused on downstream modules and fusion strategies rather than revisiting the backbone itself. To help close this gap in cross-domain performance and improve the DCT backbone design, we propose DiffNet, a relatively simple yet effective RGB–DCT early-fusion architecture driven by two key design choices. First, to ensure that the decoder aggregates multi-scale inconsistency evidence rather than operating on raw, content-heavy activations, we apply a lightweight multi-level discrepancy transformation at the output of each backbone stage, replacing features with magnitude-only responses to learned zero-sum filters. Second, we design an efficient DCT-domain backbone that relies on a lightweight frequency-index-aware DCT–quantization joint embedding. Our approach achieves state-of-the-art performance on cross-domain and human-made document tampering localization, outperforming prior methods by around 30%, with up to 7× higher throughput than the previous best model.

[CV-159] MultiMem: Measuring and Mitigating Memorization in Multi-Modal Contrastive Learning ECCV

Link: https://arxiv.org/abs/2606.22220
Authors: Wenhao Wang,Franziska Boenisch,Michael Backes,Adam Dziedzic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at The 19th European Conference on Computer Vision (ECCV), 2026

Click to view abstract

Abstract:Memorization in machine learning models enables high performance on rare in-distribution samples by capturing their atypical patterns. However, it also causes harmful retention of noise and outliers, degrading generalization. While memorization has been extensively studied in both supervised and self-supervised learning in the vision domain, it remains unexplored in multi-modal contrastive learning. We address this gap by introducing MultiMem, the first metric designed to quantify memorization in multi-modal contrastive learning. Through our systematic analysis, we demonstrate that cross-modal semantic misalignment has the strongest influence on memorization, with text being the dominant modality driving memorization, followed by video, image, and audio. We show that targeted augmentations applied across all modalities effectively reduce memorization as measured by our MultiMem metric and improve model performance. Overall, this work establishes the first framework for measuring and mitigating memorization in multi-modal contrastive learning, preventing harmful data retention and contributing to higher-performing models.

[CV-160] Multi4D: High-Fidelity Dynamic Gaussian Splatting via Multi-Level Competitive Allocation ECCV2026

Link: https://arxiv.org/abs/2606.22197
Authors: Rui Wang,Quentin Lohmeyer,Siyu Tang,Mirko Meboldt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026, project page: this https URL

Click to view abstract

Abstract:Dynamic 3D Gaussian splatting faces a fundamental tension between motion consistency and visual fidelity. Deformation-based approaches preserve temporal correspondence but suffer from motion over-factorization, oversmoothing high-frequency dynamics. In contrast, 4D-primitive methods capture fine visual details yet incur temporal overparameterization, breaking object identity and leading to severe storage overhead. To resolve this, we introduce Multi4D, a framework for high-fidelity dynamic Gaussian Splatting based on multi-level competitive allocation. Instead of a monolithic representation, we distribute modeling capacity across three structured levels: static structure, persistent dynamic geometry, and transient appearance primitives. Through shared rasterization and residual-driven optimization, these levels dynamically compete to explain photometric error, enabling adaptive specialization without pre-assigned decomposition. This allocation preserves long-term motion consistency while capturing fine dynamic detail, achieving state-of-the-art rendering quality and real-time performance with significantly fewer dynamic primitives. Furthermore, because our representation explicitly tracks compact persistent Gaussians over time, semantic features can be embedded afterward, enabling Multi4D to achieve state-of-the-art 4D segmentation accuracy with an order-of-magnitude speedup. Project page: this https URL

[CV-161] Resolving Multi-Target Association in OFDM-based ISAC via Vision-aided Multi-Modal Learning

Link: https://arxiv.org/abs/2606.22195
Authors: Meng Hua,Chenghong Bian,Deniz Gunduz
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Orthogonal frequency division multiplexing (OFDM)-based integrated sensing and communication (ISAC) systems commonly extract target parameters by peak-searching a delay-Doppler map (DDM) constructed from reflected pilots. In multi-target scenarios, this results in ambiguity: the DDM does not reveal which physical target produced which peak, and two targets within the same delay-Doppler resolution cell cannot be separated. We propose a vision-assisted OFDM-ISAC framework that resolves both limitations by fusing wireless and visual modalities. The transmitter encodes an onboard street-view image with deep joint source-channel coding (DeepJSCC) and transmits it over the same OFDM waveform used for sensing; the receiver reconstructs the image, runs a fine-tuned YOLOv5 detector, and fuses the resulting per-target features (bounding-box coordinates and class labels) with the DDM and transmitter-receiver geometry through a learned multi-modal network. To stabilize training of the high-dimensional delay and Doppler classifiers, we introduce a Kullback-Leibler loss against triangular soft labels centered on the ground-truth bin. On a Blender-rendered vehicular testbed, the proposed framework achieves a 16 cm localization root mean square error (RMSE) and a 10.8 ns delay RMSE. An ablation study confirms that removing the visual modality causes a 60x degradation in localization. These results highlight the potential of vision to overcome the data-association and resolution limits of single-modality ISAC.
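
The Kullback-Leibler loss against triangular soft labels mentioned above could look roughly like the sketch below. The bin count, the triangle half-width, and the helper names (`triangular_soft_label`, `kl_loss`) are assumptions made for illustration; the abstract does not specify them.

```python
import math

def triangular_soft_label(num_bins, true_bin, half_width=2):
    """Triangle-shaped distribution peaked at true_bin (half_width is an assumed value)."""
    weights = [max(0.0, 1.0 - abs(i - true_bin) / (half_width + 1))
               for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_loss(logits, target):
    """KL(target || softmax(logits)); bins with zero target mass contribute nothing."""
    log_q = log_softmax(logits)
    return sum(p * (math.log(p) - lq) for p, lq in zip(target, log_q) if p > 0.0)

# Example: 8 delay bins, ground truth in bin 3.
target = triangular_soft_label(8, 3)
loss_near = kl_loss([5.0 if i == 3 else 0.0 for i in range(8)], target)
loss_far = kl_loss([5.0 if i == 6 else 0.0 for i in range(8)], target)
```

Compared with a one-hot target, the soft label gives partial credit to predictions in neighboring bins, which is one plausible reason such a loss stabilizes training over many classes.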

[CV-162] Dual-Stream EEG Decoding for 3D Visual Perception NEURIPS2025

Link: https://arxiv.org/abs/2606.22182
Authors: Ninon Lizé Masclef,Taisija Demcenko,Antonella Catanzaro,Nataliya Kosmyna
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures. Accepted at the Symmetry and Geometry in Neural Representations Workshop (NeurReps), NeurIPS 2025. To appear in Proceedings of Machine Learning Research (PMLR)

Click to view abstract

Abstract:This paper explores a novel brain decoding model for 3D shape perception through a dual pathway architecture mirroring biological vision. Our bio-inspired approach implements separate decoding modules for object identity and spatial orientation, inspired by ventral and dorsal pathways, during continuous rotations. We employ circular regression for angle prediction and develop EEG-conditioned multiview diffusion for 3D reconstruction. Our approach successfully decodes both object identity and spatial orientation from EEG signals and enables 3D reconstruction from neural activity, with interpretability analyses revealing temporally structured involvement of ventral, dorsal, and motor-related channels rather than a static ventral dominance in supporting object and angle decoding.
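
One common way to set up circular regression for angle prediction is to regress the sine and cosine of the angle and measure error on the circle. The sketch below assumes that encoding; the abstract does not state which formulation the authors use, and the function names are hypothetical.

```python
import math

def encode_angle(theta_deg):
    """Map an angle to a (sin, cos) regression target, avoiding the 0/360 discontinuity."""
    r = math.radians(theta_deg)
    return (math.sin(r), math.cos(r))

def decode_angle(sin_v, cos_v):
    """Recover an angle in [0, 360) from (possibly unnormalized) sin/cos predictions."""
    return math.degrees(math.atan2(sin_v, cos_v)) % 360.0

def circular_error(pred_deg, true_deg):
    """Smallest absolute angular difference in degrees, in [0, 180]."""
    diff = (pred_deg - true_deg + 180.0) % 360.0 - 180.0
    return abs(diff)
```

With this error, a prediction of 359 degrees for a true angle of 1 degree counts as 2 degrees off rather than 358, which matters for objects under continuous rotation.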

[CV-163] From Convolution to Transformer: A Comparative Study of U-Net Variants for Brain Tumor and Retinal Vessel Segmentation

Link: https://arxiv.org/abs/2606.22168
Authors: Khoa Pham,Sindhuja Penchala,Jiacheng Li,Andy Perkins,Noorbakhsh Amiri Golilarz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Medical image segmentation plays an important role in computer-aided diagnosis, treatment planning, and disease monitoring. U-Net has been widely used for biomedical image segmentation because of its encoder-decoder structure and skip connections. However, conventional convolution-based U-Net models may have limited ability to capture long-range dependencies and global contextual information, which can affect performance in complex segmentation tasks. This paper presents a comparative study of five U-Net-based architectures: U-Net 3D, Residual U-Net, Attention U-Net, UNETR, and Swin UNETR. The models are evaluated on two benchmark datasets: BraTS 2023 for brain tumor segmentation and DRIVE for retinal vessel segmentation. Experimental results show that Swin UNETR achieves the best overall performance, with Dice scores of 0.8965 on BraTS 2023 and 0.8078 on DRIVE. The results suggest that transformer-based U-Net variants are effective for segmentation tasks requiring global contextual modeling, while residual learning remains useful for fine structure segmentation. This study provides practical insights into model selection for medical image segmentation across volumetric MRI and retinal imaging tasks.
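
As a reminder of what the reported Dice scores measure, here is a minimal sketch of the Dice coefficient for binary masks. Flattened 0/1 lists and the convention of returning 1.0 for two empty masks are assumptions of this sketch; the paper's exact evaluation protocol is not given in the abstract.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / denom

# Prediction covers two pixels, ground truth covers one of them.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Dice weights overlap against the combined size of both masks, so thin structures such as retinal vessels are penalized heavily for small boundary errors.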

[CV-164] Improving Reasoning in Vision-Language Models via Perception Verified Self-Training

Link: https://arxiv.org/abs/2606.22158
Authors: Sourabh Sharma,Sonam Gupta,Sadbhawna Thakur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Achieving human-like reasoning in Vision-Language Models (VLMs) remains a long-standing challenge. Recent approaches leverage Chain-of-Thought (CoT) rationales generated by human annotators or proprietary models to improve reasoning, which is costly and difficult to scale. Self-training offers a promising alternative by using a model's own outputs as supervision. However, existing methods often suffer from visual hallucinations, where rationales describe non-existent visual content, and from language shortcuts, where predictions rely on textual priors rather than true visual grounding. Both arise because rationales are typically filtered only by answer correctness, without verifying visual perception. To address this limitation, we propose a perception-verified self-training framework that enforces visually grounded reasoning. First, our method employs a CoT template (caption-reasoning-conclusion) that disentangles perception from reasoning, enabling independent verification of visual understanding. To compensate for the absence of ground-truth captions, we propose PerceptEval, an unsupervised method that evaluates caption quality based on its alignment with visual and textual elements present in the image. Using caption verification together with answer correctness, we partition the data into three subsets: easy (correct caption and conclusion), medium (correct caption but incorrect conclusion), and hard (incorrect caption). Building on this partitioning, we design a two-stage curriculum learning strategy. In Stage 1, the model is trained on easy examples; in Stage 2, medium samples are incorporated through a caption-guided reasoning enhancement procedure that regenerates reasoning conditioned on verified captions. Only regenerated samples with correct conclusions are retained.
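
The three-way data partition and the Stage 2 retention rule described above can be sketched as follows. The record fields (`caption_ok`, `answer_ok`, `regen_answer_ok`) are hypothetical booleans standing in for the PerceptEval verdict and the answer checks; they are not names from the paper.

```python
def partition_samples(samples):
    """Split self-generated rationales into easy / medium / hard buckets."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for s in samples:
        if not s["caption_ok"]:
            buckets["hard"].append(s["id"])      # incorrect caption
        elif s["answer_ok"]:
            buckets["easy"].append(s["id"])      # correct caption and conclusion
        else:
            buckets["medium"].append(s["id"])    # correct caption, wrong conclusion
    return buckets

def stage2_additions(samples, medium_ids):
    """Keep only medium samples whose regenerated reasoning reaches the correct conclusion."""
    medium = set(medium_ids)
    return [s["id"] for s in samples
            if s["id"] in medium and s.get("regen_answer_ok", False)]

samples = [
    {"id": "a", "caption_ok": True, "answer_ok": True},
    {"id": "b", "caption_ok": True, "answer_ok": False, "regen_answer_ok": True},
    {"id": "c", "caption_ok": True, "answer_ok": False, "regen_answer_ok": False},
    {"id": "d", "caption_ok": False, "answer_ok": True},
]
buckets = partition_samples(samples)
stage2 = stage2_additions(samples, buckets["medium"])
```

Note that sample `d` lands in the hard bucket even though its answer is correct: a right answer built on a wrong caption is exactly the language-shortcut case the framework aims to filter out.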

[CV-165] Failure Analysis in Transition: An Industry Survey of Challenges, Priorities, and Standardization Needs in Advanced Packaging and Heterogeneous Integration

Link: https://arxiv.org/abs/2606.22149
Authors: Himanandhan Reddy Kottur,Nusra Akter Takia,Mahamudul Hassan Fuad,Istiaq Firoz Shiam,Matthew Walsh,Navid Asadizanjani
Categories: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Failure analysis is being reshaped by heterogeneous integration, chiplet-based architectures, hybrid bonding, backside technologies, and increasingly buried package structures. To examine how practitioners view this transition, an anonymous survey was distributed across a broad set of organizations involved in semiconductor design, packaging, systems, tools, and failure analysis. The survey collected approximately one hundred responses and probed organizational background, supported product domains, future priorities in failure analysis, critical bottlenecks, sample preparation challenges, emerging architecture-specific pain points, and perceived needs for workflow acceleration and data standardization. The results show that heterogeneous integration, chiplet, and three-dimensional products dominate the respondent base at 69%, while package heterogeneous integration failure analysis received the highest importance rating at 7.92 out of 10. Hybrid bonding emerged as the most difficult new architecture to analyze at 54%, higher-resolution non-destructive imaging ranked as the most important future accelerator at 8.18 out of 10, and 83% of respondents supported formalized data standardization frameworks. The complete survey data are provided in Appendix A (Table II) to improve transparency and support future benchmarking.

[CV-166] SAGE: An Expert-Annotated South Asian GI Endoscopy Dataset for Multimodal Learning and Hallucination Analysis

Link: https://arxiv.org/abs/2606.22144
Authors: Niyoj Oli,Sachin Acharya,Sandesh Pokhrel,Sanjay Bhandari,Ramesh Rana,Nikesh Mani Shrestha,Ram Bahadur Gurung,Yash Raj Shrestha,Prashnna K Gyawali,Binod Bhattarai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Gastrointestinal cancers represent a growing health burden in the South Asian region, driven largely by rapid changes in socio-economic conditions and lifestyle habits. However, early diagnosis of such malignancies remains a significant challenge, largely due to a lack of modern equipment, a lack of financial support, and a scarcity of GI experts. AI-assisted diagnostic report generation shows great promise in alleviating this problem by providing low-skill manpower with the technical expertise to perform diagnosis. However, almost all open-source, publicly available datasets are predominantly collected from the European region, with no representation from the South Asian region. The lack of open-source GI datasets from diverse geographic regions has made it difficult to assess whether population bias is present in existing models, and to develop geographically inclusive AI tools for automated GI diagnosis. To address this gap, we introduce SAGE: An Expert-Annotated South Asian GI Endoscopy dataset for image captioning, multi-label classification, and visual question answering (VQA) tasks. It consists of 1,300 images, their captions along with hallucination tags, 18 labels, and 14,726 question-answer pairs, making it well-suited for a diverse range of tasks including classification, benchmarking, and fine-tuning large multimodal models (LMMs). We further benchmark multi-class classifiers on the effect of population shift in GI imaging AI tasks, and contemporary LMMs on their performance. Our study reveals that task-specific models, such as multi-class classification models, suffer the most, with an average performance drop of 58% when evaluated on the South Asian dataset. For contemporary LMMs, benchmarking reveals a substantial drop in the average GREEN score for anatomical landmark detection (0.308) and abnormality detection (0.410).

[CV-167] Feed-forward Motion In-betweening for Any 4D

Link: https://arxiv.org/abs/2606.22131
Authors: Hiroki Nishizawa,Hubert P. H. Shum,Yoshihiro Fukuhara,Hirokatsu Kataoka,Shigeo Morishima
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Video: this https URL

Click to view abstract

Abstract:4D dynamics (3D geometry evolving over time) is a fundamental representation of the physical world and plays a crucial role in world modeling (e.g., animation and games). Owing to the scarcity of large-scale, long-horizon 4D mesh data with arbitrary shapes, early text-to-4D methods rely on distillation or test-time optimization from video diffusion priors, making inference prohibitively slow. Recent feed-forward generators greatly reduce inference cost but offer limited spatiotemporal controllability, and short-horizon generation often leads to error accumulation in long-horizon sequences. We propose a novel feed-forward in-betweening framework for arbitrary 4D meshes with keyframe conditioning. Building on universal mesh-animation latents, we introduce a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh for keyframe conditioning. We further introduce a keyframe-conditioned rectified flow model with an MMDiT backbone that synthesizes non-keyframe frames conditioned on sparse keyframes. Experiments show strong performance and improved controllability on both DyMesh16 and DyMesh32 benchmarks.

[CV-168] Surgical Anatomy Recognition with Context Learning using Foundation Representations MICCAI2026

Link: https://arxiv.org/abs/2606.22124
Authors: Ronald L. P. D. de Jong,Tim J. M. Jaspers,Raf A. H. Vervoort,Aron F. H. A. Bakker,Yiping Li,Jip L. Tolenaar,Jelle P. Ruurda,Willem M. Brinkman,Josien P. W. Pluim,Marcel Breeuwer,Daan de Geus,Fons van der Sommen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Provisionally accepted for presentation at MICCAI 2026

Click to view abstract

Abstract:Accurate recognition of anatomical structures is essential for safe and effective minimally invasive surgery (MIS), yet it remains underexplored in surgical computer vision due to limited annotated data and methods tailored primarily to natural scenes. In this work, we present a combined dataset and model framework to advance anatomy-aware perception in MIS. First, we introduce ATLAS-120k, a large-scale clip-level semantic segmentation dataset comprising over 120,000 annotated frames from 100 surgical videos spanning 14 procedures and multiple modalities, including laparoscopic and robot-assisted surgery. The dataset captures substantial procedural variability and was created using a scalable annotation pipeline that integrates expert manual labeling, automated propagation, iterative refinement, and surgeon verification to ensure high-quality annotations. Second, we propose ATLAS (Anatomy Recognition with Context Learning using Foundation Representations), a video semantic segmentation model specifically designed for surgical anatomy recognition. Unlike conventional approaches that emphasize object tracking, ATLAS leverages foundation-model embeddings together with lightweight temporal reasoning to incorporate contextual cues such as procedure type, surgical phase, and short-term visual memory. This design enables temporally consistent and accurate predictions while maintaining real-time feasibility. Together, the dataset and model establish a practical foundation for robust surgical scene understanding and support the development of clinically applicable guidance systems for minimally invasive surgery. The models, dataset annotations and annotation platform are publicly available at: this https URL.

[CV-169] Accurate identification and measurement of the precipitate area by two-stage deep neural networks in novel chromium-based alloys

Link: https://arxiv.org/abs/2606.22112
Authors: Zeyu Xia,Kan Ma,Sibo Cheng,Thomas Blackburn,Ziling Peng,Kewei Zhu,Weihang Zhang,Dunhui Xiao,Alexander J Knowles,Rossella Arcucci
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 18 pages, 11 figures. Published in Phys. Chem. Chem. Phys

Click to view abstract

Abstract:The performance of advanced materials for extreme environments is underpinned by their microstructure, including the size and distribution of reinforcing phases. Chromium-based superalloys are a recently proposed alternative to conventional face-centred-cubic superalloys for high-temperature applications, such as Concentrated Solar Power, and their development requires efficient measurement of precipitate volume fraction and size distribution from electron microscopy images. Traditional fixed-threshold image processing is sensitive to background noise, generalises poorly across materials, and requires substantial manual measurement effort. To address these bottlenecks, this study proposes DT-SegNet, an end-to-end two-stage deep learning scheme based on YOLOv5 and SegFormer for object detection and segmentation in electron microscopy images. The approach combines the training efficiency of convolutional neural networks at the detection stage with the segmentation accuracy of a Vision Transformer. Numerical experiments show that DT-SegNet substantially outperforms state-of-the-art segmentation tools offered by Weka and ilastik across metrics including accuracy, precision, recall, and F1-score. The model provides a useful tool for alloy-development microstructure examinations and helps address the large datasets associated with high-throughput alloy development.

[CV-170] OphthaDT: Generative Digital Twins for Forecasting Visual Acuity Trajectories in Ophthalmology

Link: https://arxiv.org/abs/2606.22101
Authors: Pietro Belligoli,Nikita Makarov,Sayedali Shetab Boushehri,Fabian Schmich,Raul Rodriguez-Esteban,Michael Menden
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Precision medicine in ophthalmology requires accurate longitudinal predictions, but the fragmented nature of multimodal clinical data remains a barrier to forecasting. We introduce OphthaDT, an LLM-based digital twin for ophthalmology that serializes longitudinal patient histories from 3,220 patients across four Phase III clinical trials into structured narratives to forecast best corrected visual acuity (BCVA). In benchmarks spanning up to 100 weeks, OphthaDT demonstrated the lowest prediction error in neovascular age-related macular degeneration (nAMD), achieving an average mean absolute error (MAE) reduction of 6.0% compared to all baselines. In diabetic macular edema (DME), OphthaDT demonstrated competitive performance against all baselines while outperforming Random Forest and XGBoost by an average MAE reduction of 2.6% and 6.9%, respectively. Results reveal that OphthaDT’s predictive advantage scales with trajectory complexity: whereas linear models remain effective for the more stable treatment responses of DME, OphthaDT’s capacity is better suited for capturing the high longitudinal variability of nAMD. Finally, OphthaDT handles irregular sampling without imputation, positioning LLM-based clinical trajectory modeling as a methodology that could reduce patient burden and accelerate drug development.

[CV-171] Cross-View Yaw Estimation in Location Uncertainty with Line-Aligning Yaw Scoring

Link: https://arxiv.org/abs/2606.22094
Authors: Taeho Kang,Nairan Zhang,Yelin Kim,Yujiao Shi,Youngki Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 15 figures

Click to view abstract

Abstract:Accurate yaw estimation is a bottleneck in cross-view localization between ground view and Bird’s Eye View (BEV). Existing methods couple yaw with translation and rely on height or projection assumptions that degrade under large yaw ambiguity. We disentangle yaw from location accuracy and introduce LAYS, a radially invariant line-consensus voting method. By exploiting the radial invariance of our formulation, we achieve sub-degree yaw precision via 3D voting over all candidate poses, while eliminating the need for accurate location. Our key observation is that a ground-image column matched to BEV pixels induces the same yaw across all camera positions along the radial direction of the pixels. LAYS matches BEV pixels to ground columns using feature similarity and accumulates the induced yaw votes into discrete 3D bins, where correct correspondences along the radial line concentrate into a sharp peak for the correct yaw. Experiments on Mapillary, Ford, KITTI, and VIGOR show significant gains under unknown yaw, particularly for normal FoV with unknown yaw (+28 to 45 percentage points), and using LAYS as a yaw prior improves downstream 3-DoF localization.

[CV-172] BAC-JEPA: Label-Efficient Breast Arterial Calcification Segmentation via Synthetic Mammography-Guided Supervision

Link: https://arxiv.org/abs/2606.22089
Authors: Scott Chase Waggener,Lakshman Tamil
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Breast arterial calcification (BAC) on screening mammograms is an emerging cardiovascular risk biomarker, but quantitative use requires reproducible segmentation and expert pixel-level labels are costly. We present BAC-JEPA, a label-efficient segmentation framework trained on procedurally generated arterial calcification inserted into real mammographic backgrounds with exact masks. Candidate backgrounds were selected from model-screened mammograms with low predicted BAC response; the generator samples arterial structure, disease burden, radiographic appearance, and hard-negative distractors including nonarterial calcifications and metallic objects. Synthetic masks are paired with mammography self-supervised Vision Transformer encoders and a high-resolution convolutional decoder to produce full-resolution segmentation maps. The study used 75,472 mammography studies from 34,956 patients for background selection and representation learning, trained on synthetic images from 10,000 backgrounds, selected checkpoints with 1,000 development backgrounds, and evaluated transfer on all 1,000 human-labeled BacSeg synthetic 2D mammograms. On held-out synthetic validation data, the larger backbone achieved IoU 0.5325 and Dice 0.6357. On BacSeg, image-level classification from segmentation probability maps reached AUROC 0.8719, with 0.8547 for the smaller backbone. Four-view inference required 110.68–213.63 ms on an RTX 5090 GPU, and severe-preset synthetic image generation averaged 2.7071 s per image on a multicore workstation. These results indicate that BAC-specific synthetic supervision can produce useful image-level transfer without human pixel-level training masks, while expert-reviewed real-mammogram segmentation remains necessary for clinical validation and calibration.

[CV-173] Morphology-Aware Multimodal Representation Learning for Insect Phylogenetic Reconstruction

Link: https://arxiv.org/abs/2606.22077
Authors: Zixuan Liu,Kaijie Yu,Chun He,Xiaoxu Cai,Xinhai Ye,Haishuai Wang,Gongyin Ye,Jiajun Bu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures, and 2 tables

Click to view abstract

Abstract:Morphological traits provide important evidence for phylogenetic reconstruction and evolutionary relationship analysis. Recent image-based approaches have introduced deep learning, particularly convolutional models, to derive morphological features from specimen images, but these methods generally rely on single-modality visual representations and do not explicitly incorporate morphological semantics. This study proposes a morphology-aware multimodal alignment framework for insect phylogenetic reconstruction. The framework combines specimen images with curated morphological descriptions by adapting a vision transformer through parameter-efficient fine-tuning and supervised contrastive learning, followed by image-text alignment in a shared latent space. The learned image embeddings are then used as continuous traits for Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across multiple visual backbones and feature adaptation strategies demonstrate that multimodal alignment improves topological agreement with the reference phylogeny. The results indicate that the proposed framework can derive morphology-aware visual traits for computational phylogenetic reconstruction.

[CV-174] Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation

Link: https://arxiv.org/abs/2606.22076
Authors: Jiahong Chen,Jinghao Wang,Ziwen Wang,Zi Wang,Banglei Guan,Qifeng Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures

Abstract:Single-reference unseen object 6D pose estimation reduces object onboarding by estimating poses of arbitrary novel objects from only one reference view. Recent correspondence-based pipelines have achieved robust performance with vision foundation model (VFM) features. However, they typically treat these features as intra-view descriptors, leaving dense visual-semantic cues, including appearance, structure, and context, insufficiently exchanged across views before geometric decoding. Consequently, the decoded point features may lack joint semantic and geometric discriminability, making correspondence estimation still difficult in challenging cases. Instead of processing features independently, we build the correspondence pipeline around an early cross-view semantic prior. Specifically, cross-view semantic interaction (CVSI) enables dense query and reference VFM tokens to exchange semantic context and form a cross-view prior. Nevertheless, direct CVSI may disturb the VFM token structure, while the resulting semantic prior still needs 3D representation consistency for rigid correspondence. To make this CVSI prior reliable for 3D correspondence learning, we introduce two complementary training-time constraints: the intra-view structure preservation (IVSP) loss preserves the original intra-view token affinity structure during interaction, while the reference-anchored geometric consistency (RAGC) loss enforces spatial representation consistency of decoded point features. The final pose is recovered from learned correspondences through weighted SVD. We further construct a challenging view-pair protocol from the BOP Challenge datasets YCB-V and TUD-L to evaluate robustness in difficult matching scenarios. Extensive experiments on six benchmarks under different view-pair settings show that our method achieves state-of-the-art performance while maintaining comparable inference speed.

[CV-175] A Controlled Study of CLIP-Based Body-Scene Fusion for Emotion Recognition in Context

Link: https://arxiv.org/abs/2606.22072
Authors: Zubair Abbas,Muhammad Umair,Muqaddas Hameed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures, 6 tables

Abstract:Apparent emotion in natural images is often not visible from the face alone. The face may be small, hidden, or neutral, while posture and scene context carry much of the evidence. This work studies context-aware emotion recognition on EMOTIC with an image-only two-stream model. A ResNet-18 body stream encodes the target-person crop, and a CLIP ViT-B/16 scene stream encodes the full image. The fused feature predicts 26 categorical emotion labels and the continuous valence, arousal, and dominance values. This study examines whether small context-debiasing or rare-class training changes still help after adding a CLIP scene encoder. The clean two-stream model is compared with simplified CCIM-style intervention, CLEF-lite context-bias subtraction, ASL tuning, and class-balanced sampling under the same implementation pipeline. No tested variant improves over the clean two-stream model, which achieves 34.52% mAP on the EMOTIC test split. CLIP gives the model broad scene semantics, but the simplified causal, counterfactual, and rare-class changes do not automatically improve performance. Most remaining errors are in rare and subtle emotion categories, so the next step should focus on label relationships and finer subject-context interaction.

[CV-176] When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR

Link: https://arxiv.org/abs/2606.22043
Authors: Zekun Xu
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures

Abstract:Reinforcement learning with verifiable rewards (RLVR) is increasingly applied to large vision-language models (LVLMs), yet outcome-only optimization can drive a model to stop attending to the video and instead exploit linguistic priors – a failure we call a visual shortcut. While the existence of such perception bypass is by now documented, how it forms, whether it can be undone, and when intervention still helps remain open. We treat the strength of a grounding penalty, lambda, as a control knob and characterize the formation-reversal dynamics of visual shortcuts along the training time axis. On a held-out, out-of-distribution diagnostic set, we find: (i) a sharp onset – shortcut reliance emerges abruptly over a narrow window of optimization steps and is robust across random seeds; (ii) a monotone dose-response – increasing lambda progressively suppresses the shortcut, and at an intermediate dose the trajectory first forms and then reverses the shortcut, exposing a hysteresis-like asymmetry between acquiring and removing it; and (iii) a critical intervention window – applying the penalty before onset arrests shortcut formation, whereas the same penalty applied after consolidation is markedly less effective. Together these results recast visual-shortcut collapse not as a binary defect but as a controllable, time-dependent, and asymmetric process, with direct implications for when and how strongly to regularize multimodal RLVR.

[CV-177] IDAG-Edit: Multi-Object Video Editing via Instance-Decoupled Attention and Guidance

Link: https://arxiv.org/abs/2606.22042
Authors: Yuan-Zhih Lin,Huu-Thang Nguyen,Huu-Phu Do,Hong-Han Shuai,Ching-Chun Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion-based video editing has made significant progress; however, achieving precise and temporally consistent object-level control, especially in multi-object scenarios, remains challenging due to attention leakage, identity drift, and unstable temporal dynamics. In this work, we propose IDAG-Edit, a training-free framework for fine-grained multi-object video editing with strong temporal consistency. The framework adopts Layout-guided Attention Modulation to facilitate coherent multi-object editing, while Instance-level Masks are introduced to preserve individual object identity and enforce localized attention within each object region, thereby enabling fine-grained, object-level editing. Extensive qualitative and quantitative evaluations demonstrate that our method improves temporal stability and multi-object controllability over state-of-the-art video editing approaches.

[CV-178] Topological summaries of fingerprint ridge patterns carry identity information

Link: https://arxiv.org/abs/2606.22029
Authors: Chad M. Topaz,Niny Arcila-Maya,Elizabeth Munch,Zofia Stanley,Lori Ziegelmeier
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments:

Abstract:Fingerprints are the most widely deployed biometric. Verifying whether two impressions come from the same finger typically relies on minutiae, small landmarks such as skin ridge endings and bifurcations. These landmarks are extracted through a multi-stage pipeline of image enhancement, skeletonization, minutiae detection, and alignment. We investigate an alternative: using topological data analysis to represent the full pattern of skin ridges and valleys directly, bypassing minutiae detection and the downstream matching pipeline. We apply persistent homology, a topological tool that tracks how loops in the ridge pattern form and fill in across spatial scales, producing multi-scale summaries of ridge geometry. We develop and compare a range of verification methods on a standard benchmark dataset, FVC2000 DB1. Even the simplest topological summaries, with no trained parameters, substantially outperform geometry-only baselines. A trained method achieves an AUC of 0.91, while an optimal-transport method excels at the strictest false-accept thresholds, suggesting they capture different aspects of the ridge pattern. Fusing these two approaches yields the best performance at every low false-accept threshold we examine. Our results establish that these topological summaries capture substantial fingerprint identity information, far more effective for verification than raw pixel-level geometry. Because the entire pipeline is openly specified, it offers a transparent complement to minutiae-based systems, and we provide a modular framework for constructing, evaluating, and combining topological verification methods.

[CV-179] One-Shot Data Selection for Medical Image Classification via Graph Coverage MICCAI2026

Link: https://arxiv.org/abs/2606.22002
Authors: Zahiriddin Rustamov,Nadia Badawi,Rafat Damseh,Nazar Zaki
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at MICCAI 2026

Abstract:Training medical image classifiers on entire datasets is wasteful when annotation budgets are limited: not all samples contribute equally, yet acquiring expert labels is expensive. Active learning reduces annotation cost through iterative querying, but assumes repeated access to an oracle and requires multiple rounds of model training. One-shot geometry-based methods such as facility location avoid retraining but operate on pairwise distances that ignore the local structure of the data manifold. We propose a graph-based one-shot selection method that operates entirely on frozen foundation model embeddings. Given embeddings from a pretrained encoder, we construct a k-nearest neighbor graph over all training samples and derive a two-term coverage kernel from the heat diffusion kernel, capturing both direct and two-hop neighborhood relationships. Greedy facility location on this kernel selects class-balanced subsets that maximize coverage of the data manifold. The two-term kernel matches the full spectral heat kernel in selection behavior while reducing computation to sparse matrix operations with a single hyperparameter. We evaluate on five MedMNIST datasets spanning histopathology, radiology, and microscopy, comparing against both training-dynamics and geometry-based baselines. Our method achieves the highest balanced accuracy on nine of ten dataset-ratio conditions, with the largest gains on class-imbalanced datasets where global graph construction captures cross-class structure that per-class methods miss, all without any model training during selection. Code is available at this https URL.
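
The selection step described above (a two-term coverage kernel on a k-nearest neighbor graph, followed by greedy facility location) can be sketched in a few lines. This is an illustrative sketch only: the exact kernel in the paper is not given in the abstract, so the form K = A + (t/2)·A², taken from the low-order terms of a truncated heat-kernel expansion, the row normalization, the unit self-coverage on the diagonal, and all function names are assumptions. Class balancing is omitted.

```python
def two_term_kernel(adj, t=0.5):
    """Assumed kernel: K = A + (t / 2) * A @ A on a row-normalized adjacency A."""
    n = len(adj)
    norm = []
    for row in adj:
        s = sum(row)
        norm.append([v / s if s else 0.0 for v in row])
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Direct (one-hop) term plus a down-weighted two-hop term.
            two_hop = sum(norm[i][k] * norm[k][j] for k in range(n))
            kernel[i][j] = norm[i][j] + 0.5 * t * two_hop
        # Assumption: a selected sample always covers itself fully.
        kernel[i][i] = max(kernel[i][i], 1.0)
    return kernel


def greedy_facility_location(kernel, budget):
    """Greedily pick samples maximizing sum_i max_{j in S} kernel[i][j]."""
    n = len(kernel)
    covered = [0.0] * n  # best coverage of each sample by the selected set
    selected = []
    for _ in range(min(budget, n)):
        best_j, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            gain = sum(max(kernel[i][j] - covered[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        covered = [max(covered[i], kernel[i][best_j]) for i in range(n)]
    return selected


# Toy graph: a triangle (nodes 0-2) and a separate pair (nodes 3-4).
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
kernel = two_term_kernel(adj)
picked = greedy_facility_location(kernel, budget=2)
print(picked)  # one sample from each component: [0, 3]
```

With a budget of two, the greedy step takes one sample per connected component, which is the coverage behavior the abstract attributes to graph-based selection.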

[CV-180] From Driving Videos to Simulatable Scenarios ITSC

Link: https://arxiv.org/abs/2606.21993
Authors: Alexandre Levy,Ernest Valveny Llobet,Antonio Manuel López
Categories: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 11 figures. Accepted for publication at the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2026

Abstract:Autonomous vehicles (AVs) face driving scenarios ranging from routine traffic to rare events. To assess safety, it is crucial to reproduce these scenarios in a controllable, repeatable, and scalable manner, with simulation playing a key role. This paper introduces D-V2S, a novel framework that automatically generates simulatable driving scenarios from driving videos. D-V2S operates in two stages: a Driving Record Analyzer (DRA) uses a vision language model (VLM) with our designed prompt to produce natural-language descriptions from input videos, capturing road layouts and dynamic traffic interactions; subsequently, a Scenario Generator (SG) uses a large language model (LLM) and our conditioning context to translate these descriptions into executable scenarios. Using simulations, we show that D-V2S generates scenarios where 90% of the relevant semantic elements of the videos are present. We also provide qualitative results demonstrating D-V2S’s capability to transform real-world driving videos into simulatable scenarios. Moreover, we provide both semantic and human-driven ablation analyses of D-V2S’s modules. In particular, we show how the VLM choice matters for DRA, and how our SG achieves a 75% preference rate over other state-of-the-art methods.

[CV-181] CoDMD: Copula-aware Distribution Matching Distillation for Fast Video Generation

Link: https://arxiv.org/abs/2606.21982
Authors: Wenhu Zhang,Kun Cheng,Changyuan Wang,Shiyao Li,Yuechen Zhang,Wenbo Li,Jiajun Zha,Jingyi Zhang,Kang Zhao,Jiaya Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Few-step distillation for video diffusion models has attracted significant attention, driven by the urgent demand for efficient deployment in real-world scenarios. However, Distribution Matching Distillation (DMD), a leading paradigm, tends to degrade under limited NFE budgets, manifesting in video generation as layout instability, oversaturation, and broken motion dynamics. We trace this failure to a structural limitation: standard DMD is an intra-sample distribution-matching objective with coordinate-wise gradients, and thus imposes no explicit constraint on the relational geometry across batch elements or temporal frames, leaving the underlying copula largely unregulated. Combined with the mode-seeking tendency of its reverse-KL objective, this absence of relational guidance makes DMD prone to collapsing into local optima in the few-step regime. Motivated by this insight, we propose Copula-aware DMD (CoDMD), a lightweight relational regularizer that reuses score estimates already produced by the frozen teacher and the online fake model to construct pairwise relation matrices across samples and frames. These are matched through a supplementary distributional objective that requires no additional networks, datasets, or sampling trajectories. On the Wan-2.1-T2V model series at the 1.3B / 14B scales, CoDMD distills 50-step teachers into 4-step students, achieving an approximate 25× speed-up while attaining VBench scores of 84.46 / 84.87, outperforming prior trajectory-based (rCM 82.81 / 84.05) and distribution-based (DMD 83.38 / 83.81) methods.

[CV-182] Denoising-Enhanced Coarse-to-Fine Infrared Small Target Detection with Attention Prior-Guided Knowledge Distillation ECCV2026

Link: https://arxiv.org/abs/2606.21956
Authors: Houzhang Fang,Ruixuan Huang,Qiuhuan Chen,Xiaolin Wang,Yi Chang,Luxin Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Abstract:Infrared small target detection (IRSTD) in high-resolution images is crucial for many practical applications, such as surveillance of unmanned aerial vehicles (UAVs) and UAV-based ground monitoring. However, IRSTD remains challenging due to the small size and weak features of targets, as well as significant interference from complex dynamic backgrounds. Existing detection methods often suffer from redundant computations on non-target background regions and insufficient exploitation of target context information, which limits their performance in complex backgrounds. To address these issues, we propose an efficient coarse-to-fine infrared small target detection framework with attention prior-guided knowledge distillation, termed ECFNet. In the coarse stage, we design a region binary classification network (RBCN) on grid-based multi-scale feature maps to efficiently recognize target-containing context region proposals while suppressing complex backgrounds. Moreover, we introduce a novel denoising-assisted training strategy that incorporates noisy ground-truth (GT) masks into the feature maps of RBCN and trains the network to reconstruct the GT masks through a denoising task, thereby enhancing its ability to distinguish target proposals from background regions and accelerating convergence. In the fine stage, we customize a lightweight target detector to the coarse stage’s region proposals for balancing accuracy and efficiency. Furthermore, we propose a knowledge distillation strategy guided by the teacher-student cross-attention prior. This mechanism directs the student to focus on critical target regions, thereby enhancing the discriminative feature representation for infrared small targets. Extensive experiments on three real infrared datasets demonstrate that our method outperforms both existing single-stage and two-stage approaches while maintaining high real-time processing efficiency.

[CV-183] ScalePredictor: Instance-aware Scale Learning for Accurate Quantization of Vision Transformers

Link: https://arxiv.org/abs/2606.21947
Authors: Changjun Li,Runqing Jiang,Lian Xu,Ye Zhang,Qingyong Hu,Yulan Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision Transformers have achieved remarkable success in many fields, yet their deployment on edge devices remains challenging due to their substantial computational demands. Post-Training Quantization (PTQ) offers an attractive solution by compressing models using a small calibration set with minimal training overhead. However, most existing PTQ works adopt a static quantization paradigm that is uniformly applied to all instances. Given the substantial diversity of natural images, the activation distributions vary significantly across samples, making these methods inherently suboptimal. In this paper, we propose ScalePredictor, a dynamic quantization framework for accurate and efficient quantization scale learning of ViTs. We first reveal a hidden correlation between the distribution range of shallow-layer activations and the optimal scales of deeper layers. Based on this, we develop a scale learning mechanism that integrates an efficient range extraction approach to capture robust range statistics at the shallow stage, which are then fed into a Taylor-motivated polynomial scale projection module to generate all quantization scales simultaneously. With the efficiency of polynomial approximation, ScalePredictor introduces insignificant computational overhead while avoiding costly just-in-time calibration. Extensive experiments on ImageNet demonstrate that ScalePredictor consistently outperforms prior PTQ methods, achieving a more favorable accuracy-efficiency trade-off. Code and additional results are shown in the supplementary materials.

[CV-184] Artic-O: End-to-End Articulated Object Reconstruction via Latent Geometry Learning

Link: https://arxiv.org/abs/2606.21938
Authors: Xuyang Wang,Zhenyu Li,Jian Ding,Habib Slim,Peter Wonka,Hongdong Li,Mohamed Elhoseiny
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing articulated objects from sparse images requires recovering complete geometry, movable parts, and motion parameters. Recent methods typically separate geometry reconstruction, part reasoning, and articulation estimation into different stages. This separation can weaken consistency between shape, active parts, and motion, while also incurring substantial inference cost. We introduce Artic-O, an end-to-end, feed-forward framework for articulated object reconstruction via latent geometry learning. Instead of fitting geometry in image or view space, Artic-O maps sparse multi-state observations into a pretrained latent geometry space, where a frozen flow-matching decoder provides a complete-shape prior for recovering visible and occluded structures. To connect geometry with articulation, Artic-O fuses visual tokens, geometry latents, and point-wise decoder features in an image-grounded part-reasoning module for active-part segmentation and articulation prediction. We further train the model with a geometry-to-articulation curriculum and a decoupled two-pass strategy to balance reconstruction and part-level supervision. On PartNet-Mobility, Artic-O achieves strong reconstruction quality while being substantially more efficient than LARM, a strong prior method. It reduces Chamfer Distance, improves F-score, and achieves comparable or better articulation accuracy across most joint metrics, while reducing inference time from 9 minutes to about 0.3 seconds per object.

[CV-185] CoSA: Correlation-Guided Change Attention with Learnable Residual Gating for Remote Sensing Change Detection

Link: https://arxiv.org/abs/2606.21932
Authors: Abdirashid Omar,Jonghyuk Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures; published in IEEE Access. Code: this https URL

Abstract:Remote sensing change detection (CD) from bi-temporal imagery is critical for applications such as urban monitoring, disaster assessment, and environmental management, yet robust localization remains challenging under sparse changes, noisy labels, and appearance variations. In this paper, we propose Context Sampling Attention (CoSA), a lightweight decoder-side refinement module that explicitly leverages bi-temporal feature correlation as a control signal for adaptive change-aware feature enhancement. This differs from conventional attention mechanisms that rely on implicit feature weighting without explicit temporal control. In the implemented FC-Siam setting, CoSA computes normalized same-location cross-correlation between paired decoder features, converts low correlation into a change gate, and injects the resulting gated residual at native 1/8 and 1/16 feature scales through learnable residual scaling. This design enables effective discrimination between stable and ambiguous regions without relying on computationally expensive global attention. Extensive experiments on four benchmark datasets (LEVIR-CD, S2Looking, DSIFN, and CLCD) demonstrate consistent improvements over strong baselines, achieving 1.5-2.6% gains in changed-class F1 while introducing negligible parameter overhead. Ablation studies confirm that multiscale placement and learnable residual gating are both important for peak performance. These results indicate that CoSA establishes a practical and effective refinement paradigm for enhancing temporal discriminability in Siamese change detection frameworks.
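
The gating idea in the abstract (low same-location correlation between bi-temporal features opens a change gate, which then scales a residual) can be illustrated for a single spatial location. This is a hedged sketch, not the CoSA implementation: the mapping from correlation to gate, the choice of residual, and the names `change_gate`, `gated_residual`, and `alpha` are all assumptions; `alpha` stands in for the learnable residual scale.

```python
import math


def change_gate(feat_t1, feat_t2, eps=1e-8):
    """Map the cosine correlation of two channel vectors to a gate in [0, 1]."""
    dot = sum(a * b for a, b in zip(feat_t1, feat_t2))
    norm1 = math.sqrt(sum(a * a for a in feat_t1))
    norm2 = math.sqrt(sum(b * b for b in feat_t2))
    corr = dot / (norm1 * norm2 + eps)  # normalized correlation in [-1, 1]
    # Assumed mapping: high correlation -> gate near 0, low -> gate near 1.
    return (1.0 - corr) / 2.0


def gated_residual(feat, residual, gate, alpha):
    """Return feat + alpha * gate * residual, element by element."""
    return [f + alpha * gate * r for f, r in zip(feat, residual)]


unchanged = change_gate([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # close to 0
changed = change_gate([1.0, 0.0], [0.0, 1.0])              # 0.5 for orthogonal features
refined = gated_residual([1.0, 2.0], [1.0, -1.0], gate=changed, alpha=0.2)
print(unchanged, changed, refined)
```

Because the gate is computed per location from features that already exist in the decoder, a module of this kind adds very few parameters, which is consistent with the "negligible parameter overhead" claim.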

[CV-186] GTA-Net: Cooperative Game Theory for Vision-Language Alignment in Chest X-Ray Report Generation

Link: https://arxiv.org/abs/2606.21915
Authors: Saif ur Rehman Khan,Imad Ahmed Waqar,Sebastian Vollmer,Muhammad Nabeel Asim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automated chest X-ray report generation requires precise cross-modal grounding to ensure clinically reliable descriptions. However, existing vision-language models rely on implicit attention mechanisms that fail to enforce explicit region-word correspondence and disease-level consistency. We propose Game-Theoretic Alignment Network (GTA-Net), a vision-language framework that formulates report generation as a cooperative game-theoretic alignment problem. The model introduces a BinaryGameAligner that models interactions between image regions and text tokens using similarity-based payoff matrices with Shapley-inspired importance weighting. To enforce clinical semantics, we further develop a Disease-Aware Ternary Aligner, which captures joint interactions among images, reports, and structured disease concepts. GTA-Net combines a Swin-based visual encoder with a LoRA-adapted large language model and is trained with a unified objective for generation and alignment. Experiments on CheXpertPlus and IU-XRay demonstrate state-of-the-art performance across standard generation metrics and improved clinical consistency, highlighting the effectiveness of explicit game-theoretic alignment for medical vision-language generation.

[CV-187] Rethinking the Adaptation of Vision Foundation Models for Efficient Cell Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.21913
Authors: Qing Xu,Xiangjian He,Wenting Duan,Jiebo Luo,Zhen Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by MICCAI 2026

Abstract:Cell segmentation is critical for computational pathology and biomedical discovery. While recent Vision Foundation Models (VFMs) have demonstrated remarkable universal feature representations, unlocking their full potential for cellular imaging is currently bottlenecked by resource-intensive adaptation paradigms. Existing methods typically rely on fine-tuning heavy visual encoders, leading to extensive computational overhead and a dependency on large-scale annotations. To address this, we propose the EffiCell-Seg framework for highly efficient cell segmentation without re-training the visual encoder. Our core insight is that pretrained VFMs intrinsically encode complementary structural priors: global saliency for localizing potential cells, and local morphological patterns for delineating cellular structures. To harness these priors, we devise a Cell Structure Prompt Encoder (CSP-Encoder) that synthesizes semantic-aware saliency and principal morphological features from frozen VFM representations into explicit structural prior maps. Moreover, we propose a Synergistic Mask Decoder (SM-Decoder) that enforces contextual consistency by jointly predicting geometric distance fields and semantic maps via mutual cross-guidance. Extensive experiments demonstrate that EffiCell-Seg outperforms state-of-the-art methods across diverse cell imaging modalities while requiring only ~5M trainable parameters, over 130x fewer than fully fine-tuned VFM counterparts. The code is available at this https URL.

[CV-188] Fidelity- and Perception-Aware Local Implicit Attention for Arbitrary-Scale Image Super-Resolution ECCV2026

Link: https://arxiv.org/abs/2606.21910
Authors: Yu-Syuan Xu,Hao-Lun Sun,Hao-Wei Chen,Hsien-Kai Kuo,Chun-Yi Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:Arbitrary-scale image super-resolution (ASISR) aims to reconstruct high-resolution images from low-resolution inputs over a continuous range of upscaling factors. While traditional pixel-regression approaches often produce overly smooth results that lack realistic details, recent diffusion methods can produce sharper and more realistic textures. However, these diffusion techniques frequently introduce the risk of structural hallucinations. To address these issues, we propose Fidelity- and Perception-Aware Local Implicit Attention (FPLIA), a framework that effectively integrates fidelity-oriented features into a diffusion pipeline to produce realistic and faithful reconstructions for ASISR. We introduce a Fidelity and Perception Attention Module (FPAM), which applies both self-attention and cross-attention to fidelity-oriented and perceptual features to enhance representational capacity. To further exploit their complementarity, we design a Fidelity and Perception Select Module (FPSM) that adaptively selects the most representative features for RGB value prediction. We conduct extensive experiments to validate the effectiveness of these components. Both qualitative and quantitative results show that FPLIA delivers superior perceptual realism while maintaining reconstruction accuracy on standard ASISR benchmarks. The source code is accessible at the following repository: this https URL.

[CV-189] Mesh2GS: White-Box 3DGS Construction via Plenoptic Sampling

Link: https://arxiv.org/abs/2606.21898
Authors: Haoran Zhu,Youcheng Cai,Huangsheng Du,Jingyang Meng,Ligang Liu
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures

Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising method for high-quality, real-time 3D reconstruction. To associate 3DGS with mesh representations, existing methods primarily focus on 3DGS-to-mesh reconstruction from multi-view images. In contrast, the problem of converting a mesh into 3DGS has received comparatively less attention. Instead of relying on heuristic strategies that bind 3D Gaussians to the mesh, we propose a novel white-box 3DGS construction framework, termed Mesh2GS, which generates 3DGS directly from mesh geometry based on plenoptic sampling theory, achieving Nyquist-level performance for high-quality global illumination rendering. First, we propose a plenoptic sampling guided 3DGS construction strategy that theoretically derives the minimum sampling rate of the sampled views and the distribution of 3D Gaussians. Second, we propose a novel 3DGS update procedure with albedo–shading decomposition for efficient global-illumination capture. Finally, we introduce a neural illumination enhancement module to handle non-Lambertian effects. Experimental results demonstrate that our method surpasses state-of-the-art baselines and is practically effective both for real-time shared rendering and for capturing non-Lambertian effects such as specular highlights. The project code will be released upon acceptance.

[CV-190] AgroSense 2.0: Cross-Modal Transformer Fusion with Geospatial Raster Integration and Interpretable Multi-Task Learning for Precision Crop Recommendation

Link: https://arxiv.org/abs/2606.21892
Authors: Vishal Pandey,Rishav Tewari,Ruzina Haque Laskar
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 Pages, 3 pages

Abstract:Crop recommendation systems in precision agriculture have long suffered from a fundamental modality gap: visual soil characterization and chemical nutrient profiling are typically treated as independent inference problems, with fusion often reduced to late-stage feature concatenation. AgroSense 2.0 addresses this limitation through three architectural advances. First, we introduce continental-scale geospatial integration via a seven-band soil raster (india_soil_… this http URL) spanning India, encoding Nitrogen, pH, SOC, Clay, Sand, Silt, and Bulk Density as 32×32 spatial patches, a modality entirely absent from prior work. Second, we replace naive feature concatenation with a cross-modal Transformer fusion module, where tabular nutrient features attend over image representations via multi-head attention, enabling richer inter-modal dependency modeling than shallow fusion. Third, we adopt a multi-task objective jointly optimizing soil classification and crop recommendation through a shared backbone, improving generalization via complementary cross-task signal. To enhance interpretability, we apply TreeSHAP to the tabular branch, revealing crop-conditioned nutrient sensitivity: humidity and rainfall emerge as the most influential features globally, while crop-specific profiles diverge meaningfully: rainfall dominates rice, nitrogen and potassium dominate maize, and humidity and nitrogen dominate coffee. These explanations provide transparency into model decisions and surface both agronomically consistent patterns and dataset-specific divergences worth further study. Together, these contributions establish AgroSense 2.0 as a more principled, interpretable, and geospatially grounded framework for precision agriculture.

[CV-191] Prompt-Calibrated SAM 3 for Open-Vocabulary Remote Sensing Semantic Segmentation

Link: https://arxiv.org/abs/2606.21863
Authors: Yanghui Song,Nanqing Liu,Haonan Yin,Yingjie Gao,Chengfu Yang,Qi Ming
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 5 figures. This is the revised version of a manuscript currently under review for publication in IEEE Geoscience and Remote Sensing Letters (GRSL)

Abstract:Open-vocabulary semantic segmentation (OVSS) in remote sensing images aims to segment categories beyond a fixed label space. Recent SAM 3-based methods provide a promising training-free foundation, yet three key issues remain: (1) a single class-name prompt lacks sufficient semantic coverage for complex remote sensing categories; (2) expanding each category into multiple prompts introduces redundant online text encoding; and (3) directly aggregating multiple prompt responses propagates noisy activations into the final prediction. To address these issues, we propose ProC-SAM3, which calibrates SAM 3’s prompt interface for remote sensing OVSS from three complementary aspects. First, we construct an offline prompt pool where a Category Matcher groups MLLM-generated candidates into per-category sets, and Expansion Constraints further refine each set using category-specific prior knowledge. Second, the resulting text embeddings are cached and reused across all test images, eliminating repeated text encoding. Third, we introduce Presence-Guided Residual Fusion to gate unreliable decoder outputs by prompt presence and confidence, followed by peak-preserving class aggregation that retains fine-grained activations for small and sparse objects. Experiments on eight benchmarks show that ProC-SAM3 achieves an average mIoU of 56.1%, outperforming the previous best training-free method by 3.9 percentage points. Code will be available at this https URL.

[CV-192] Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and Cross-Dataset Generalization CVPR2026

Link: https://arxiv.org/abs/2606.21861
Authors: Aman Goyal, Kshama Nitin Shah, Kemmannu Vineet Venkatesh Rao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, including supplementary material. Presented as a non-archival paper at the CV4Edu Workshop, CVPR 2026

Abstract:Automated classroom engagement recognition holds substantial promise for scalable learning analytics, yet the suitability of modern Vision-Language Models (VLMs) for this task under zero-shot conditions remains largely unexplored. We present a systematic benchmark that evaluates five widely-used VLMs: CLIP, BLIP-VQA, GPT-4o, LLaVA-1.5-7B, and Qwen2.5VL-7B-Instruct across two complementary educational datasets: DAiSEE, an individual-student video dataset (300 sampled test clips), and the Student Classroom Behaviour dataset (SCB, 1,168 scene-level images). Each model is probed with three prompt variants spanning minimal, rubric-anchored, and chain-of-thought designs. Our experiments reveal three primary failure modes of zero-shot VLMs for engagement recognition: (1) near-random performance on individual students, with Cohen’s kappa never exceeding 0.10 on DAiSEE; (2) severe class collapse, where models assign 85-100% of predictions to a single engagement level regardless of visual content; and (3) extreme prompt sensitivity, with accuracy swings of up to 32 percentage points on identical images depending solely on prompt phrasing. Remarkably, scene-level classification on SCB is substantially more tractable: CLIP and GPT-4o achieve kappa approximately 0.60 when prompted with behaviorally-grounded rubrics. We also document a practical barrier for deployment: GPT-4o’s safety filters reject 98% of chain-of-thought requests involving individual student faces. Our findings provide a calibrated baseline and surface critical design considerations for the use of VLMs in educational observation systems.
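
Why class collapse and near-zero kappa go together can be seen in a short sketch of Cohen's kappa. The labels below are hypothetical (four engagement levels, balanced), not data from the benchmark, and returning 0.0 in the degenerate single-class case is a convention chosen here.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items where both labelings match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement computed from the two marginal label distributions.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # degenerate case: both labelings use one identical class
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels: four engagement levels, 25 clips each.
truth = [0] * 25 + [1] * 25 + [2] * 25 + [3] * 25
collapsed = [2] * 100  # a model that always predicts level 2

accuracy = sum(t == p for t, p in zip(truth, collapsed)) / len(truth)
kappa = cohen_kappa(truth, collapsed)
```

A collapsed predictor still reaches 25% accuracy on this balanced toy set, but its kappa is zero because all of that agreement is expected by chance.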

[CV-193] Beyond Flat Labels: Level-Restricted Contrastive Learning for Hierarchical Fine-Grained Vision Classification CVPR2026

Link: https://arxiv.org/abs/2606.21838
Authors: Zhiyuan Tao, Srikumar Sastry, Matthew J Thompson, Elizabeth G Campolongo, Net Zhang, Ziheng Zhang, Hilmar Lapp, Yu Su, Tanya Berger-Wolf, Nathan Jacobs, Wei-Lun Chao, Jianyang Gu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2026 FGVC Workshop

Abstract:Multimodal contrastive learning has enabled zero-shot visual classification by aligning images with textual categories. However, in hierarchically structured label spaces, existing methods often produce predictions that are inconsistent across taxonomic levels. For example, a model may predict a fine-grained category whose parent category contradicts its simultaneously predicted higher-level label. By analysis, the issue originates from false negative labels when contrastive comparison involves multiple taxonomic levels. To this end, we propose to restrict contrastive comparisons to categories within the same taxonomic level. In addition, we adopt a group-balanced design, ensuring each taxonomic level receives adequate optimization. As a result, the proposed framework improves both hierarchical consistency and classification accuracy from coarse to fine granularity. We train our model with TreeOfLife-10M based on BioCLIP and evaluate it across multiple hierarchical classification benchmarks, where the model demonstrates significantly improved hierarchical consistency in both Euclidean and hyperbolic spaces. Notably, on iNaturalist 2021 (iNat21), our method improves average accuracy across levels by 30.47% over the baseline, highlighting its effectiveness for hierarchical zero-shot classification.
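
The false-negative issue and the level-restricted remedy can be sketched as follows. This is an illustrative reading of the abstract, not the paper's loss: the taxonomy, the similarity scores, and the equal weighting of levels are assumptions.

```python
import math


def softmax_ce(scores, candidates, positive):
    """Cross-entropy of `positive` under a softmax over `candidates` only."""
    logits = [scores[c] for c in candidates]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - scores[positive]


def level_restricted_loss(scores, levels, positives):
    """Average of per-level losses; each softmax sees one taxonomic level."""
    per_level = [
        softmax_ce(scores, categories, positives[level])
        for level, categories in levels.items()
    ]
    # Equal weight per level, so coarse and fine levels are both optimized.
    return sum(per_level) / len(per_level)


# Hypothetical two-level taxonomy and image-text similarity scores.
levels = {
    "family": ["Felidae", "Canidae"],
    "species": ["Panthera leo", "Canis lupus"],
}
positives = {"family": "Felidae", "species": "Panthera leo"}
scores = {"Felidae": 3.0, "Canidae": 0.0, "Panthera leo": 3.0, "Canis lupus": 0.0}

all_categories = levels["family"] + levels["species"]
flat_species = softmax_ce(scores, all_categories, "Panthera leo")
restricted_species = softmax_ce(scores, levels["species"], "Panthera leo")
restricted_total = level_restricted_loss(scores, levels, positives)
```

In the flat comparison the correct parent "Felidae" competes with "Panthera leo" as a negative, so the species loss stays high even when both predictions are right; restricting the softmax to one level removes that penalty.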

[CV-194] RAPID: A Reproducible Multi-Agent Pipeline for Interpretable Disaster Damage Assessment from Satellite and Street-View Imagery

Link: https://arxiv.org/abs/2606.21819
Authors: Yifan Yang, Wenjing Gong, Kaili Zhang, Lei Zou, Zhengzhong Tu, Hao Li, Zongrong Li, Xinyue Ye
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures

Abstract:Due to the increasing frequency and intensity of extreme climate events, there is a clear demand for intelligent, scalable, and autonomous approaches to disaster damage assessment. Existing methods, largely based on supervised learning and task-specific fine-tuning, struggle to generalize under domain shifts, long-tailed data distributions, and heterogeneous geospatial data sources, especially in disaster scenarios. They also often lack the ability to integrate and reason across multimodal geospatial information, such as satellite images and street-view images. In this paper, we introduce RAPID, a reproducible multi-agent pipeline for interpretable disaster damage assessment, including damage-level assessment, damage-type interpretation, and actionable suggestions for response, remediation, and recovery. RAPID coordinates specialized agents to perform cross-view understanding, image restoration, structured damage recognition, and geographical reasoning across heterogeneous data modalities. Without task-specific fine-tuning, RAPID supports zero-shot damage assessment by jointly using complementary information from remote sensing and ground-level perspectives. The system produces fine-grained, interpretable assessments and automatically generates location-specific, decision-relevant disaster reports to support early-stage emergency response. We evaluate RAPID across hurricanes, floods, wildfires, and earthquakes using multiple cross-view imagery inputs, including pre- and post-disaster street-view images, post-disaster remote sensing imagery, and street-view image pairs. Experiments show that RAPID achieves 0.92 overall accuracy for multi-disaster type classification and up to 0.627 for cross-view damage severity prediction, highlighting its potential as a foundational framework for autonomous disaster intelligence.

[CV-195] Rotation-Aware Point-Cloud Embeddings for Vision-Based In-Hand Reorientation

Link: https://arxiv.org/abs/2606.21788
Authors: Yashom Dighe, Karthik Dantu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Point-cloud goals provide a direct way to specify dexterous in-hand reorientation: instead of defining an object-specific pose frame or estimating 6D pose at test time, the policy is given the desired 3D geometry of the object. Yet raw point-cloud goal conditioning is poorly conditioned for policy learning. Current and goal clouds are unordered, independently sampled, and often visibility-dependent, so their discrepancy entangles object rotation with permutation, resampling, and unstable correspondence structure. For this reason, prior point-cloud manipulation methods typically add structure outside the representation itself, such as explicit pose or relative-pose inputs, dense flow features, or distillation from privileged teachers. We close this gap by learning a rotation-aware point-cloud embedding whose Euclidean latent distance is calibrated to the SO(3) geodesic error between object orientations. The resulting representation turns current-goal comparison into a smooth control signal, allowing a model-free RL policy to act from current and goal point-cloud embeddings, proprioception, and centroid metadata, without object pose, relative pose, dense flow, or teacher-action supervision. In in-hand reorientation experiments, this interface matches privileged-state and distillation-based baselines while avoiding brittle test-time computation of structured pose or flow inputs. These results suggest that point-cloud goals become practical for this task when the representation, rather than an external module, encodes the task-relevant geometry of rotation. We also show evidence that generic visual point-cloud pretraining is insufficient for such a current-goal comparison because it discards the task-relevant state and preserves only shape features.
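
The quantity the embedding distance is calibrated against, the SO(3) geodesic error, is the rotation angle of the relative rotation between two orientations. A minimal sketch of that target quantity follows; the helper `rot_z` and the example angles are illustrative, and nothing here reproduces the learned embedding itself.

```python
import math


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_angle(r1, r2):
    """Rotation angle of r1^T r2, i.e. the geodesic distance on SO(3)."""
    # trace(r1^T r2) equals the sum of elementwise products of r1 and r2.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.acos(cos_angle)


current = rot_z(0.2)
goal = rot_z(0.9)
error = geodesic_angle(current, goal)  # about 0.7 rad for these two rotations
```

A representation whose Euclidean distance tracks this angle gives the policy a signal that shrinks smoothly as the object approaches the goal orientation.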

[CV-196] Motion-Aware Reinforcement Learning For Object Localization

Link: https://arxiv.org/abs/2606.21764
Authors: Prithvi Raj Singh, Satyendra Singh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 6 figures, 9 Tables

Abstract:We present MARLNet (Motion-Aware Reinforcement Learning Network), a PPO-based bounding-box refinement agent that incorporates a constant-velocity motion prior into the observation state and an action smoothness penalty into the reward function. The agent operates on 268-dimensional observations encoding the current proposal, a kinematic prediction, the previous action, and a 256-dimensional EfficientNet-B0 crop feature, and learns a five-dimensional policy controlling coordinate adjustments and a binary termination trigger. Evaluated on Pascal VOC 2012 and VisDrone 2019, MARLNet trains stably across all regularization strengths tested and achieves consistent gains in detection success rate at $\text{IoU} \geq 0.5$: up to +0.011 on VOC ($\lambda_{\text{phys}}=0.10$), where the motion prior prevents the overshooting that causes plain PPO to regress on this metric, and +0.007 on VisDrone ($\lambda_{\text{phys}}=0.70$), where unconstrained PPO achieves a larger gain (+0.025) owing to the weaker base detector. Through reward design ablations and training dynamics analysis, we identify a reward interference in which combining a constant-velocity deviation penalty with an absolute IoU term causes trigger collapse, and show that replacing it with the action smoothness penalty resolves this failure. We further characterize a representational ceiling facing crop-feature refinement agents that share a backbone with their base detector, confirmed through a global-plus-local observation ablation. Project page: this https URL
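
The two ingredients named in the abstract, a constant-velocity prior and an action smoothness penalty, can be sketched under simple assumptions. The abstract does not give their exact form, so the linear extrapolation from the last two proposals, the squared-difference penalty, and all names below are hypothetical.

```python
def constant_velocity_prediction(prev_box, curr_box):
    """Extrapolate the next box by repeating the last displacement.

    Boxes are [x1, y1, x2, y2]; this coordinate layout is an assumption.
    """
    return [c + (c - p) for p, c in zip(prev_box, curr_box)]


def smoothness_penalty(prev_action, action, lam):
    """Penalize abrupt changes between consecutive actions (squared L2)."""
    return lam * sum((a - b) ** 2 for a, b in zip(action, prev_action))


prev_box = [0, 0, 10, 10]
curr_box = [2, 1, 12, 11]
predicted = constant_velocity_prediction(prev_box, curr_box)

penalty = smoothness_penalty([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], lam=0.1)
```

In this reading the prediction is fed to the agent as part of its observation, while the penalty is subtracted from the reward, so the prior informs the policy without directly constraining it.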

[CV-197] From Gradient Clipping to Structural Refinement: Improving DPSGD for Medical Image Segmentation

Link: https://arxiv.org/abs/2606.21763
Authors: Shiva Parsarad, Parth Shandilya, Isabel Wagner
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical image segmentation is widely used for disease detection but relies on sensitive data, raising privacy concerns as trained models can leak information. Differential privacy, typically implemented via Differentially Private Stochastic Gradient Descent (DPSGD), provides a solution, though at the cost of reduced utility. Recent DPSGD variants, including Automatic clipping (Auto-S), Normalised SGD with perturbation (NSGD), and Per-sample adaptive clipping (PSAC), have shown promise in image classification, but their behavior in medical segmentation remains underexplored. We evaluate these methods across binary and multi-class tasks and analyze gradient alignment, showing that prior assumptions, particularly for PSAC, do not consistently hold. We further demonstrate that combining clipping strategies with morphological refinement improves segmentation quality under privacy constraints. Finally, we propose an adaptive DP-Morph variant that captures class-specific structures and enhances performance in multi-class settings.
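
For orientation, the standard DPSGD step and a normalization-style alternative can be sketched on plain Python lists. The form g / (||g|| + gamma) is how automatic clipping and normalised SGD are commonly described and is treated as an assumption here; PSAC and the morphological refinement are not covered, and all names and values are illustrative.

```python
import math
import random


def l2_norm(g):
    return math.sqrt(sum(x * x for x in g))


def clip_per_sample(g, clip_norm):
    """Standard DPSGD clipping: rescale g so its L2 norm is at most clip_norm."""
    scale = min(1.0, clip_norm / (l2_norm(g) + 1e-12))
    return [x * scale for x in g]


def normalize_per_sample(g, gamma=0.01):
    """Normalization-style alternative: g / (||g|| + gamma)."""
    denom = l2_norm(g) + gamma
    return [x / denom for x in g]


def private_mean_gradient(grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_per_sample(g, clip_norm) for g in grads]
    dim = len(grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(grads) for x in noisy]


# Hypothetical per-sample gradients: one large, one small.
grads = [[3.0, 4.0], [0.3, 0.4]]
update = private_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.1,
                               rng=random.Random(0))
```

Clipping leaves small gradients untouched and caps large ones, whereas normalization rescales every gradient to roughly unit norm, which removes the need to tune the clipping threshold.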

[CV-198] Scene-Level Heterogeneous Physics Simulation with 3D Gaussian Splats CVPR2026

Link: https://arxiv.org/abs/2606.21753
Authors: Xiaoyang Liu, Shangzhe Wu, Kai Han
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 Findings

Abstract:3D Gaussian Splatting (3DGS) has achieved state-of-the-art photorealistic rendering, but the representation gap prevents these assets from being physically interactive. Production-grade physics engines do not understand the 3DGS representation, while prior physics-for-3DGS methods are monolithic silos. These prior works are fundamentally limited, demonstrating only object-centric physics in isolated environments, such as on an ideal plane. They are incapable of interacting with complex static collision geometry or heterogeneous assets. We propose a novel framework that, for the first time, bridges this gap by enabling 3DGS assets to participate in scene-level, heterogeneous, multi-solver physical simulations. Our core contribution is a Representation Abstraction Framework that translates all diverse assets, including 3DGS, virtual meshes, and fluids, into a unified physical particle set. This abstraction is key to enabling complex behaviors, such as the non-rigid deformation of 3DGS assets, within a unified physics pipeline. This particle set, along with the static scene collision boundaries derived from scene capture, is processed within a solver-agnostic physics kernel. The physical results are then mapped back to drive each asset’s specific visual reconstruction. This architecture unlocks capabilities impossible with prior art. We demonstrate complex, two-way interactions between deformable 3DGS assets, standard CG assets such as fluids and meshes, and large-scale captured static environments, showcasing realistic coupled phenomena that were previously unattainable.

[CV-199] Quantile Adaptive Temperature Scaling for Confidence Calibration

Link: https://arxiv.org/abs/2606.21749
Authors: Omprakash Chakraborty, Leo Fillioux, Ismail Ben Ayed, Jose Dolz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep neural networks often produce poorly calibrated confidence estimates, overstating their certainty even when predictions are incorrect. Temperature Scaling (TS) remains the most widely used post hoc calibration method due to its simplicity and effectiveness, yet its global, uniform rescaling of logits fails to correct the highly heterogeneous structure of miscalibration observed across the confidence spectrum. In particular, the largest correctness-confidence discrepancies arise in different quantile regions depending on the setting; low-confidence predictions, where uncertainty matters most, tend to exhibit the largest discrepancies, which standard TS leaves largely unaddressed. We introduce Quantile Adaptive Temperature Scaling (QaTS), a simple and efficient post hoc calibration method that adapts the temperature as a function of a prediction’s empirical confidence quantile. By mapping confidences into the quantile space, QaTS normalizes the calibration problem, makes the structure of miscalibration explicit, and enables a monotone temperature function that adapts across quantiles while leaving well-calibrated, high-confidence predictions largely unchanged. This quantile-aware formulation aligns naturally with a reparameterized Expected Calibration Error (ECE) objective and yields a sample-wise temperature that is robust across a variety of challenging scenarios, such as class imbalance and distributional shifts. Across a broad range of datasets, architectures, evaluation scenarios, and tasks, QaTS consistently and substantially outperforms state-of-the-art post hoc calibration methods, delivering more reliable and trustworthy confidence estimates without modifying model predictions.
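
The idea of a quantile-dependent temperature can be sketched as below. The abstract does not specify the temperature function, so the linear schedule in `temperature_for`, the value of `alpha`, and the calibration confidences are assumptions made only for illustration.

```python
import bisect
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_quantile(conf, sorted_calib):
    """Fraction of calibration-set confidences that are <= conf."""
    return bisect.bisect_right(sorted_calib, conf) / len(sorted_calib)


def temperature_for(q, alpha=1.5):
    """Hypothetical monotone schedule: T = 1 at q = 1, larger T at low q."""
    return 1.0 + alpha * (1.0 - q)


def calibrate(logits, sorted_calib, alpha=1.5):
    conf = max(softmax(logits))
    q = empirical_quantile(conf, sorted_calib)
    return softmax(logits, temperature_for(q, alpha))


# Hypothetical held-out confidences, sorted ascending.
calib_confidences = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

low_conf_logits = [2.0, 1.0, 0.0]
high_conf_logits = [10.0, 0.0, 0.0]
softened = calibrate(low_conf_logits, calib_confidences)
unchanged = calibrate(high_conf_logits, calib_confidences)
```

Because each sample is only divided by a positive temperature, the predicted class never changes; only the confidence attached to it is adjusted, and predictions in the top quantile pass through untouched under this schedule.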

[CV-200] Adversarial Domain Prompt Tuning and Generation for Single Domain Generalization CVPR2025

Link: https://arxiv.org/abs/2606.21736
Authors: Zhipeng Xu, De Cheng, Xinyang Jiang, Nannan Wang, Dongsheng Li, Xinbo Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, accepted by CVPR 2025

Abstract:Single domain generalization (SDG) aims to learn a robust model that performs well on many unseen domains when only a single domain is available for training. One of the promising directions for achieving single-domain generalization is to generate out-of-domain (OOD) training data through data augmentation or image generation. Given the rapid advancements in AI-generated content (AIGC), this paper is the first to propose leveraging powerful pre-trained text-to-image (T2I) foundation models to create the training data. However, manually designing textual prompts to generate images for all possible domains is often impractical, and some domain characteristics may be too abstract to describe with words. To address these challenges, we propose a novel Progressive Adversarial Prompt Tuning (PAPT) framework for pre-trained diffusion models. Instead of relying on static textual domains, our approach learns two sets of abstract prompts as conditions for the diffusion model: one that captures domain-invariant category information and another that models domain-specific styles. This adversarial learning mechanism enables the T2I model to generate images in various domain styles while preserving key categorical features. Extensive experiments demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art single-domain generalization approaches.

[CV-201] HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning

Link: https://arxiv.org/abs/2606.21734
Authors: Awais Rauf, Ahmed Hasssan, Greg Slabaugh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Understanding long videos requires fine-grained perception and multi-step, higher-order reasoning over complex, long-range spatio-temporal dynamics. Vision-language models (VLMs) encode video frames into visual tokens and attempt to perform both perception and multi-step planning latently, within a single forward pass. This coupled formulation, however, is bottlenecked by the LLM’s limited capacity to discover and execute multi-step strategies in its latent representations. To address this bottleneck, we propose Hierarchical Programmatic Probing (HPP), a framework that decouples semantic perception from higher-order temporal reasoning by reformulating long video understanding as iterative, programmatic exploration of a hierarchically segmented video. Specifically, a coding-capable LLM plans and executes a multi-step strategy in an interactive coding environment, probing the video for information and invoking a VLM for localized perception on demand. To make probing tractable over long videos, we introduce three components: information-density-aware hierarchical segmentation, late-interaction semantic retrieval, and structured probing functions for coarse-to-fine temporal localization. We validate HPP on LongVideoBench, which requires both fine-grained perception and long-range relational reasoning, and show that decoupling the two via iterative programmatic probing yields substantial gains. Further results on EgoSchema, VideoMME, and MLVU demonstrate the effectiveness of our approach across diverse long-video benchmarks.

[CV-202] Structural Assessment for Understanding and Guiding Dataset Distillation in Discrete Token Space ECCV2026

Link: https://arxiv.org/abs/2606.21705
Authors: Yue Cao, Jianyang Gu, Vyacheslav Kungurtsev, Yu Hu, Jozsef Hamari, Zheng Liu, Mohsen Zardadi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Abstract:Dataset distillation (DD) has proven to reduce training cost while preserving accuracy. While promising, the factors that make one distilled dataset more effective than another remain poorly understood. In this work, we investigate this question through the lens of discrete visual tokenizers. Whereas many prior DD efforts emphasize matching global data distributions, we suggest that the effectiveness depends on which semantic concepts are captured and how they are composed. Discrete visual tokenizers provide a finite vocabulary that enables direct statistical analysis of such compositional structure. Through quantitative analysis of token-level statistics, we introduce the structural score to measure the adequacy of token compositions. We observe that distilled datasets with balanced token composition yield higher validation performance. On the other hand, divergence from the original data does not necessarily harm performance. We further show that samples with high structural scores in the discrete token space can effectively guide diffusion-based DD. Our findings highlight the importance of token composition in dataset effectiveness, offering a principled complement to distributional similarity considerations in DD.

[CV-203] VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation

Link: https://arxiv.org/abs/2606.21700
Authors: Xuan Qi, Daniele Berardini, Dario Serez, Vito Paolo Pastore, Vittorio Murino
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised domain adaptation (UDA) aims to learn a target-domain classifier from labeled source data and unlabeled target data under distribution shift. Recent diffusion-based UDA methods approach this problem by synthesizing labeled target-style images and training on the resulting synthetic data. However, their performance depends heavily on the conditioning design: class prompts provide only coarse guidance, while domain adaptation modules mainly control appearance, which may leave target-style synthesis insufficiently specified. We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided UDA. Instead of relying only on text prompts, VT-DUDA uses source images to provide additional instance-level visual context for target-style synthesis. Specifically, VT-DUDA maps each source image to a compact sequence of visual tokens and forms a hybrid conditioning context by concatenating these tokens with the corresponding text embeddings along the cross-attention context dimension of a latent diffusion model. This provides instance-dependent conditioning beyond text alone, while synthesis is performed with the target-domain adapter branch. Because guidance is represented explicitly as a token sequence, the same interface also permits inference-time manipulation of the conditioning signal through token selection and token-strength adjustment. The proposed method preserves the standard diffusion objective and can be integrated into existing adapter-based diffusion frameworks without modifying the backbone. Across Office-31, Office-Home, and VisDA-2017, VT-DUDA improves average target-domain accuracy over strong discriminative and diffusion-based UDA baselines. The results suggest that, in generation-based UDA, a stronger conditioning interface can improve the downstream usefulness of synthetic target-style data.
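
The hybrid conditioning context described above amounts to concatenating two token sequences along the context (sequence) dimension. A minimal sketch with plain lists follows; the token values, the `keep` and `strength` parameters, and the function name are hypothetical stand-ins for the token selection and token-strength adjustment mentioned in the abstract.

```python
def build_context(text_tokens, visual_tokens, keep=None, strength=1.0):
    """Concatenate text tokens with selected, rescaled visual tokens.

    Each token is a list of floats of the same width. `keep` lists the
    indices of visual tokens to retain (all of them when None).
    """
    if keep is None:
        keep = range(len(visual_tokens))
    selected = [[strength * x for x in visual_tokens[i]] for i in keep]
    # Concatenation happens along the sequence axis, not the feature axis.
    return list(text_tokens) + selected


# Hypothetical embeddings: 2 text tokens and 3 visual tokens of width 4.
text = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
visual = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 3.0]]

full_context = build_context(text, visual)
edited_context = build_context(text, visual, keep=[0, 2], strength=0.5)
```

Since cross-attention accepts a context of any length, lengthening or shortening the sequence this way requires no change to the backbone, which is consistent with the abstract's claim about adapter-based integration.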

[CV-204] Enlight: Fast Low-Light Image Enhancement via Multi-Objective Optimization and Shadow-Aware Refinement

Link: https://arxiv.org/abs/2606.21674
Authors: Nirjhor Datta, M. Sohel Rahman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present ENLIGHT, a fast and training-free framework for low-light image enhancement based on direct optimization of a perceptual objective. Unlike deep learning approaches that require large-scale training data and supervision, ENLIGHT operates in a zero-shot manner by optimizing image quality at inference time. The method employs a two-stage global-to-local optimization strategy. In the first stage, ENLIGHT performs global illumination adjustment to improve visibility while maintaining structural consistency and avoiding excessive noise enhancement. In the second stage, a shadow-aware refinement selectively improves low-intensity regions through masked local optimization, enhancing visibility without overexposure. To balance quality and efficiency, we introduce two modes: Fast, which uses a multi-objective formulation combining entropy, gradient preservation, and noise regularization, and Ultrafast, which reduces computational cost via a lightweight approximation of the same objective. The framework is optimizer-agnostic and supports both evolutionary and lightweight local search methods. Experiments on BAID, Backlit300, LIME, MEF, NPE, and DICM demonstrate that ENLIGHT achieves competitive perceptual quality (MUSIQ, NIQE, BRISQUE) with significantly lower inference time. Qualitative results further show improved contrast, preserved structural details, and controlled noise amplification, making ENLIGHT a practical and interpretable alternative to learning-based methods.
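
Optimizing an image-quality objective at inference time can be illustrated with a deliberately reduced sketch: a search over gamma values that maximizes histogram entropy alone. The gamma parameterization, the candidate grid, and the pixel values are assumptions, and the gradient-preservation and noise terms named in the abstract are omitted.

```python
import math


def entropy(pixels, bins=8):
    """Shannon entropy (nats) of the intensity histogram; pixels lie in [0, 1]."""
    counts = [0] * bins
    for p in pixels:
        counts[min(int(p * bins), bins - 1)] += 1
    n = len(pixels)
    return -sum(c / n * math.log(c / n) for c in counts if c)


def apply_gamma(pixels, gamma):
    return [p ** gamma for p in pixels]


def best_gamma(pixels, candidates=(0.4, 0.6, 0.8, 1.0)):
    """Exhaustive search; any optimizer could replace this loop."""
    return max(candidates, key=lambda g: entropy(apply_gamma(pixels, g)))


# Hypothetical under-exposed intensities.
dark = [0.02, 0.03, 0.05, 0.05, 0.1, 0.2]
gamma = best_gamma(dark)
enhanced = apply_gamma(dark, gamma)
```

The search needs no training data, which is the sense in which such a method is zero-shot; a fuller objective would add terms that discourage noise amplification and loss of edges.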

[CV-205] UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Link: https://arxiv.org/abs/2606.21661
Authors: Jiehui Huang, Yuechen Zhang, Bin Xia, Jiahao Wang, Xu He, Zhenchao Tang, Meng Chu, Xin Tao, Pengfei Wan, Jiaya Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.

[CV-206] A DVDrive Approach for doScenes Instructed Driving Challenge

Link: https://arxiv.org/abs/2606.21623
Authors: Zijian Fu, Xiangyang Chu, Mengshi Qi, Huadong Ma, Guanghao Zhang, Wei Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Instruction-conditioned trajectory prediction is an emerging problem in autonomous driving, where a model predicts the future ego trajectory not only from visual scene context and historical motion, but also from a natural-language maneuver instruction. This paper presents our submission to the doScenes Instructed Driving Challenge, built upon OmniDrive, a vision-language-action driving agent with 3D perception, reasoning, and planning capabilities. We adapt OmniDrive to the doScenes setting by training it on instruction-annotated nuScenes scenes and generating a 6-second ego trajectory represented by 12 future waypoints. To improve multi-view visual grounding, we further introduce a DVPE-style divided-view perception module into the OmniDrive perception head. Instead of attending globally to all camera features, the proposed module groups query features and image tokens into divided local view spaces and performs visibility-aware cross-attention within each view. This design reduces irrelevant cross-view interference and helps the model better align language instructions with local driving-relevant visual evidence. The code is publicly available at: this https URL.

[CV-207] Cross-Modal Corroboration for Annotation-Free Wildlife Monitoring CVPR

Link: https://arxiv.org/abs/2606.21613
Authors: Bharath Pillai, Varun Viswapriyan, Christopher Stewart, Tanya Berger-Wolf, Jenna Kline
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Presented at the 2026 CV4Animals Workshop, colocated with CVPR

Abstract:Scaling wildlife monitoring for real-world conservation deployments requires automated analysis of smart sensors that operate under severe annotation scarcity. We propose leveraging expert knowledge of species activity patterns as an annotation-free validation signal for multimodal monitoring pipelines. We operationalize agreement as the alignment of independently derived hourly activity curves both with each other and with published behavioral priors, a three-way convergence that rules out shared-data confounds and dataset-internal correlation as alternative explanations. Our vision pipeline combines zero-shot species detection via BioCLIP 2, sliced inference to handle deployment-constrained camera positioning, and geometry-based geographic localization from camera trap imagery. Our acoustic pipeline detects species vocalizations via a fine-tuned classifier. We validate the pipeline on a breeding herd of Milu deer and demonstrate that both modalities independently recover activity patterns consistent with known deer behavioral ecology with minimal manual annotation. The framework applies to species detectable in both visual and acoustic modalities for which behavioral priors are documented in the literature, suggesting a practical path toward self-validating wildlife-monitoring pipelines at conservation scale.
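
One way to score the three-way agreement between hourly activity curves is a pairwise correlation, taking the weakest pair as the overall score. The abstract does not name the agreement measure, so Pearson correlation, the minimum rule, and every number below are assumptions for illustration only.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical 24-bin curves (index = hour of day) with dawn and dusk peaks.
peak_vision = {5: 9, 6: 12, 7: 12, 17: 9, 18: 12, 19: 9}
peak_acoustic = {5: 7, 6: 7, 7: 7, 17: 5, 18: 5, 19: 5}
vision = [peak_vision.get(h, 2) for h in range(24)]      # detections per hour
acoustic = [peak_acoustic.get(h, 1) for h in range(24)]  # calls per hour
prior = [1.0 if h in peak_vision else 0.2 for h in range(24)]  # stand-in prior

pairwise = {
    "vision-acoustic": pearson(vision, acoustic),
    "vision-prior": pearson(vision, prior),
    "acoustic-prior": pearson(acoustic, prior),
}
agreement = min(pairwise.values())
```

Requiring all three pairs to agree is what makes the check informative: two sensor pipelines could correlate through a shared artifact, but they are unlikely to match an independently published prior by accident.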

[CV-208] CurvSegFlow: Time-Conditioned Flow Matching for Robust Segmentation of Curvilinear Structures in Noisy Biomedical Images

Link: https://arxiv.org/abs/2606.21608
Authors: Sidi Mohamed Sid’El Moctar, Achraf Ait Laydi, Alexandre Beber, Marcus Braun, Zdenek Lansky, Yousef El Mourabit, Helene Bouvrais
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Accurate segmentation of curvilinear structures remains challenging in biomedical imaging due to their thin geometry, complex topology, and sensitivity to noise. This is particularly critical for microscopy images of cytoskeletal networks, where low signal-to-noise ratios and dense filament crossings often lead to fragmented or inaccurate segmentation. In this work, we propose CurvSegFlow, a segmentation framework based on time-conditioned flow matching. Instead of predicting a segmentation mask in a single pass, the method models segmentation as a dynamic process that progressively refines a noisy initialization into the target structure through a learned velocity field. The proposed model combines a U-Net backbone with a triple-term loss function and temporal embeddings to guide the refinement process across reconstruction stages. This formulation enables gradual error correction and improves the continuity of thin structures. CurvSegFlow is evaluated on multiple synthetic and real microtubule datasets, as well as on public benchmarks of retinal vessels, corneal nerves, and coronary arteries. Across datasets, the method achieves competitive or superior performance compared to established segmentation models, with consistent improvements in precision and structural continuity, particularly under low signal-to-noise conditions. These results show that flow-based iterative refinement provides a robust and general framework for curvilinear structure segmentation. Overall, the proposed approach improves segmentation quality in challenging imaging conditions and generalizes effectively across modalities without architectural changes.
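
Refinement through a velocity field can be sketched as Euler integration from t = 0 to t = 1. In the paper the velocity comes from a trained, time-conditioned U-Net; the `oracle_velocity` below is a hypothetical stand-in that knows the target mask, used only to show the integration loop.

```python
def oracle_velocity(x, t, target):
    """Stand-in for a learned field: points from x toward a known target.

    On the straight path x_t = (1 - t) * x0 + t * x1 this equals x1 - x0,
    the regression target used in linear-path flow matching.
    """
    return [(y - xi) / (1.0 - t) for xi, y in zip(x, target)]


def integrate(x0, velocity, steps=10):
    """Euler integration of dx/dt = velocity(x, t) over t in [0, 1)."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 4-pixel mask and a noisy initialization.
target = [0.0, 1.0, 1.0, 0.0]
noisy = [0.3, 0.4, 0.9, 0.6]
refined = integrate(noisy, lambda x, t: oracle_velocity(x, t, target))
```

Each step corrects only part of the remaining error, which is the mechanism behind the gradual error correction the abstract describes; with a learned field the step count trades accuracy against inference time.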

[CV-209] T-MOR: Learning Motion-Aware Skeleton Representations for Human Action Recognition

Link: https://arxiv.org/abs/2606.21607
Authors: Di Yang, Mahmoud Ali, Quan Kong, Gianpiero Francesca, Francois Bremond
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models such as CLIP have recently achieved strong performance on a wide range of visual understanding tasks. However, most existing models rely primarily on appearance-level supervision from images or videos and do not explicitly model human motion, which is essential for fine-grained, human-centric action recognition because actions are defined by temporally structured and physically grounded body movements. To address this problem, we propose Transferable skeleton MOtion Representation (T-MOR), a motion-aware framework that learns transferable action representations from skeleton sequences with the aid of video and language supervision during training. T-MOR adopts a multi-modal contrastive learning scheme that aligns skeleton motion with visual and textual representations, while performing inference using only lightweight skeleton inputs. To support large-scale pre-training, we construct PoseCap-1M, a new dataset that contains over one million synchronized video, skeleton, and text triplets covering diverse human activities. We evaluate T-MOR on a range of human-centric action recognition benchmarks, including action classification and frame-wise temporal detection. Experimental results show that T-MOR consistently improves performance across multiple datasets, such as Toyota Smarthome, Penn Action, UAV-Human, TSU, and Charades. In addition, T-MOR demonstrates strong generalization ability in few-shot and zero-shot settings, highlighting the effectiveness of motion-centric and embodied representations for transferable action understanding.

[CV-210] μMatch: Foundation Models for Semi-supervised Learning and Domain Adaptation in EM

Link: https://arxiv.org/abs/2606.21605
Authors: Marei Freitag,Olesia Korchevaia,Luca Freckmann,Anwai Archit,Constantin Pape
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision foundation models have substantially advanced computer vision, enabling state-of-the-art performance in zero- and few-shot settings. They have been successfully applied to biomedical imaging tasks ranging from organ segmentation in computed tomography to cell segmentation in light microscopy. Electron microscopy (EM) is a central modality for analyzing cellular ultrastructure due to its nanometer-scale resolution. However, the application of foundation models in EM has so far been limited to specific organelles, such as mitochondria, largely due to the diversity of segmentation tasks and the scarcity of comprehensively annotated data. As a result, EM segmentation still predominantly relies on supervised learning, requiring extensive manual annotation and limiting ultrastructural analysis. To address this gap, we propose μMatch, a framework for semi-supervised learning and domain adaptation that leverages foundation models. We implement state-of-the-art student-teacher-based methods and evaluate multiple foundation models (SAM, SAM2, μSAM, DINOv2/v3) on challenging EM tasks, including mitochondrion, nucleus, and neurite segmentation. Our results demonstrate consistent improvements over strong baselines and highlight a path toward substantially reducing the annotation effort in EM.

[CV-211] ϕ-Scene: Physically Grounded Image-to-3D Scene Reconstruction

Link: https://arxiv.org/abs/2606.21596
Authors: Haodong Li,Lulu Shao,Haolin Lu,Yu Fu,Yen-Ru Chen,Seemandhar Jain,Manmohan Chandraker
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Reconstructing compositional 3D scenes from a single image is a fundamental challenge in 3D world modeling. Recent methods can recover high-fidelity, complete 3D objects and predict plausible scene arrangements, but most still treat scene reconstruction primarily as a visual and geometric prediction problem. Their outputs may therefore contain floating objects, interpenetrations, or unstable-contact artifacts, limiting their physical validity and downstream usability in simulation, robotics, and interactive environments. We present ϕ-Scene, a physically grounded approach to open-vocabulary and compositional image-to-3D scene reconstruction. The key premise is that a reconstructed scene should not be treated merely as a set of objects with predicted poses, but as a stable physical system. Accordingly, ϕ-Scene formulates reconstruction as topology-driven physical assembly: it infers how objects support one another, orders them accordingly, and progressively settles each object against its already stabilized support context. For each object in topological order, SDF-based optimization first resolves penetrations against the pre-settled support context, and rigid-body simulation then settles the object into a stable contact configuration under real-world physical constraints. Experiments on 3D-Front show that ϕ-Scene achieves the strongest overall performance among out-of-domain methods and remains highly competitive with in-domain baselines. Human and VLM evaluations further show strong preference for ϕ-Scene in visual quality, reference alignment, and physical plausibility. Finally, dedicated physical plausibility metrics covering static contact quality and dynamic stability demonstrate that ϕ-Scene substantially reduces penetration artifacts while producing much lower post-simulation drift, indicating more stable and physically grounded 3D scenes.

[CV-212] Boundary-by-Mask: Few-Shot Instance Segmentation with Mask-Conditioned Boundary Learning for Texture-Poor Industrial Parts IROS2026

Link: https://arxiv.org/abs/2606.21594
Authors: Yutaka Yoshinaga,Naoya Chiba,Koichi Hashimoto
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 8 figures, accepted to IROS 2026

Abstract:Recent advances in large pre-trained models have led to remarkable progress in instance segmentation on general images. However, industrial scenarios remain challenging. Instance definitions are often application-specific and inconsistent, and the domain gap from general imagery is substantial due to weak textures and limited contextual cues. Consequently, a direct application of existing models is unreliable. We propose Boundary-by-Mask, a few-shot instance segmentation framework that supervises boundaries instead of interior appearance. Given a few RGB images and corresponding instance masks, the method extracts rich visual features using a foundation-model encoder and trains a lightweight Signed Distance Function (SDF) head to predict boundary-aware distance maps. Segmentation masks are obtained through an SDF-to-mask reconstruction process. By explicitly estimating contours, the framework achieves reliable instance separation even on low-texture and color-uniform surfaces. The instance definition is conditioned by the instance mask. Replacing the mask specifies the segmentation target, such as the whole object or a sub-part. A pixel-wise shallow MLP head enables rapid training. Experiments on industrial parts and food items with ambiguous boundaries show strong few-shot generalization, robustness in feature-poor conditions, and precise control over mask-level targets.

[CV-213] Radial Basis Function Networks as Projection Heads in Self-Supervised Learning

Link: https://arxiv.org/abs/2606.21590
Authors: Andreas Schliebitz,Heiko Tapken,Martin Atzmueller
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, 2 figures

Abstract:Self-supervised learning (SSL) typically relies on a backbone encoder followed by a small multilayer perceptron (MLP) projection head, which is conventionally discarded after training, while backbone quality is assessed via costly linear probing on labeled data. We argue that this approach, including discarding the projector, is computationally wasteful. Instead, we propose replacing the MLP head with a radial basis function network (RBFN), whose interpretable center and shape parameters can be exploited to judge representation quality without labels or a separate classifier. To this end, we introduce Scale-Normalized Separation (SNS), a novel label-free quality metric derived solely from the kernel centers and shapes learned during training. Across five canonical SSL architectures (MoCo, SimCLR, BYOL, SwAV and SimSiam) and four image classification datasets, we show that RBFN projection heads are competitive drop-in replacements for standard MLP projectors. We recommend constructing them with three RBF layers activated by the Gaussian radial basis function. Moreover, SNS exhibits strong to very strong positive correlation with established logistic regression metrics, demonstrating that a trained RBFN projector can act as a reliable proxy for backbone representation quality. We additionally publish a novel PyTorch-compatible image classification dataset based on Google’s Open Images V7 to facilitate reproducible research into representation learning.
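
For readers unfamiliar with Gaussian RBF layers, the following minimal sketch shows how one such layer turns an input vector into kernel activations from its centers and widths. The abstract does not give the exact parameterization of the "shape" parameters, so treating them as a per-kernel width `sigma` is an assumption, and `gaussian_rbf_layer` is a hypothetical name. The SNS metric itself is not reproduced here.

```python
import math


def gaussian_rbf_layer(x, centers, sigmas):
    """Return one activation per kernel: exp(-||x - c||^2 / (2 * sigma^2))."""
    activations = []
    for c, s in zip(centers, sigmas):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        activations.append(math.exp(-sq_dist / (2.0 * s ** 2)))
    return activations


# Two kernels in a 2D feature space (illustrative values only).
centers = [[0.0, 0.0], [1.0, 1.0]]
sigmas = [1.0, 0.5]
acts = gaussian_rbf_layer([0.0, 0.0], centers, sigmas)
```

An input sitting exactly on a center gives activation 1 for that kernel, and activations decay with distance at a rate set by the width. This is what makes the centers and widths directly interpretable.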

[CV-214] The Unreasonable Effectiveness of VLMs for Zero-shot Procedural Mistake Detection

Link: https://arxiv.org/abs/2606.21579
Authors: Serdar Ozsoy,Lars Doorenbos,Federico Spurio,Gianpiero Francesca,Juergen Gall
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Procedural mistake detection is important for quality control and user assistance across many disciplines. Recent work in this field has achieved significant gains by using the reasoning capabilities of Video-Language Models (VLMs) as components within multi-stage pipelines, which consist of separate modules for supervised temporal action segmentation, error detection, and explainability. Consequently, they remain dependent on tailored training datasets and require task-specific training, limiting their wider applicability. To remedy this, we introduce zero-shot procedural mistake detection and propose a unified Zero-shot Procedural Mistake detection (ZeProM) framework that jointly solves procedural mistake detection and temporal action segmentation with a single pre-trained VLM. By evaluating our framework on two canonical mistake detection benchmarks, EgoPER and CaptainCook4D, we find that ZeProM can perform these tasks successfully, while approaching, or even outperforming, the performance of fully supervised methods. For instance, we achieve a 4.4 point improvement in EDA and a 2.0 point improvement in F1@.5 on average over all five EgoPER tasks compared to the strongest supervised methods. Overall, our results show the potential of unified methods for procedural mistake detection, and we hope this will steer the field away from highly complex pipelines and toward more generally applicable solutions.

[CV-215] A Smart Classroom Behavior Analysis Framework with a New Highly Congested Classroom Dataset

Link: https://arxiv.org/abs/2606.21568
Authors: Wei Xu,Maoxiang Chu,Yuelong Fan,Guanghao Liao,Yinxiang Yu,Zhi Chen,Haotian Wang,Yutian Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 18 figures and 16 tables

Abstract:Student behavior detection is important for intelligent classroom analysis but remains challenging in large-class scenarios due to dense instance co-occurrence, asymmetric occlusion, depth-wise scale variation, and fine-grained semantic degradation in distant targets. Existing classroom behavior datasets and general-purpose detectors are insufficient to characterize and address these challenges. This paper constructs the Highly Congested Classroom Behavior (HCCB) dataset, containing 50,229 student behavior instances across seven categories: reading, writing, heads up, sleeping, looking around, bowing head, and using phone. HCCB provides a challenging benchmark that integrates dense distributions, severe occlusion, scale variation, and fine-grained behavioral semantics. To address these issues, we propose ODER-HSFNet, a YOLO-based detection framework tailored to highly crowded classrooms. At its core, ODER-HSFNet introduces three task-specific innovations: the Occlusion-aware Deformable Edge Rectifier (ODER), which strengthens boundary evidence under occlusion; the Hypergraph-State Spatial Fusion (HSSF) module, which integrates local structure enhancement, state-space contextual modeling, and high-order relation aggregation; and the Occlusion-Calibrated Detection Head (OCDetect), which suppresses low-quality Pre-NMS candidates and reduces false positives from occlusion boundaries and neighboring instances. Experiments on two classroom behavior detection datasets show that ODER-HSFNet outperforms mainstream YOLO-series methods, achieving 60.60%/80.12% mAP50:95/mAP50 on HCCB and 57.36%/74.65% on SCB-D3-S. Ablation studies further verify the effectiveness of the proposed design for highly crowded classroom behavior detection.

[CV-216] Compressing Observation History into Agent Memory: Distilling Transformers into Recurrent Transformers

Link: https://arxiv.org/abs/2606.21562
Authors: Philippe Weinzaepfel,Christian Wolf,Bülent Mert Sariyildiz,Guillaume Bono,Gianluca Monaci
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Transformers are AI’s workhorse with strong performance in modeling sequential data, but their computational cost becomes prohibitive when processing long sequences. We target long-horizon streaming vision and robotics applications like map-free pose estimation, where it is particularly impractical to store and maintain a history of observations. Recurrent Transformers address this limitation by maintaining fixed-size memory but their performance lags behind that of transformers operating over the full observation history. We argue that this gap does not stem from architectural limitations, but from differences in how these models learn to compress past information. Without access to an observation history, recurrent models must explicitly decide what to retain in memory at each step, a significantly harder learning problem. In this work, we propose a distillation approach that transfers the compression strategy of a classical full-history transformer to a recurrent variant. We enable this by designing a teacher model that explicitly compresses its observation history into a fixed-size bottleneck representation. By directly supervising the student’s memory with this bottleneck representation, we align the two compression mechanisms. We show that this approach makes it possible to train a recurrent latent robotic memory with linear-time complexity while substantially narrowing the performance gap to full-history transformers.

[CV-217] LOGOS: LiDAR-Only Gaussian Elevation Splatting for Unified Tiny Obstacle Segmentation

Link: https://arxiv.org/abs/2606.21527
Authors: Nan Ming,Yeqiang Qian,Chunxiang Wang,Ming Yang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robust obstacle segmentation is essential for the safety of intelligent robots, where LiDAR-based perception systems play a fundamental role in the robot-environment interaction. While extensive LiDAR-based approaches have demonstrated high performance on common obstacles in urban scenarios, their results on tiny obstacles such as curbs, gravel, and potholes remain unsatisfactory due to the significant similarity between tiny obstacles and inherent road undulations. Moreover, their segmentation accuracy even deteriorates sharply when the LiDAR scans suffer from degradation in challenging off-road scenes. To overcome these bottlenecks, we propose LOGOS, a LiDAR-only unified tiny obstacle segmentation system, which models the road surface as a continuous mixture of 2D Gaussian primitives and distinguishes tiny obstacles via high-precision elevation estimation. Unlike existing Gaussian splatting methods that rely on iterative RGB training, LOGOS is a backpropagation-free LiDAR-only approach. It directly estimates Gaussian parameters via a freespace-aware initialization by incrementally pruning non-road primitives using smoothness constraints. Subsequently, pointwise signed distances are computed via a novel normal-aware elevation splatting function, ensuring robustness to both flat and sloped terrains. We evaluate LOGOS on a highly heterogeneous benchmark of point cloud frames collected from urban mobility scenarios and mining haulage off-road environments. These data are practically acquired using different LiDAR sensors and exhibit large variations in point density, terrain roughness, and obstacle types. Experiments on the road and off-road scenes demonstrate that LOGOS significantly outperforms other state-of-the-art methods, particularly in degraded point cloud regions and challenging off-road scenarios, while maintaining real-time efficiency.

[CV-218] Decoupling the Declarative from the Procedural in Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.21496
Authors: Nikolaos Tsagkas,Andreas Sochopoulos,Chris Xiaoxuan Lu,Oisin Mac Aodha,Alexandros Kouris
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deploying generalist robotic agents in the real world requires transferable skills. Specifically, a policy trained to clone a behavior from object-specific demonstrations must generalize beyond that object, otherwise data collection requirements become intractable. Recently, fine-tuning of pre-trained billion-parameter Vision-Language Models (VLMs), initially on large-scale robot datasets and then on fewer scenario-specific demonstrations, has emerged as the predominant paradigm for designing Vision-Language-Action (VLA) models. While these policies achieve state-of-the-art manipulation performance in-distribution, they remain brittle to minor spatial, semantic, and task variations. In this work, we address the inability of current models to decouple the declarative (i.e., concepts and entity semantics) from the procedural knowledge (i.e., how to do something) encoded in their parameters, which is a fundamental bottleneck for zero-shot skill transfer to novel objects. To address this, we propose w²VLA, a new VLA model with restructured information flow. Rather than feeding all multimodal tokens from the VLM encoder into a large, opaque transformer-based action expert, our approach modulates the robot state sequence with visual, spatial, and skill information in a compositional and interpretable manner. Unlike popular, state-of-the-art VLAs, we show that our modular approach successfully decouples knowledge representations, enabling robust behavior cloning and unprecedented zero-shot skill transfer capabilities across dissimilar, unseen objects.

[CV-219] Semi-Supervised Vision-Language-Action Model

Link: https://arxiv.org/abs/2606.21493
Authors: Hongyang He,Jiuming Liu,Victor Sanchez
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments:

Abstract:Vision-Language-Action (VLA) models enable robots to predict actions directly from visual observations and language instructions, but adapting them to new environments still depends on costly action-labeled demonstrations. To reduce this dependence, we study semi-supervised VLA adaptation under limited supervision signals, where only a small portion of trajectories contain robot actions and the remaining trajectories provide action-unlabeled vision-language observations. Unlike standard semi-supervised learning, the missing supervision is an embodied action signal that must be visually grounded, language-consistent, physically feasible, and temporally stable. To address this problem, we propose SemiVLA, a self-distilled teacher-student framework that learns from reliable pseudo-actions on unlabeled trajectories. SemiVLA introduces a VLA-specific reliability controller to assess vision-language alignment, action feasibility, and temporal transition consistency, and further updates the teacher through a Bottleneck-Projected Alignment Update to avoid noisy feedback contamination. With OpenVLA as the backbone, SemiVLA consistently improves multiple PEFT strategies across LIBERO and CALVIN. Under 10% labeled trajectories, SemiVLA with Selective LoRA achieves 89.0% average success on LIBERO, outperforming supervised LoRA by 8.0 points without extra inference cost.

[CV-220] ASCII Art Turns LLMs into VLA Controllers

Link: https://arxiv.org/abs/2606.21470
Authors: Yitao Jiang,Roy Xing,Luyang Zhao,Brian Plancher,Muhao Chen,Devin Balkcom
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision–Language–Action (VLA) controllers are often built by extending vision–language models (VLMs) with action supervision, relying on multimodal backbones with large data and compute requirements. We demonstrate that a text-only large language model (LLM) can be adapted into a VLA-style controller when visual observations are rendered into a text input using an ASCII representation. This ASCII-as-vision interface enables existing training and deployment stacks for LLMs to efficiently condition on visual state, follow natural-language instructions, and produce constrained, executable actions. We fine-tune and compare multiple LLMs and VLMs across model families and scales, using both expert demonstrations from a planning-based teacher, as well as DAgger for iterative improvement. In a 2D manipulation benchmark, in both simulation and on a physical manipulator, the resulting controllers can identify task-relevant entities and plan feasible action sequences. Our results suggest that ASCII rendering can serve as a lightweight, interpretable modality bridge from images to text, complementing conventional VLA pipelines, and opening directions for VLA research with text-only backbones.
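
As a rough illustration of the ASCII-as-vision idea, the sketch below maps a small grayscale grid to characters and embeds the result in a text prompt. The character ramp, the `to_ascii` helper, the prompt layout, and the example instruction are all assumptions made for illustration; the abstract does not specify the rendering scheme the authors use.

```python
# Hypothetical brightness ramp, darkest to brightest (not taken from the paper).
RAMP = " .:-=+*#%@"


def to_ascii(image):
    """Map a 2D grid of integer intensities in [0, 255] to rows of ASCII characters."""
    rows = []
    for row in image:
        chars = []
        for value in row:
            clamped = max(0, min(255, value))
            index = clamped * (len(RAMP) - 1) // 255  # integer bucket in the ramp
            chars.append(RAMP[index])
        rows.append("".join(chars))
    return "\n".join(rows)


# A tiny 3x4 "observation" with one bright column and one mid-gray pixel.
frame = [
    [0, 0, 255, 0],
    [0, 128, 255, 0],
    [0, 0, 0, 0],
]
ascii_frame = to_ascii(frame)

# The rendered text can then be placed in an ordinary LLM prompt.
prompt = "Observation:\n" + ascii_frame + "\nInstruction: push the block left."
```

Because the observation is plain text, it can pass through an unmodified tokenizer and a text-only training stack, which is the property the abstract emphasizes.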

[CV-221] Native space based pipelines outperform template space based pipeline in subcortical segmentation

Link: https://arxiv.org/abs/2606.21463
Authors: Tomás Lima,Daniel Novák,Eduard Bakštein
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures

Abstract:Accurate segmentation of subcortical regions is critical for neurosurgical planning and functional research. Most automated methods rely on template space coregistration, which may compromise patient-specific accuracy, particularly in small structures. We identify a need to evaluate whether native space approaches offer a measurable advantage, which we evaluate in the context of movement disorders. We developed two UNet-based segmentation pipelines of the Subthalamic Nucleus (STN) - a common surgical target in Parkinson’s Disease - and the neighbouring Red Nucleus (RN) and Substantia Nigra (SN). We collected 7T and 3T MRI data from five public datasets. The pipelines were evaluated in native space against manual labels. We further investigated the effect of the template resolution. Motivated by the hypothesis that models may better learn target boundaries at higher field strength, we tested the transferability of 7T-trained models to 3T clinical images, and whether synthetic 3T training data - generated via a disentangled representation learning method - could help bridge this domain gap. On held-out 7T data, the native pipeline consistently outperformed the template one. For the STN, native-space Dice reached 0.775 ± 0.055 versus 0.713 ± 0.051 (1 mm template), with HD95 of 0.79 ± 0.24 mm versus 1.17 ± 1.10 mm, respectively. Similar advantages were observed for the RN and SN. Increasing template resolution did not improve accuracy. When applied to 3T images, all models showed a considerable performance drop. Adding synthetic 3T data yielded only modest improvements, though without degrading 7T performance. Native-space segmentation is preferable for applications requiring patient-specific anatomical fidelity, such as surgical planning in PD. Bridging the 7T-to-3T domain gap remains an open challenge, motivating future work on domain adaptation tailored to subcortical structures.
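
The Dice scores quoted above measure overlap between a predicted mask and a manual label. A minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks follows. The helper name and the toy masks are illustrative only, and the HD95 distance metric is not covered.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 6-voxel masks: 3 voxels each, 2 of them shared.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, truth)
```

For small structures such as the STN, a shift of only a few voxels changes the intersection substantially, which is one reason Dice is sensitive to registration error in template-space pipelines.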

[CV-222] Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Exploring Query-Based Segmentation and Increased Spatial Context for Outdoor Scene Understanding ICRA2026

Link: https://arxiv.org/abs/2606.21456
Authors: David Pascual-Hernández,Roberto Calvo-Palomino,Inmaculada Mora-Jiménez,Jose María Cañas-Plaza
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Ranked 5th in the GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the IEEE ICRA 2026 Workshop on Field Robotics

Abstract:In this report, we present our submission to the GOOSE 2D Fine-Grained Semantic Segmentation Challenge, organized as part of the Workshop on Field Robotics at ICRA 2026. The challenge combines data from the GOOSE and GOOSE-Ex datasets, which comprise more than 13k images captured from 4 distinct camera setups, annotated using a hierarchical taxonomy of 56 fine-grained classes and 11 broader categories. Starting from SegFormer as a baseline, we progressively improve segmentation performance through increased training crop sizes, a transition to the query-based Mask2Former architecture, and test-time augmentation. Our experiments show that query-based segmentation significantly outperforms the baseline model. Furthermore, increasing the crop size used during training yields substantial gains, highlighting the relevance of preserving scene context for fine-grained semantic disambiguation. Our final submission, using test-time augmentation, achieves an mIoU of 69.6% on the challenge test set, providing a strong baseline for fine-grained semantic segmentation in outdoor environments. To facilitate reproducibility and future research, code and weights will be made publicly available at this https URL .

[CV-223] Synergistic Dual-Branch Adaptation for Multi-modal Generalized Category Discovery

Link: https://arxiv.org/abs/2606.21446
Authors: Yuxun Qu,Minyu Zhou,Yongqiang Tang,Chenyang Zhang,Wensheng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generalized Category Discovery (GCD) aims to classify old categories and discover new ones from unlabeled data. Recent multi-modal approaches introduce retrieved or synthesized texts into a dual-branch architecture to provide semantic cues complementary to visual features. However, the cross-modal synergy in existing dual-branch methods remains coarse and incomplete: the two modalities are encoded independently with the bias and noise in the derived text left unaddressed during encoding, and existing mutual learning strategies operate only on global class-level anchors, lacking fine-grained relational supervision. To address these limitations, we propose the Synergistic Dual-Branch Adaptation (SDBA) framework, which serves as a plug-and-play enhancement compatible with existing dual-branch methods such as GET and TextGCD. SDBA comprises two components: the cross-modal synergistic adapter inserts lightweight adapters into both branches and further injects visual information into the text adapter at each encoder layer to enhance text feature learning during encoding; the neighborhood mutual learning module enforces consistent local neighborhood distributions between the two branches via bidirectional KL divergence, providing fine-grained relational supervision for both old and new classes. Extensive experiments on six benchmarks demonstrate state-of-the-art performance, and consistent improvements on different baselines validate the broad scalability of the proposed framework.
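
The neighborhood mutual learning term described above compares how each branch distributes similarity over a sample's neighbors. The sketch below shows one plausible reading: convert each branch's neighbor similarities into a distribution with a softmax, then sum the KL divergence in both directions. The temperature, the similarity values, and all function names are assumptions; the paper's exact loss may differ.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_kl(p, q):
    """Symmetric penalty: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


# Similarities of one sample to its three nearest neighbors in each branch
# (made-up numbers for illustration).
visual_sims = [0.9, 0.4, 0.1]
text_sims = [0.8, 0.5, 0.2]
p_visual = softmax(visual_sims, temperature=0.1)
p_text = softmax(text_sims, temperature=0.1)
loss = bidirectional_kl(p_visual, p_text)
```

Using both directions means neither branch is treated as the fixed teacher, which matches the "mutual" framing in the abstract. The loss is zero only when the two neighborhood distributions agree.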

[CV-224] MIRCaps: A Large-Scale Mixed-Domain Dataset with Image-Level and Region-Level Captions for Fine-Grained Vision-Language Learning

Link: https://arxiv.org/abs/2606.21419
Authors: Arlindo Luciano Tulumba Roberto,Hyungjoon Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite recent progress in Vision-Language Models (VLMs), mixed-domain image-caption datasets for both general-purpose and CCTV-based video surveillance systems remain limited. To address this gap, we introduce a large-scale multimodal dataset comprising 141,364 images, 981,947 image-level captions, 1,742,264 region-level captions, and 1,391,779 bounding box annotations. Each image is associated with an average of seven image-level captions describing different aspects of the overall scene, as well as seven region-level captions for each annotated bounding box. These complementary caption types are designed to help VLMs learn fine-grained visual attributes, including object categories, estimated sizes, colors, actions, states, and surrounding environmental context. We demonstrate the effectiveness of the dataset on two important downstream tasks: image captioning and object detection. Experimental results show that lightweight VLMs, including SmolVLM-256M-Instruct, BLIP, BLIP2, and Qwen2.5-VL 3B-Instruct, can be effectively fine-tuned using our dataset. Our dataset and code are publicly available at this https URL.

[CV-225] Robot Self-Improvement via Human-Video Dynamics Models

Link: https://arxiv.org/abs/2606.21406
Authors: Hanzhi Chen,Anran Zhang,Simon Schaefer,Kejia Chen,Shi Chen,Daniel Cremers,Oier Mees,Stefan Leutenegger
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A central question in robot learning is how to acquire skills from the kinds of data that humans learn from: passive observation, embodied practice, and the experience of failure. Human videos provide the first of these in abundance, and prior work has shown they can initialize useful policies. Far less clear is whether they can support the second and third: whether priors extracted from human videos can ground a robot’s own attempts well enough to evaluate them, correct them, and improve from them. In this work, we show that human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments, providing the predictive foundation required for robots to autonomously improve from their own rollouts and failures. We introduce Dynamics-Guided Action Correction (DGAC), a training-free approach that leverages these adapted models to repair failed states: each failure becomes a query for which the learned models propose and rank corrective actions, turning failures into supervision for the next policy update. Across seven real-world manipulation tasks spanning both a mobile manipulator and a static manipulator arm, our approach improves success rates from 40% to 81% across multiple policy backbones, demonstrating cross-embodiment robot self-improvement from human-video priors. These results show that human priors and robot failures can be combined to enable scalable autonomous policy improvement. Project page: this https URL.

[CV-226] VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.21386
Authors: Florian Seligmann,Emiliyan Gospodinov,Enes Ulas Dincer,Gerhard Neumann
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language-action models (VLAs) achieve state-of-the-art performance on many robotic manipulation tasks, yet they can still behave unpredictably in out-of-distribution scenarios. Runtime failure detection is therefore essential for the safe real-world deployment of VLAs. However, existing task failure detectors require computationally expensive action sampling, are based on architectural assumptions that limit their applicability to VLAs, or need access to failure rollouts. We propose VLA-FAIL, a lightweight and broadly applicable failure detection framework for VLAs that combines two novel failure detectors with minimal overhead, without requiring failure data. The first, last-layer Mahalanobis distance (LLMD), detects out-of-distribution states by measuring token-wise deviations in last-layer features relative to the training data. The second, action chunk consistency (ACC), exploits the temporal overlap induced by receding-horizon control and detects failures when consecutive action chunks become inconsistent. To capture the trade-off between detection accuracy and detection latency, we introduce AUCPDT, a threshold-independent metric that jointly evaluates precision, recall, and detection time. Through extensive real-world and simulation experiments, we demonstrate that LLMD and ACC capture complementary failure modes whose combination enables reliable and early failure detection across diverse tasks, frequently outperforming significantly more expensive baseline methods.
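
The action chunk consistency (ACC) idea relies on the fact that, under receding-horizon control, consecutive predicted chunks cover some of the same future time steps. A minimal sketch of that comparison is shown below. The distance measure (mean L2 over the overlap), the `stride` parameter, the threshold, and the function name are assumptions, since the abstract does not state how inconsistency is scored.

```python
import math


def chunk_inconsistency(prev_chunk, next_chunk, stride):
    """Mean L2 distance between the overlapping steps of two consecutive action chunks.

    prev_chunk, next_chunk: lists of action vectors of equal length.
    stride: number of steps executed before next_chunk was predicted.
    """
    overlap = len(prev_chunk) - stride
    if overlap <= 0:
        raise ValueError("chunks do not overlap")
    total = 0.0
    for i in range(overlap):
        a = prev_chunk[stride + i]  # what the earlier chunk planned for this step
        b = next_chunk[i]           # what the newer chunk plans for the same step
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / overlap


prev_chunk = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]
consistent = [[0.2, 0.0], [0.3, 0.0], [0.4, 0.0], [0.5, 0.0]]
diverging = [[0.2, 0.3], [0.3, 0.4], [0.4, 0.5], [0.5, 0.6]]

THRESHOLD = 0.1  # hypothetical value; it would need calibration on success rollouts
flag_ok = chunk_inconsistency(prev_chunk, consistent, stride=2) > THRESHOLD
flag_bad = chunk_inconsistency(prev_chunk, diverging, stride=2) > THRESHOLD
```

A check like this needs no extra forward passes or sampled actions, which is consistent with the abstract's claim of minimal overhead.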

[CV-227] EnTrust: Modeling Inter-Modal Conflict for Trustworthy Multimodal Medical Image Analysis

Link: https://arxiv.org/abs/2606.21384
Authors: Dwarikanath Mahapatra,Abhijit Das,Behzad Bozorgtabar,Zongyuan Ge,Sudipta Roy,Deepak Nayak,Mauricio Reyes,Imran Razzak
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal medical imaging fuses complementary anatomical and functional information, yet modalities frequently disagree in pathologically heterogeneous regions. Current segmentation models handle this in one of two inadequate ways: deterministic fusion that averages away disagreement, or post-hoc uncertainty estimation decoupled from the fusion process that produces it. Both obscure the clinically critical question: why is this prediction unreliable? We present EnTrust, a framework that treats inter-modal conflict as the primary source of predictive uncertainty. Our EnFuse module decomposes multimodal features into three disentangled components: shared anatomical consensus (F_c), modality-specific cues (F_u,m), and spatially localized conflict signals (F_cf), with independence enforced via a cross-covariance objective. This structured decomposition conditions SegDiff, a diffusion-based generative segmentation model whose sampled hypotheses diverge specifically in regions of modal disagreement. TrustMap then translates this hypothesis divergence into calibrated, pixel-wise uncertainty using ensemble entropy, conflict-guided perturbation probing, and a learned calibration head, enabling clinicians to understand not only where predictions are uncertain, but why. Across four benchmarks spanning brain, cardiac, lesion, and oncology domains, EnTrust achieves state-of-the-art segmentation accuracy while reducing calibration error by 40% compared to the strongest baseline. Notably, it outperforms 5x deep ensembles using a single model at roughly half the memory footprint. Code and checkpoints are available at this https URL.

[CV-228] OSOG: A Differentiable Physics-Informed Synthetic Data Engine for Micro-Optical Environments

Link: https://arxiv.org/abs/2606.21381
Authors: Caio Silva
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Optics (physics.optics)
Comments: 9 pages, 5 figures

Click to view abstract

Abstract:Deep learning in computational microscopy is severely constrained by the scarcity of densely annotated datasets. While synthetic data generation has bridged this gap in macroscopic computer vision, traditional graphics engines rely on geometric ray-tracing, failing to capture the micro-optical phenomena required for microscopy. Conversely, while wave-optics formulations exist, rendering them computationally tractable at the scale required for deep learning remains a massive systems challenge. To address this, we introduce the Optical Synthetic Object Generator (OSOG), a high-performance, fully differentiable forward-modeling engine. Drawing on established physical models of diffraction and phase retardation, OSOG maps continuous Optical Path Difference (OPD) calculations into a highly optimized, PyTorch-native Structure-of-Arrays (SoA) architecture. We validate this computational framework across three axes: First, object detection models (YOLOv11-OBB) trained purely on OSOG-generated data achieve robust zero-shot transfer to real-world highly occluded Lysozyme micrographs. Second, we introduce DiffOSOG, demonstrating that the engine’s end-to-end differentiability allows for the exact recovery of continuous optical parameters via curriculum-guided inverse rendering. Finally, OSOG bypasses the $\mathcal{O}(N)$ bottlenecks of sequential ray-tracing, demonstrating sub-linear scaling by synthesizing 40,000 complex wave-optic particles in under 50 milliseconds (>20 FPS). By providing a fast, scalable, and physically grounded tensor pipeline, OSOG enables true real-time, on-the-fly dataset generation.

[CV-229] FLM-Occ: Feed-forward Likelihood Maximization for Efficient Indoor Occupancy Prediction ECCV2026

Link: https://arxiv.org/abs/2606.21373
Authors: Guangcheng Chen,Lihuang Fang,Huaqi Tao,Yicheng He,Li He,Hong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ECCV 2026

Click to view abstract

Abstract:Recent indoor occupancy prediction methods adopt Gaussian primitives as a sparse 3D representation for computational efficiency. However, their training relies on voxel classification, which imposes only local constraints and lacks global supervision on the distribution of the primitives. Therefore, they inevitably predict spurious primitives in empty regions, undermining both representational and computational efficiency. To address this, we propose Feed-forward Likelihood Maximization (FLM), a novel framework that reformulates occupancy prediction as voxel distribution estimation. In FLM, a network is trained to predict a mixture model that maximizes the likelihood over ground-truth occupied voxels in a feed-forward manner. To enable end-to-end training of networks and voxelization of a standard mixture model, we define mixture weights as normalized primitive volumes to implicitly enforce simplex constraints and derive novel voxelization formulas. Based on FLM, we propose FLM-Occ, a novel method capable of relocating randomly initialized primitives over long distances to model a scene. On Occ-ScanNet, FLM-Occ achieves superior accuracy using only 32 superquadrics, 2.7% of the prior SoTA, while running 3.7 times faster.

[CV-230] Graph-of-Differences: Anatomy-Structured Difference Alignment for Medical Image Re-Identification

Link: https://arxiv.org/abs/2606.21368
Authors: Nichula Wasalathilaka,Abhijit Das,Imran Razzak,Dwarikanath Mahapatra
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Medical image re-identification (MedReID) enables longitudinal patient linkage but remains vulnerable to shortcut learning and often produces decisions that clinicians cannot audit against named anatomy. We propose Graph-of-Differences (GoD), which grounds identity comparisons in explicit anatomical structure. Each image is represented as an anatomy graph whose nodes correspond to named anatomical regions; given an image pair, soft node correspondence is established, and differences are computed over matched anatomy. A graph-level difference alignment objective ties these anatomy-matched differences to the global backbone difference, ensuring the retrieval signal is anchored in homologous structures rather than arbitrary spatial tokens. Explanations are defined over named graph nodes and quantitatively audited via node insertion/deletion tests, replacing unstable pixel heatmaps with verifiable structure-level evidence. On internal benchmarks, GoD improves Rank-1 by +7.1 pp on fundus and +3.1 pp on CXR over a strong frozen-backbone baseline, with further gains on zero-shot external transfers confirming that anatomy grounding improves both accuracy and generalization. Code is available at this https URL.

[CV-231] LEViL: Label-Efficient Video Learning via Zero-Shot Distillation over VLM-Generated Pseudo-Label Spaces

Link: https://arxiv.org/abs/2606.21358
Authors: Aslı Çelik
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Supervised video pretraining is a common transfer learning practice for improving downstream action recognition performance. However, it requires large-scale labeled source datasets, and the effectiveness of the learned initialization is influenced by the similarity between the source and target domains. Constructing such labeled pretraining datasets for different target domains is costly and difficult to scale. To address these limitations, this study proposes a label-efficient video learning framework that combines annotation-free video pretraining with target-label-set-aware fine-tuning. During pretraining, a vision-language model (VLM) generates textual descriptions of unlabeled videos, which are processed to construct an interpretable semantic pseudo-label space. A frozen video-language model then produces zero-shot soft target distributions over this space, allowing a student video encoder to learn semantically rich representations without manual source annotations. During downstream adaptation, target-label-set-aware fine-tuning combines supervised learning from labeled target videos with zero-shot distillation over the actual target label set, helping preserve VLM-derived semantic guidance while adapting the pretrained encoder to the target task. Experiments on UCF101 and HMDB51 show that the proposed framework outperforms the compared semi-supervised video action recognition methods across all evaluated limited-label regimes. Moreover, the annotation-free pretraining stage learns transferable representations that provide an effective initialization for full-data fine-tuning, despite relying on a comparatively modest unlabeled pretraining pool.

[CV-232] WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife

Link: https://arxiv.org/abs/2606.21309
Authors: Vandita Shukla,Kilian Meier,Lucie Laporte-Devylder,Camille Rondeau Saint-Jean,Jenna M. Kline,Blair R. Costelloe,Devis Tuia,Fabio Remondino,Benjamin Risse
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce WildBox, a dataset and benchmark for monocular 3D detection of wildlife from drone video, comprising 237,505 3D bounding box annotations across seven African savanna species grouped into six benchmark classes. Annotations follow a KITTI/Omni3D-compatible format in a per-segment scale-normalised camera frame, with instance identities maintained across each segment. We evaluate two open-vocabulary monocular 3D architectures, OVMono3D-LIFT and DetAny3D, under zero-shot, ground-truth 2D box prompt, and supervised fine-tuning protocols. Open-vocabulary 2D foundation models provide usable zero-shot wildlife localisation (50.55 AP@50), but zero-shot 3D detection collapses to 0.00 AP across both architectures and every 2D-input condition tested, including ground-truth 2D box prompts, thus isolating the failure to the 3D stage. Fine-tuning on WildBox recovers performance to 8.68 +/- 0.47 AP-BEV@0.50 and 13.17 +/- 0.69 AP3D macro. Depth contributes 84% of normalised Hausdorff distance after fine-tuning and over 99% in zero-shot, identifying monocular aerial depth as the dominant open problem in this regime. A coarse-to-fine curriculum, i.e. pretraining on a merged zebra class before fine-tuning on the Grevy’s/plains split, improves macro 3D performance with less total compute, with the largest gains on the two zebra subclasses. WildBox is released with video-level splits, evaluation code, and baseline checkpoints to enable progress in 3D wildlife perception from drone video.

[CV-233] A Test-time Actor-Critic Approach to News Images Generation

Link: https://arxiv.org/abs/2606.21304
Authors: Damianos Galanopoulos,Vasileios Mezaris
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MediaEval 2026 Workshop, Amsterdam, NL

Click to view abstract

Abstract:This paper introduces the CERTH-ITI solution for the MediaEval NewsImages 2026 challenge, which focuses on generating images related to news headlines. Inspired by the Actor-Critic paradigm in reinforcement learning, we present a test-time, model-agnostic Actor-Critic Image Generation approach (ACIG). ACIG generates prompts for image creation, produces the images, evaluates the generated results, and if needed refines the image generation prompts accordingly in a feedback loop. ACIG achieved the best results in the NewsImages 2026 challenge, according to the challenge’s leaderboard.
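The generate, evaluate, refine loop described in the abstract above can be sketched generically. This is not the authors' implementation: the function `refine_until_accepted`, its parameters `max_rounds` and `accept_score`, and the three stub callables are all hypothetical stand-ins for a prompt-writing actor, an image generator, and a critic.

```python
def refine_until_accepted(headline, write_prompt, generate, critique,
                          max_rounds=3, accept_score=0.8):
    """Test-time feedback loop: write prompt -> generate -> critique -> refine.

    write_prompt(headline, feedback) -> prompt string (feedback is None at first)
    generate(prompt)                 -> image (any object)
    critique(headline, image)        -> (score in [0, 1], textual feedback)
    Returns the best (score, image, prompt) seen within max_rounds.
    """
    prompt = write_prompt(headline, None)
    best = None
    for _ in range(max_rounds):
        image = generate(prompt)
        score, feedback = critique(headline, image)
        if best is None or score > best[0]:
            best = (score, image, prompt)
        if score >= accept_score:
            break  # the critic is satisfied, so stop refining
        prompt = write_prompt(headline, feedback)
    return best


# Toy stubs (hypothetical) so the sketch runs without any model.
def toy_write_prompt(headline, feedback):
    return headline if feedback is None else f"{headline} | {feedback}"


def toy_generate(prompt):
    return f"image({prompt})"


def toy_critique(headline, image):
    return (0.9, "") if "detail" in image else (0.4, "add detail")


score, image, prompt = refine_until_accepted(
    "Floods hit city", toy_write_prompt, toy_generate, toy_critique
)
print(score, prompt)
```

Because the loop only calls the three functions it is given, any generator or critic can be swapped in, which is one way to read the abstract's "model-agnostic" claim.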

[CV-234] SCOPE: Scale-Consistent One-Pass Estimation of 3D Geometry SIGGRAPH

Link: https://arxiv.org/abs/2606.21300
Authors: Zheng Zhang,Lihe Yang,Tianyu Yang,Chaohui Yu,Yixing Lao,Xiaoyang Guo,Biao Gong,Fan Wang,Hengshuang Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH Conference Papers 2026. 11 pages

Click to view abstract

Abstract:We present SCOPE (Scale-Consistent One-Pass Estimation of 3D Geometry), a novel approach for estimating 3D geometry from extended monocular video sequences, where existing methods struggle to maintain both geometric accuracy and temporal consistency across hundreds of frames. Our approach generates affine-invariant 3D point maps with shared parameters across entire sequences, enabling consistent scale-invariant representations. We introduce three key innovations: viewpoint-invariant geometry aligning multi-perspective points in a unified reference frame; appearance-invariant learning enforcing consistency across exponential timescales; and frequency-modulated positioning enabling extrapolation to sequences vastly exceeding training length. Experiments across diverse datasets demonstrate significant improvements, reducing relative point map error by 24.2% and temporal alignment error by 34.9% on ScanNet compared to state-of-the-art methods. Our approach handles challenging scenarios with complex camera trajectories and lighting variations while efficiently processing extended sequences in a single pass. Project page: this https URL.

[CV-235] Lightweight 3D Feature Pretraining by Bayesian Inversion of 2D Foundation Models

Link: https://arxiv.org/abs/2606.21292
Authors: Marwane Hariat,Gianni Franchi,David Filliat,Antoine Manzanera
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present Casper3D, a lightweight probabilistic framework for converting noisy multi-view 2D foundation-model embeddings into a latent 3D semantic representation. We model view-level semantic features as noisy observations of an underlying 3D semantic state and infer this state with a set-based variational model that incorporates relative pose during multi-view reasoning. Casper3D is trained by predicting held-out semantic observations from novel viewpoints, while remaining aligned with visual and text semantic spaces for open-vocabulary 3D understanding. The framework is backbone-agnostic and applies to both language-aligned and self-supervised embeddings. Experiments show that Casper3D produces more stable 3D semantics than simple multi-view pooling, especially in ambiguous and noisy settings.

[CV-236] NoduLoCC2026: Lung Nodule Localization and Classification Contest from Chest X-Ray Images

Link: https://arxiv.org/abs/2606.21290
Authors: Adnan Mustafic,Halim Benhabiles,Adnane Cabani,Kristhian André Oliveira Aguilar,Romain Amigon,Clément Bardin,Chiara Bentifece,Marin Boehm,Kévin Bouchard,Laura Burattini,Diedre Carmo,Fahima Idiri,Matthis Lahargoue,Ilaria Marcantoni,Hicham Messaoudi,Cyril Meyer,Farid Meziane,Léon Morales,Letícia Rittner,Agnese Sbrollini,Léonard Zipper,Karim Hammoudi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Click to view abstract

Abstract:We propose NoduLoCC2026, a challenge on lung nodule detection and localization in chest X-ray images. We have provided a dataset for both tasks and received submissions from 5 international teams. The participating teams’ solutions are presented in this work along with results on an external dataset used for testing. Proposed methods show good performance on the classification task. The best method shows a balanced accuracy score of 0.72 and AUC-ROC of 0.79. We highlight the limitations of current approaches for the localization task, with the best approach having predicted the correct number of nodules on 53% of the test images with a median distance of 12.83mm, showing that it is a more challenging task than the first one. The challenge website is available via this https URL.

[CV-237] Unsupervised Domain Adaptation for Sim-to-Real Object Pose Estimation with Contrastive Alignment and Pseudo-Label Refinement

Link: https://arxiv.org/abs/2606.21287
Authors: Nidhal Eddine Chenni,Arunkumar Rathinam,Djamila Aouada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Unsupervised domain adaptation (UDA) enables robust transfer of knowledge from simulated to real environments while exploiting a subset of unlabeled target data to improve real-world performance. Existing UDA methods for object pose estimation often rely on global feature matching, larger multi-stage frameworks, or image translation pipelines, which tend to overlook the pose-specific information embedded in feature representations. To address this limitation, we introduce CAPLR, which targets the adaptation of pose-sensitive features in localized regions, ensuring that domain alignment preserves the geometric cues essential for accurate pose estimation. CAPLR achieves UDA with three key components: (1) an Efficient Cross-Domain Pairing strategy that leverages intermediate features to identify pose-similar image pairs across domains without supervision; (2) Contrastive Alignment to perform feature alignment at localized regions in both intermediate and task-specific representations; and (3) Consistency-Based Pseudo-Label Refinement to improve reliability by encouraging stable target predictions. Extensive experiments demonstrate that CAPLR achieves state-of-the-art performance across multiple well-known object pose estimation benchmarks featuring diverse and challenging scenarios.

[CV-238] Beyond Damage Assessment: Recyclable Material Detection in Aerial Disaster Imagery Using a Lightweight Patch-Based Framework

Link: https://arxiv.org/abs/2606.21279
Authors: Mahmoud Hazem,Karim Hammoudi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages

Click to view abstract

Abstract:Nowadays, more and more disasters of different natures are appearing. Several disaster assessment approaches have been developed in order to identify damaged areas from aerial images. These damaged areas contain rich material that could be recycled towards several ecological purposes. In this paper, we present a lightweight approach that permits the efficient detection of recyclable material. Experimental results show the potential of the proposed approach towards localizing recyclable materials. Accordingly, we provide a rare dataset of material images that we labeled towards supporting the development of recyclable material detectors. The dataset of labeled material images is publicly available at: anonymous.

[CV-239] Few-Shot Hyperspectral Aphid Detection via FastGAN Synthetic Data Generation Transformer-Based Classification and Explainable AI

Link: https://arxiv.org/abs/2606.21267
Authors: Ali Saeidan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 29 pages, 7 figures, 1 table

Click to view abstract

Abstract:Early detection of aphid infestation in crops is essential for preventing yield loss and reducing unnecessary pesticide use. Hyperspectral imaging combined with Spectral Information Divergence (SID) analysis offers a non-destructive approach for monitoring plant health; however, deep learning methods applied to hyperspectral data are often limited by small dataset sizes. In this study, a data-efficient generative adversarial network (FastGAN) was employed to augment a hyperspectral SID dataset of faba bean leaves containing healthy and aphid-infested samples. The trained generator produced 10,000 synthetic images preserving structural and spectral characteristics of real samples. Image quality was evaluated using Frechet Inception Distance (FID), demonstrating stable convergence and realistic reconstruction of leaf morphology and infestation patterns. The augmented dataset was used to train four classification architectures: VGG16, ResNet-50, EfficientNet, and Vision Transformer (ViT). Results showed that dataset augmentation significantly improved classification robustness, with performance progressively increasing from classical convolutional networks to transformer-based models. The ViT model achieved the highest accuracy and F1-scores, while EfficientNet provided strong balanced performance and ResNet-50 showed moderate improvements over VGG16. Confusion matrix analysis confirmed reduced false negatives and improved disease detection when using advanced architectures. The findings demonstrate that FastGAN-based augmentation effectively enhances hyperspectral plant disease classification and that transformer-based models provide the most reliable discrimination between healthy and infested leaves.

[CV-240] Spectral GS-SLAM: Observability-Aware Degeneracy-Robust Tracking for Real-Time 3D Gaussian Splatting SLAM IROS2026

Link: https://arxiv.org/abs/2606.21258
Authors: Edward Beng Wai Tan,Siew-Kei Lam,Dongshuo Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted to IROS 2026

Click to view abstract

Abstract:Recent 3DGS-SLAM systems enable real-time operation by leveraging conventional feature matching or ICP-based tracking, thereby avoiding the heavy dense photometric optimization used in earlier approaches. However, feature matching remains prone to failure in textureless environments, while ICP-based tracking struggles in structureless or geometrically degenerate scenes due to ill-conditioned optimization. To address this issue, we propose Spectral GS-SLAM, an efficient yet robust tracking framework that integrates ICP with complementary feature-based constraints. Our method mitigates numerical instability by adaptively compensating under-constrained directions in degenerate scenarios, without interfering with the shared Gaussian representation used for mapping. We further introduce a Gaussian-aware planarity weighting mechanism that exploits the intrinsic covariance structure of 3D Gaussians to characterize scene geometry and guide information fusion. Extensive evaluations on challenging TUM RGB-D sequences demonstrate that Spectral GS-SLAM achieves real-time performance (40.14 FPS) while maintaining consistent tracking in both structureless and featureless environments. The proposed method preserves trajectory integrity in degenerate scenes while maintaining competitive performance in non-adverse conditions.

[CV-241] A Neurosymbolic Framework for Interpretable Skeleton-Based Seizure Detection via Concept-Driven Logical Reasoning MICCAI2026

Link: https://arxiv.org/abs/2606.21252
Authors: Talha Ilyas,Deval Mehta,Zongyuan Ge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2026 (Early Accept: top 9%)

Click to view abstract

Abstract:Video-based seizure detection is essential for the management of epilepsy patients, offering a non-invasive complement to electroencephalography. While several deep learning approaches have been developed for video-based seizure detection, none are inherently interpretable, limiting their adoption and translation into clinical practice. We present, to our knowledge, the first exploration of a neurosymbolic framework for video-based seizure detection that directly addresses this gap. Our approach (1) extracts patient-centric skeleton sequences from epilepsy monitoring units via a prompt-guided foundation model, (2) predicts binary spatio-temporal concept activations grounded in clinical motor semiology guidelines, and (3) composes them via differentiable logic into interpretable Boolean rules with auditable contributions. Furthermore, to mitigate false positives arising from the traditional binary formulation (seizure vs. non-seizure), we sub-classify non-seizure segments into clinically relevant normal activities, providing the model with fine-grained discriminative supervision. Evaluated on two public seizure video benchmarks, our framework achieves 89.78% sensitivity with 0.06 false detections per hour on SAHZU and 85.27% with 0.09 on IEEE, while producing complete three-level interpretability: every prediction decomposes into which motor primitives were detected, how they were logically composed, and how much each rule contributed to the clinical decision. We publicly release all annotations, extracted pose sequences, our data pipeline, and code at this https URL.

[CV-242] ACE-GS: Acing the Trade-off with Accurate Compact and Efficient 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.21244
Authors: Jijian Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:3D Gaussian Splatting achieves exceptional real-time rendering, but its substantial computational and storage demands hinder widespread deployment. Existing accelerated paradigms often aggressively prune primitives for rapid convergence, causing severe loss of high-frequency details. To address this, we tackle the fundamental problem of achieving both exceptional rendering quality and ultra-fast reconstruction speed. In this paper, we propose ACE-GS, a progressive optimization framework tailored for accurate, compressed, and efficient scene representation. We realize that precise primitive management is the key to breaking this trade-off. Therefore, we first design a momentum consistency-guided densification strategy, strictly constraining primitive growth onto authentic geometric manifolds to avoid computational waste while significantly accelerating convergence. Building upon this efficient initialization, we deploy a statistical sensitivity-driven sparsification mechanism to precisely prune redundant primitives, yielding a further compressed footprint. Finally, to thoroughly compensate for the risk of micro-structure loss caused by the aforementioned strict primitive control, we introduce a cross-dimensional residual frequency compensation scheme that explicitly back-injects high-frequency error energy into primitive attributes, perfectly restoring sharp geometric details. Extensive experiments validate our superiority. While maintaining a highly compact scene representation, our system achieves up to 3.7 times training acceleration against the rapid framework Speedy-Splat. Requiring only 3 to 5 minutes to converge, ACE-GS secures the highest structural similarity and achieves a peak PSNR improvement of up to 0.89 dB over the original 3DGS, establishing a new benchmark for ultra-fast and high-fidelity novel view synthesis.

[CV-243] DIPBox: A Multi-scale Testing Framework for Tracking Dataset Regeneration CCS2026

Link: https://arxiv.org/abs/2606.21240
Authors: Tian Dong,Yan Meng,Shaofeng Li,Guoxing Chen,Yuling Chen,Zhen Liu,Haojin Zhu,Hao Chen
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM CCS 2026. Please cite this paper as “Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Yuling Chen, Zhen Liu, Haojin Zhu, Hao Chen. DIPBox: A Multi-scale Testing Framework for Tracking Dataset Regeneration. In the Proceedings of ACM Conference on Computer and Communications Security (CCS 2026).”

Click to view abstract

Abstract:Training datasets have tremendous proprietary value and are vulnerable to unauthorized copying. Existing defenses mainly focus on tracking individual data points, but pay little attention to the threat of dataset regeneration. Through a measurement study of public tumor datasets, we identify substantial real-world partial-dataset replication, raising concerns about potential license noncompliance. To counter the challenge of tracking previously unknown adversarial regeneration, our key insight is that regeneration that preserves model utility inevitably preserves measurable signals across multiple feature scales. We categorize these dataset features into sample-, set-, and distribution-level features and design four similarity metrics to accurately identify regeneration. Based on these metrics, we develop DIPBox, which to our knowledge is the first testing framework that tracks regeneration suspects via multi-scale similarity testing across a spectrum of defender access settings, from limited to full information. We further provide a learning-theoretic analysis that justifies these multi-scale metrics and formalizes an inherent utility–divergence trade-off, implying fundamental limits on evasive regeneration. Extensive experiments on 16 vision and text base datasets, 320 regenerated datasets, and 590 derived models validate that DIPBox outperforms previous solutions while characterizing its robustness and limits under three adaptive attacks.

[CV-244] Context-Aware Autoregressive Diffusion for Gloss-Wise Sign Language Production

Link: https://arxiv.org/abs/2606.21234
Authors: JungHoon Sung,Boeun Kim,Chu Xin,Hyung Jin Chang,ChangHo Kim,Sang-Il Choi,Younggeun Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 5 figures, 4 tables

Click to view abstract

Abstract:To generate natural and accurate sentence-level sign language, synthesizing the “gloss”, the fundamental semantic unit, is essential. However, most current sign-language production (SLP) methods generate entire sequences at once. While this end-to-end approach is often efficient, it is prone to temporal drift and hand motion blur as sentences get longer, and fails to accurately control individual glosses. In this paper, we propose the Context-aware Gloss-wise AutoRegressive Diffusion model (GARD), a gloss-wise diffusion framework that models coarticulation by conditioning on both semantic (linguistic) and kinematic (motion) contexts. To ensure natural continuity between gloss motions, GARD introduces two additional strategies: i) Inter-Gloss Transition Guidance, which applies gradient-based guidance to kinematically align inter-gloss boundaries and ensure seamless pose consistency. ii) Global Motion Harmonizer, refining the entire gloss motion sequence based on the boundary poses adjusted by Inter-Gloss Transition Guidance. Extensive experiments on Phoenix-T and CSL-Daily datasets demonstrate that GARD achieves superior performance over existing SLP methods in terms of both linguistic accuracy and motion similarity.

[CV-245] Arc-Length Parameterized Interpolating Splines

Link: https://arxiv.org/abs/2606.21209
Authors: Dafna K. Matsegora,Stephen M. Watt
Categories: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Mathematical Software (cs.MS); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:We present an iterative algorithm to compute an arc-length parameterized spline interpolating a set of points. This differs from other methods where the computed spline either does not interpolate the original points or the parameterization is not the arc-length of the returned curves. Our method is applicable in any dimension $D \ge 2$, and we illustrate it with numerical results for plane curves.

[CV-246] Real-time pedestrian attribute recognition with YOLOv8 and ResNet18

Link: https://arxiv.org/abs/2606.21200
Authors: Houssam El Mir
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Pedestrian attribute recognition (PAR) assigns semantic labels to detected pedestrians and is useful in surveillance, video retrieval, and human-centered graphics applications. This paper presents a two-stage framework in which YOLOv8n detects pedestrians and ResNet18-based models classify gender, estimate apparent age, and predict 61 binary attributes from each pedestrian crop. PETA and PA-100K are combined through semantic attribute mapping, producing a unified training corpus of more than 100,000 pedestrian images while retaining the PETA attribute space. On the reported test splits, the system obtains 99.89% gender classification accuracy, a 4.23-year apparent-age mean absolute error, and 89.96% multi-attribute accuracy with a 36.32% macro F1-score and 58.80% micro F1-score. Runtime measurements indicate 25-30 FPS on an NVIDIA RTX 5060 GPU. The results show that a lightweight detector-classifier pipeline can support real-time PAR, while low macro F1 indicates that rare attributes remain challenging.
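The gap between macro F1 (36.32%) and micro F1 (58.80%) reported above is what one would expect when a few frequent attributes are predicted well and many rare ones are not. The sketch below illustrates the two averages with made-up counts; the numbers are purely hypothetical and are not taken from the paper.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per binary attribute."""
    # Macro: average the per-attribute F1 scores, so every attribute weighs the same.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts first, so frequent attributes dominate.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn), macro


# Hypothetical counts: one frequent attribute predicted well, two rare ones predicted poorly.
counts = [(900, 50, 50), (2, 3, 18), (1, 4, 19)]
micro, macro = micro_macro_f1(counts)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Because macro averaging gives each attribute equal weight, poor scores on rare attributes pull it down much more than they affect the pooled micro score.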

[CV-247] Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders ICANN

Link: https://arxiv.org/abs/2606.21197
Authors: Sergio Lanza,Jae Hee Lee,Stefan Wermter
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: International Conference on Artificial Neural Networks (ICANN), 2026, Padua

Click to view abstract

Abstract:Vision Language Models (VLMs) have demonstrated impressive performance in tasks requiring joint understanding of images and text, such as image captioning and Visual Question Answering (VQA), but our understanding of their internal processes remains limited. Recently, Sparse Autoencoders (SAEs) have emerged as a promising tool to support the interpretation of concepts encoded in VLMs. However, most SAE-based approaches focus only on textual or visual concepts separately, ignoring multimodal concepts. This limitation hinders a comprehensive understanding of VLMs, since concepts that integrate both modalities can be misclassified. Moreover, previous visual approaches often produce low-quality visual concept descriptions that are vague or incomplete, limiting their usefulness for understanding model reasoning. We propose a framework based on SAEs to extract and analyze visual, textual, and multimodal concepts from VLMs. For each neuron, we propose a candidate human-interpretable concept and compute the alignment between the concept and the dataset samples using cosine similarity scores. Experiments on a VQA dataset (LLaVA-NeXT) demonstrate that our framework improves visual concept quality by up to 45% compared to existing SAE-based methods, while maintaining high textual concept quality and enabling systematic identification of multimodal concepts. This work contributes new insights into the conceptual space of VLMs, providing a structured approach to distinguish between visual, textual, and multimodal concepts. 
The code is available at this https URL.

[CV-248] HERO: Hypothesis-Driven Evidence Retrieval from Omics for Multi-Task Breast Cancer Analysis MICCAI2026

Link: https://arxiv.org/abs/2606.21174
Authors: Xiangyu Li,Ran Su
Categories: Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN)
Comments: 11 pages, 3 figures, Early accepted at MICCAI 2026

Abstract:Matched multi-omics can improve WSI-based biomarker and prognosis prediction, but most existing pipelines use omics as a parallel feature stream or textual context rather than as an explicit retrieval constraint. HERO asks whether observed omics can be a testable morphology hypothesis: a sparse pathway-to-morphology prior maps DNA methylation and miRNA into a K-dimensional intent vector m (K=16), TF-IDF retrieval over structured 10 captions selects endpoint-relevant regions, and a cosine gate c=cos(m,v) triggers deterministic deficit-driven repair when c < \tau_c. This closed-loop design bounds VLM calls, reduces reliance on embedding-based semantic matching, and makes every retrieval and verification step lexically auditable. On TCGA-BRCA (930 WSIs, patient-level 5-fold CV), HERO sets a new state of the art across ER, PR, HER2, subtype, and risk prediction, outperforming both multimodal fusion and VLM-based baselines.
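
As a rough illustration of the cosine gate mentioned in this abstract, the Python sketch below computes c = cos(m, v) and flags repair when c falls below a threshold. It assumes that v is an evidence vector aggregated from the retrieved regions and that the gate fires when c is below tau_c; the function names, the default tau_c = 0.5, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def needs_repair(intent, evidence, tau_c=0.5):
    """Return (c, flag), where flag is True when c < tau_c.

    `intent` plays the role of the K-dimensional vector m and `evidence`
    the role of v; tau_c = 0.5 is a placeholder threshold (assumption).
    """
    c = cosine(intent, evidence)
    return c, c < tau_c


# Toy K=16 vectors: evidence that matches the intent vs. evidence orthogonal to it.
m = [1.0] * 8 + [0.0] * 8
v_match = [1.0] * 8 + [0.0] * 8
v_miss = [0.0] * 8 + [1.0] * 8

print(needs_repair(m, v_match))  # gate stays closed
print(needs_repair(m, v_miss))   # gate fires, repair would be triggered
```

Because the gate is a plain threshold on a scalar, each decision can be logged and replayed, which fits the auditability claim in the abstract.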

[CV-249] BadDreamer: Transferable Backdoor Attacks against Video World Models for Autonomous Driving

Link: https://arxiv.org/abs/2606.21172
Authors: Zhe Shuai,Xiaopeng Xie,Yikun Zeng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 8 figures, 3 tables. Preprint

Abstract:Video world models are increasingly used in autonomous driving to forecast future scene evolution and provide future-aware spatio-temporal representations for downstream action prediction. In perception-to-action pipelines, these representations can directly influence ego-vehicle waypoint planning, making the learned future dynamics a critical security-sensitive component. Despite their promise, the training-time security risks of autonomous-driving video world models remain largely unexplored. We present BadDreamer, a transferable spatio-temporal backdoor attack that targets the perception side of this pipeline. Unlike conventional backdoors that manipulate image labels, prompt outputs, or action supervision, BadDreamer poisons the learned transition dynamics of a video world model. It constructs trigger-erasure sequences in which an oncoming yellow delivery rider is visible in the observed context frames but erased from the future frames. After fine-tuning on a small fraction of such sequences, the compromised world model learns a hidden conditional association: when the physical trigger appears, it hallucinates a future where the rider disappears and the road appears clear. We further show that this corrupted future-aware representation can transfer to the downstream action module without directly modifying ego-trajectory labels, inducing unsafe non-evasive waypoint predictions. Our experiments instantiate this attack on a representative open-source perception-to-action pipeline, revealing a representation-level safety risk in autonomous-driving video world models and highlighting the need for backdoor-aware validation beyond clean generation quality.

[CV-250] PIAvatar: Physically Interactive Avatars via Deformation Gradient Decoupling

Link: https://arxiv.org/abs/2606.21162
Authors: Sang-Hun Han,Min-Gyu Park,Jisu Shin,Seunghyun Shin,Jin-Hwi Park,Hae-Gon Jeon
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 13 figures

Abstract:3D human avatars have shown impressive visual fidelity driven by pose-conditioned models, yet they still lack the physical ability required for interactions with each other and with their environments. Although recent studies have made various attempts to incorporate physical characteristics into 3D avatars, they exhibit only limited physical deformations, often leading to constrained interaction behaviors. To resolve this issue, we present PIAvatar, a framework that simultaneously enables physically aware avatar-avatar and avatar-environment interactions and non-rigid deformable human body simulation. Our key insight is to decouple kinematic velocity from the deformation gradient. When external forces act on avatars, the kinematic velocity induces stress, which hinders the avatar's ability to achieve a desired pose. In addition, we integrate a skeletal framework within the avatar, which allows its poses to be estimated and tracked in real time in closed form, even during non-rigid physical interactions. Our approach is implemented within a conventional Material Point Method framework to ensure physically consistent dynamics. Finally, we evaluate the method on both human-object and human-human interaction scenarios to assess its behavior under diverse interaction settings.

[CV-251] Contrastive and Adaptive Multi-modal Masked Autoencoder for Spatial Transcriptomics

Link: https://arxiv.org/abs/2606.21156
Authors: Joohyeok Kim,Taejin Jeong,Jinyeong Kim,Seong Jae Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The high cost of spatial transcriptomics (ST) has driven extensive studies into predicting gene expression directly from H&E histology images. However, this prediction task faces an inherent limitation, as tissue morphology alone provides insufficient information to fully resolve underlying gene expression. To address this limitation, a recent study leverages partial gene expression to guide the prediction process alongside histology images. Building on this paradigm, we approach the prediction task as a spatial imputation problem, employing a Masked Autoencoder (MAE) to utilize a small fraction of gene expression as genetic anchors for inferring whole-slide gene expression profiles. Specifically, we propose a bio-saliency score and a learning-to-rank strategy to adaptively identify the most informative spots within the tissue. Based on these identified spots, our framework selects contiguous regions as genetic anchors to ensure suitability for real-world ST profiling hardware. To effectively leverage these anchors, we design a cross-modal joint encoder that integrates visual and genetic modalities. By aligning the selected anchors with their corresponding visual features via contrastive learning, the encoder generates robust joint representations to accurately predict gene expression across the whole slide. Notably, our framework consistently surpasses existing methods in both histology-only prediction and spatial imputation, achieving superior accuracy even without genetic anchors and further excelling with as little as 10% transcriptomic coverage. Our code is available at this https URL.

[CV-252] ChronoLock: Protecting Videos from Unauthorized Text-to-Video Personalization

Link: https://arxiv.org/abs/2606.21146
Authors: Jiaming He,Jiashu Zhang,Guanyu Hou,Shuhan Ye,Hanwei Zhu,Yi Yu,Xudong Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-video (T2V) diffusion models have made it increasingly easy to synthesize realistic and temporally coherent videos, while recent personalization techniques allow such models to imitate a specific subject, style, or motion pattern from only a few reference clips. This capability creates a new data-misuse risk: videos shared online can be collected and used for unauthorized T2V fine-tuning. Existing protective perturbations are mainly designed for image recognition or text-to-image personalization, and therefore focus on corrupting static appearance cues rather than the temporal denoising dynamics that make video personalization possible. To address this gap, we introduce ChronoLock, the first proactive protection framework that makes released videos difficult to exploit for unauthorized T2V personalization. ChronoLock targets the motion-learning process directly by optimizing bounded perturbations over temporal denoising trajectories. It first disrupts intra-chunk temporal adaptation with a diffusion objective that combines fitting error, frame-relative denoising relations, and adjacent-frame variation, and then enlarges inter-chunk boundary mismatch to weaken long-range motion continuity. Transformation-sampled updates further improve robustness to common preprocessing. Experiments on UCF Sports and HMDB51 with popular T2V backbones and personalization schemes show that ChronoLock effectively reduces motion imitation under automatic metrics and human evaluation.

[CV-253] SEED: Simple ViT and Evolving Harness for Explainable Text Forgery Detection

Link: https://arxiv.org/abs/2606.21138
Authors: Kahim Wong,Kemou Li,Yiming Chen,Haiwei Wu,Jiantao Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:AI-assisted image editing threatens trust in financial, legal, and identity records. The GenText-Forensics Challenge at ACM MM 2026 addresses this by requiring structured forensic reports that integrate detection, pixel-level localization, and natural-language explanation for multilingual, text-centric forged images. We present SEED, a modular system with three components. First, a similarity-guided pipeline augments training with diverse synthetic forgeries. Second, a single ViT, built on DINOv3 with LoRA adaptation, jointly performs detection and pixel-level localization while preserving pre-trained priors with minimal trainable parameters. Third, an evolving harness takes the detector's predictions and generates a complete forensic report via an MLLM, iteratively improved through a proposer-evaluator loop that optimizes report quality. SEED ranked 3rd in the GenText-Forensics Challenge. Code and data are available at this https URL.

[CV-254] Odoriko: A Shape-Aware Multimodal Diffusion Framework for Human Motion ECCV2026

Link: https://arxiv.org/abs/2606.21135
Authors: Dongseok Shim,Julian Tanke,Kengo Uchida,Christian Simon,Koichi Saito,Takashi Shibuya,Shusuke Takahashi,Yuki Mitsufuji
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: ECCV 2026

Abstract:Human motion generation has been widely studied across diverse input modalities (text, music, and video), and recent efforts have unified these into single multimodal frameworks. However, while morphological factors such as gender and body shape are known to produce distinct kinematic signatures, no existing unified framework incorporates them into generation; all subjects are treated as morphologically equivalent. We present Odoriko, the first unified multimodal motion generation framework that reflects subject bio-morphological information directly in the synthesized motion. Rather than averaging over subject variation, Odoriko generates motion that is consistent with who is moving, not just what they are asked to do, across text, music, and video conditions within a single model. When explicit morphological information is unavailable, Odoriko additionally recovers subject morphology alongside motion, unifying estimation and generation in one framework. Extensive experiments across text-to-motion, music-to-dance, and video-to-motion benchmarks demonstrate that Odoriko matches or exceeds prior specialized models on standard metrics, while enabling morphology-consistent generation that no existing unified framework supports.

[CV-255] MammoExpert: Benchmarking Chain-of-Thought Reasoning in Mammography Diagnosis KDD2026

Link: https://arxiv.org/abs/2606.21119
Authors: Di Dai,Bo Liu,Youcheng Li,Haojun Yu,Zhouhang Bian,Quanlin Wu,Dong Wang,Sichen Meng,Hongye Xuan,Zijie Lan,Shenda Hong,Liwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: KDD 2026

Abstract:Mammography is an essential tool for breast cancer detection, with millions of examinations conducted annually. However, publicly available high-quality mammography datasets for AI development remain limited in both scale and annotation richness, particularly regarding pathological subtype coverage and structured diagnostic reasoning annotations. In this paper, we present MammoExpert, the first mammography dataset with Chain-of-Thought reasoning annotations across three diagnostic phases: (i) primal observation, (ii) factual assessment, and (iii) diagnostic synthesis. The dataset comprises 2,379 mammography images covering 67 WHO-classified histopathology subtypes, and each exam provides 42 radiographic features annotated by nine senior radiologists. We evaluate the dataset on the breast lesion classification task, demonstrating superior accuracy and reasonability compared to existing classification models. Combining the public dataset CBIS-DDSM with MammoExpert yields a 7.1% improvement in classification accuracy, while training the model to learn CoT reasoning yields a further 4% gain on the MammoExpert test set. Similar improvements are observed on the INbreast and VinDr datasets, where the full approach yields accuracy gains of 6.9% and 6.7%, respectively. MammoExpert can serve as a benchmark for interpretable breast lesion diagnosis through explicit CoT reasoning.

[CV-256] ConnectomeBench2: A Unified Benchmark for Automated Connectomic Proofreading

Link: https://arxiv.org/abs/2606.21116
Authors: Jeff Brown,Tim Farkas,Gleb Razgar,Edward S. Boyden
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Proofreading, the correction of segmentation errors in 3D brain reconstructions, is the rate-limiting step in synapse-resolution connectomics. We release ConnectomeBench2, a unified multi-species dataset of over 716,485 expert-labeled proofreading decisions with 4,500,000 associated images spanning four major open connectomes (mouse, human, zebrafish, fly) and covering both split and merge error correction. Trained on this dataset, a single Vision Transformer with shared encoders for mesh geometry and electron microscopy reaches human-level accuracy across species for split error correction and merge error identification, with performance scaling with data size and modality. Beyond accuracy, we show that the model is well-calibrated within distribution, that measures of distribution distance predict where calibration and accuracy will degrade on unseen data, and that connectomics-specific pretraining and active learning-based sample selection show potential to substantially reduce the labeling effort needed to extend to new species and brain regions. The benchmark provides the infrastructure to train and evaluate increasingly capable vision models for connectomic proofreading. Data and code availability: the ConnectomeBench2 dataset is released on Hugging Face at this https URL. The accompanying codebase is available on GitHub at this https URL.

[CV-257] MS-rPPG: Multi-spectral State Space Model for Remote Photoplethysmography in Driver Monitoring Systems

Link: https://arxiv.org/abs/2606.21115
Authors: Jiho Choi,Sang Jun Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Remote photoplethysmography (rPPG) is a camera-based technique for measuring physiological signals, particularly cardiac activity. From the remotely measured signals, heart rate can be estimated, which is crucial for health monitoring. In this study, we investigate a driver health monitoring system based on remote heart rate estimation. However, driving environments are uncontrolled settings in which videos are subject to varying illumination conditions and frequent head movements. We introduce MS-rPPG, a multi-spectral framework that combines RGB with near-infrared (NIR) face video to improve rPPG estimation under challenging driving conditions. To combine the complementary features from the two spectral videos, we propose a cross-spectral linear modulation (CSLM) strategy based on frequency-domain analysis. Moreover, we introduce MS-Mamba, a novel state space model designed to effectively model long-range temporal dependencies while jointly capturing cross-channel interactions between multi-spectral features. We collected a real-world dataset called MS-Drive, recorded from 50 participants while they were driving. The proposed method was evaluated on the MR-NIRP Car and MS-Drive datasets. The experimental results indicate that MS-rPPG shows better robustness and heart rate estimation accuracy than previous methods, highlighting its promise for driver health monitoring. The code is available at this http URL.

[CV-258] Object-Centric Dataset Resources for Constrained-Data Image Generation and Augmentation

Link: https://arxiv.org/abs/2606.21113
Authors: Vasile Marian,Yong-Bin Kang,Alexander Buddery
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages including references, 2 figures, 2 tables. Dataset and related files at this https URL and this https URL

Abstract:Object-centric image generation is important in settings with few labeled examples, including pedestrian analysis in smart-city scenes, traffic-sign inspection, and domain-specific object detection. Synthetic images are most useful for training and evaluation when datasets preserve object structure, bounding boxes, visual diversity, and realistic context. Existing image datasets usually target classification, detection, or scene understanding rather than controlled object-centric generation and augmentation with limited class-specific data. We present a shareable collection of three object-centric dataset resources: Cityscapes-Pedestrian, TrafficSigns, and COCO PottedPlant. The collection standardizes 256-by-256 object-centric crops and bounding-box annotations across three regimes: dense pedestrian scenes with privacy blur and occlusion, cleaner high-contrast traffic signs, and context-diverse potted-plant scenes. The release contains 3,009 TrafficSigns samples, 2,156 Cityscapes-Pedestrian manifest records, and 7,679 COCO PottedPlant manifest records. The larger COCO-derived manifest preserves contextual and multi-instance diversity, while equal-size subsets can be drawn with a fixed random seed for controlled comparisons. The release provides direct TrafficSigns data where redistribution is permitted, together with scripts, manifests, box-level annotation tables, checksums, and reconstruction documentation for the Cityscapes- and COCO-derived subsets. It is available through the Latzi/object-centric-low-data-datasets GitHub repository and Zenodo DOI https://doi.org/10.5281/zenodo.20573001. The collection supports label and split inspection, subset creation, reconstruction from upstream data, and evaluation of object-centric image generation or synthetic-data augmentation methods on shared records.
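
The remark about drawing equal-size subsets with a fixed random seed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record IDs are stand-in integers, the dictionary keys and the function name `equal_size_subsets` are hypothetical, and the release's own scripts may work differently. Only the three record counts come from the abstract.

```python
import random


def equal_size_subsets(manifests, seed=0):
    """Draw one random subset per manifest, each as large as the smallest manifest.

    A single seeded generator and a sorted iteration order make the draw
    reproducible across runs.
    """
    n = min(len(records) for records in manifests.values())
    rng = random.Random(seed)
    return {name: sorted(rng.sample(manifests[name], n)) for name in sorted(manifests)}


# Stand-in manifests: integer IDs in place of real records (assumption).
manifests = {
    "TrafficSigns": list(range(3009)),
    "Cityscapes-Pedestrian": list(range(2156)),
    "COCO-PottedPlant": list(range(7679)),
}

subsets = equal_size_subsets(manifests, seed=42)
for name, ids in subsets.items():
    print(name, len(ids))
```

With these counts every subset has 2,156 records, the size of the smallest manifest, so comparisons across the three regimes are not confounded by dataset size.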

[CV-259] SARIF: Segment Anything for Robust Image Forensics ECCV2026

Link: https://arxiv.org/abs/2606.21108
Authors: Dong-Hyun Moon,Ju-Hyeon Nam,Sang-Chul Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Equal contribution: Dong-Hyun Moon and Ju-Hyeon Nam. Corresponding author: Sang-Chul Lee. Code: this https URL

Abstract:Image forgery localization remains challenging due to diverse manipulation techniques and distribution shifts. Existing forgery localization models achieve high accuracy on benchmarks but often struggle with cross-domain generalization and robustness. In this paper, we propose SARIF (Segment Anything for Robust Image Forensics), a framework that leverages the Segment Anything Model (SAM), which has a promptable architecture and strong generalization ability. SARIF introduces a feedback-guided mask decoder and a dual-encoder design that extracts forgery-specific information to capture forensic traces while exploiting the SAM architecture. To localize manipulated regions, we design a block-wise prompting mechanism that derives forgery-specific cues from residual features between an adapted encoder and its frozen counterpart. These features are fused with the previous mask prompt to drive a feedback-based mask refinement process, enabling automatic forgery segmentation without manual input. Extensive experiments on standard forgery-localization benchmarks show that SARIF achieves strong average cross-dataset performance and robustness to common image corruptions.

[CV-260] ShuffleFlow: Scalable Posterior Inference for Bayesian Inverse Imaging

Link: https://arxiv.org/abs/2606.21099
Authors: Tianao Li,Tjitske Starkenburg,Yu Sun,Emma Alexander
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE International Conference on Computational Photography (ICCP), 2026

Abstract:Variational inference (VI) is a powerful method for principled posterior inference for scientific inverse imaging. VI learns the posterior distribution, often with a flow-based network, which can cheaply generate posterior samples upon optimization, and can flexibly incorporate score-based or classic priors. However, its application to large-scale image reconstruction is severely hindered by the poor scalability of the flow-based networks. In this work, we introduce ShuffleFlow, a scalable VI framework to address this challenge. Our method breaks down the problem into three parts: a pixel-unshuffling-based image coordinate sampler, a neural field as feature encoder, and a conditional normalizing flow (CNF) as posterior estimator. Specifically, our framework partitions an image into a stack of sub-images with pixel-unshuffling and uses a shared CNF to model the joint distribution of the sub-image stack. We condition the CNF on the output of a neural field, which embeds feature vectors corresponding to pixel-unshuffling sample locations to capture spatial structures, and share the flow’s latent variable across the channels to model their correlations. We demonstrate our method’s effectiveness and efficiency on both linear and nonlinear imaging inverse problems, and show its ability to more rapidly generate a high-sample-count posterior than diffusion samplers.
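
Pixel-unshuffling, which this abstract uses to partition an image into a stack of sub-images, can be shown with a small pure-Python sketch. It covers only the partitioning step on a single-channel image stored as a list of rows; the function name and the ordering of the sub-images are illustrative assumptions, and nothing here reproduces the paper's sampler, neural field, or flow.

```python
def pixel_unshuffle(image, r):
    """Split an H x W image into r*r sub-images of size (H/r) x (W/r).

    Sub-image k = i*r + j collects the pixels image[y*r + i][x*r + j],
    so every sub-image is a regularly strided view of the full image.
    """
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("image height and width must be divisible by r")
    return [
        [[image[y * r + i][x * r + j] for x in range(w // r)] for y in range(h // r)]
        for i in range(r)
        for j in range(r)
    ]


# 4 x 4 toy image whose pixel value encodes its position: value = 4*row + col.
image = [[4 * row + col for col in range(4)] for row in range(4)]
stack = pixel_unshuffle(image, 2)

for k, sub in enumerate(stack):
    print(k, sub)
```

Each sub-image is a low-resolution copy of the scene offset by at most r-1 pixels, which is why a single shared model over the stack can remain small while the full image grows.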

[CV-261] How Should a Robot Configure Its Laser Scanner for Inspection? IROS2026

Link: https://arxiv.org/abs/2606.21093
Authors: Zhiling Chen,David Gorsich,Matthew P. Castanier,Yang Zhang,Jiong Tang,Farhad Imani
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 9 figures. Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract:Robotic inspection relies on accurate sensing to acquire high-fidelity geometric measurements for defect detection and metrology. While prior work has focused on robot motion and viewpoint planning, how to configure sensing parameters remains largely underexplored, despite their decisive impact on measurement quality. We propose SenseHD, a robotic sensing system that formulates scanner configuration as an instruction-conditioned sensing decision. Instead of predicting precise parameter values, SenseHD treats sensing parameters as discrete sensing actions and selects stable sensing regimes through hyperdimensional associative memory. Experiments on a real robotic inspection platform demonstrate that SenseHD robustly selects appropriate sensing configurations and significantly improves inspection reliability, while remaining lightweight and efficient compared to baseline methods.

[CV-262] Neural Architecture Distributions: A New Paradigm for Stochastic Segmentation

Link: https://arxiv.org/abs/2606.21061
Authors: Conghui Li,Junhao Huang,Chern Hong Lim,Bing Xue,Mengjie Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Stochastic segmentation seeks to represent multiple plausible masks for a single image, which is essential in safety- and quality-critical applications such as medical imaging or building defect inspection. Most existing methods introduce stochasticity by injecting continuous latent variables or by iterative denoising trajectories, whose stochastic sources are difficult to search or audit directly. We propose architecture distributions as a new stochastic source for segmentation: instead of sampling a latent variable or noise, we sample a discrete architecture from a learned distribution over operator choices at multiple searchable positions in a segmentation backbone. Each sampled architecture yields one mask through the selected active path, so inference depends on the executed subnet rather than the complete candidate bank. This approach also supports architectural provenance, since each output corresponds to a specific architecture configuration. To reduce collapse toward averaged masks, we train with set-level supervision by matching a set of architecture-sampled predictions to the annotation set using an IoU-based energy-distance surrogate. We further construct the candidate bank with evolutionary search, making the support of the stochastic source optimizable before distribution learning. The proposed method achieves state-of-the-art distribution matching and hypothesis coverage on LIDC-IDRI, and remains effective on two extension tasks. To the best of our knowledge, this is the first work to formulate stochastic segmentation as learning an architecture distribution and realizing output diversity through architecture sampling.
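
To make the set-level matching idea concrete, the sketch below computes the standard generalized energy distance between a set of predicted masks and a set of annotations, using d = 1 - IoU as the distance. This is the usual evaluation-style formula, not the paper's training surrogate, whose exact form the abstract does not give. Masks are flat lists of 0/1 values, and treating two empty masks as identical (IoU = 1) is a convention assumed here.

```python
def iou(a, b):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 if union == 0 else inter / union  # two empty masks count as identical


def mean_distance(xs, ys):
    """Average of d = 1 - IoU over all pairs drawn from the two mask sets."""
    total = sum(1.0 - iou(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_distance_sq(preds, annots):
    """Squared generalized energy distance: 2 E[d(S,Y)] - E[d(S,S')] - E[d(Y,Y')]."""
    return (
        2.0 * mean_distance(preds, annots)
        - mean_distance(preds, preds)
        - mean_distance(annots, annots)
    )


annots = [[1, 1, 0, 0], [1, 1, 1, 0]]     # two plausible annotations
diverse = [[1, 1, 0, 0], [1, 1, 1, 0]]    # predictions that cover both modes
collapsed = [[1, 1, 0, 0], [1, 1, 0, 0]]  # predictions stuck on a single mode

print(energy_distance_sq(diverse, annots))
print(energy_distance_sq(collapsed, annots))
```

The diversity terms are what penalize collapse: a prediction set that repeats one mask matches that annotation perfectly but still scores worse than a set that covers both annotations.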

[CV-263] Self-Supervised Dual-Frequency Phase Decomposition for Single-Shot Composite Fringe Projection Profilometry

Link: https://arxiv.org/abs/2606.21027
Authors: Jin-Hyuk Seok,Yatong An,Jae-Sang Hyun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Single-shot fringe projection profilometry (FPP) has been actively studied for real-time measurement, dynamic object reconstruction, and motion-sensitive environments. Composite fringe patterns are advantageous in single-shot FPP because multiple frequency components can be encoded in a single pattern, enabling phase ambiguity resolution. Existing approaches mainly rely on Fourier transform-based methods or supervised deep learning methods. However, Fourier transform-based methods often suffer from limited accuracy and degraded performance in complex regions, while supervised methods require dense phase or depth labels, which are costly to obtain. In this work, we propose a self-supervised phase refinement framework for single-shot composite fringe patterns without requiring phase or depth labels. The proposed method exploits the scale and direction relationships between low- and high-frequency phase gradients, improving the reliability of phase separation. We also introduce a soft edge consistency loss to preserve object boundaries and fine geometric structures. Experimental results show that the proposed method achieves MAE_z and RMSE_z of 0.367 mm and 1.804 mm, respectively, outperforming the best-performing transform-based baseline, which obtains 0.402 mm and 2.785 mm. The proposed method also improves the valid-pixel ratio from 84.75 % to 95.07 %. These results demonstrate the effectiveness of self-supervised dual-frequency phase refinement for reliable single-shot 3D reconstruction without ground-truth label supervision.
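
The reported figures can be turned into relative improvements with a short calculation. The Python sketch below uses only the numbers quoted in the abstract; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, proposed):
    """Fractional error reduction of `proposed` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Values quoted in the abstract (errors in mm, valid-pixel ratio in percent).
mae_gain = relative_reduction(0.402, 0.367)
rmse_gain = relative_reduction(2.785, 1.804)
coverage_gain = 95.07 - 84.75

print(f"MAE_z reduction:  {mae_gain:.1%}")
print(f"RMSE_z reduction: {rmse_gain:.1%}")
print(f"Valid-pixel gain: {coverage_gain:.2f} percentage points")
```

The RMSE_z reduction (about 35.2%) is much larger than the MAE_z reduction (about 8.7%). Since RMSE weights large errors more heavily, this pattern would be consistent with the gain coming mostly from fewer large outliers, although the abstract does not state this directly.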

[CV-264] Sparse Point-Guided Fusion of Supervised and Self-Supervised Learning Model for Seaweed Segmentation

Link: https://arxiv.org/abs/2606.21026
Authors: Tatsuya Suzuki,Kazuya Ijuin,Hideki Tomimori,Megumi Chikano,Katsushi Sakai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ASME OMAE 2026

Abstract:The ocean plays a critical role in sustainable development, particularly in climate change mitigation. Among marine ecosystems, blue carbon ecosystems are recognized as important natural carbon sinks. In this context, this paper addresses precise seaweed classification for blue carbon quantification in Ocean Digital Twin initiatives. Conventional methods, including supervised learning (limited by data scarcity and domain gaps) and self-supervised learning (unable to assign class labels), struggle with underwater complexities and diverse seaweed species. To overcome this, we propose a novel two-stage seaweed segmentation technique. The first stage, Supervised and Self-supervised Learning Model Propagation, leverages supervised learning for initial class information and approximate locations, which then guide self-supervised learning toward detailed, accurate segmentation. Subsequently, MaskFusion (MF) refines these results by merging instance-level masks for highly accurate segmentation. This integrated approach allows automatic class label assignment and mitigates domain gap effects. Specifically, instance segmentation estimates sparse point locations, which then guide self-supervised learning for detailed region segmentation. Evaluated with underwater images from Yamaguchi Prefecture, our full proposed method (propagation followed by MF) achieved a 0.068 mIoU improvement over USIS-SAM, demonstrating significant accuracy gains, particularly for small seaweed. This approach demonstrates strong potential for improving blue carbon quantification and marine ecosystem monitoring.

[CV-265] CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays

Link: https://arxiv.org/abs/2606.21020
Authors: Geon Choi,Hangyul Yoon,Nalee Kim,Jeong Yun Jang,Hyunju Shin,Hyunki Park,Sang Hoon Seo,Edward Choi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The evaluation of vision-language models (VLMs) for chest X-ray (CXR) analysis has largely been limited to disease-presence classification without visual grounding. Such evaluations fail to verify the expert-level lesion perception necessary to ensure the clinical reliability of VLMs. To address these limitations, we introduce CheXpercept, a sequential, multi-level perception benchmark that mirrors a radiologist’s cognitive workflow across coarse-level detection, fine-level contour evaluation and revision, and semantic-level attribute extraction. To ensure high clinical fidelity at scale, we construct the dataset using a semi-automated generation pipeline paired with a review by six medical experts. CheXpercept contains 10,400 QA items derived from 2,100 CXRs, covering seven clinically critical pulmonary and cardiac lesions. To demonstrate the current landscape of VLM perception, we benchmark 14 general and medical VLMs on CheXpercept. The models achieve adequate performance only at the coarse level, with accuracy degrading precipitously on deeper visual tasks. Notably, medical VLMs show almost no perceptual advantage over their general-domain counterparts, highlighting a systemic flaw in current domain adaptation. The code and dataset will be publicly available.

[CV-266] Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Link: https://arxiv.org/abs/2606.20980
Authors: Adrian Cespedes,Marcelo Chincha,Dunant Cusipuma,Victor Flores-Benites,David Ortega,Arturo Deza
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 11 pages main body. 42 pages total. Data publicly available online

Abstract:As self-driving cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their action models, how well will these systems generalize to new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis in which human drivers from Lima, human drivers from New York City, and VLMs are shown dashcam footage collected in Lima and New York City and prompted with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that humans and VLMs diverge in their responses, though this is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely due to their highly out-of-distribution nature. Our dataset is available at: this https URL

[CV-267] UNITY: Attention Flow Networks for Adaptive Conditioning in Diffusion ECCV2026

Link: https://arxiv.org/abs/2606.20971
Authors: Aryan Das, Koushik Biswas, Moloud Abdar, Vinay Kumar Verma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026

Abstract:We introduce UNITY, a Universal-to-Specialized adapter for efficient and scalable composite conditioning in diffusion-based image generation. Unlike prior methods that train separate adapters for each conditioning modality, UNITY jointly learns shared semantics across multiple conditioning types and subsequently specializes without modifying the underlying architecture. The proposed two-stage training paradigm consists of a Universal Stage that captures cross-modal representations across all conditioning modalities using half of the total training steps, followed by a Specialization Stage that refines modality-specific features using the remaining training budget. At the core of UNITY are the Morphable Attention Flow (MAF) Network and Morph Wrapper modules, which enable channel-aware and spatially adaptive feature alignment through learnable flow fields and attention-based fusion. This constant-complexity formulation supports flexible operation under both single and composite conditioning settings while significantly reducing inference latency and memory consumption. Extensive experiments across multiple datasets demonstrate that UNITY achieves state-of-the-art image fidelity while maintaining superior memory efficiency. Code: this https URL

[CV-268] CogniRoute: Learning to Route Social Evidence in Omni-Modal Models

Link: https://arxiv.org/abs/2606.20970
Authors: Yifan Shen, Pei Tian, Xinzhuo Li, Bowen Fang, Shujun Xia, Bingxuan Li, Ana Jojic, Wenming Ye, Xu Cao, James Matthew Rehg, Ismini Lourentzou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Omni-modal models can ingest video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. This gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, vocal tone, temporal cue, or mismatch between what is said and what is visually expressed. We introduce CogniRoute, a schema-guided Mixture-of-Experts framework for social omni reasoning. CogniRoute uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. We further introduce route-aware reinforcement learning, which jointly optimizes token generation and expert allocation using rewards for answer correctness, modality-consistent reasoning, and cognitive temporal grounding. To support training and evaluation, we construct OmniSocialBench, a diagnostic social video QA resource with 118K structured training examples, grounded reasoning traces, schema labels, temporal evidence spans, and a manually verified evaluation split. CogniRoute achieves 59.38% average accuracy on OmniSocialBench, improving over the strongest proprietary baseline by 15.33 percentage points and the strongest open-source omni baseline by 26.77 points, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.

[CV-269] ELDiff: When Evidential Learning Meets Text-to-Image Diffusion

Link: https://arxiv.org/abs/2606.20924
Authors: Qingtao Pan, Kai Ye, Zhihao Dou, Bing Ji, Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In multi-object text-to-image (T2I) diffusion, ensuring semantic consistency between textual prompts and generated visual content is crucial for image synthesis. However, this consistency constraint is often underemphasized in the denoising process of diffusion models. Although token-supervised diffusion models can mitigate this issue by learning object-wise consistency between the image content and object segmentation maps, they tend to suffer from segmentation map bias and semantic overlap conflict, especially when multiple objects are involved. In this paper, we propose ELDiff, a new evidential learning-supervised T2I diffusion model, which leverages uncertainty measurement and conflict detection to improve tolerance to unreliable segmentation maps and suppress semantic conflicts, strengthening object-wise consistency learning. Specifically, a pixel evidence loss is proposed to restrain overconfidence in unreliable labels through evidential regularization, and a token conflict loss is designed to weaken the contradiction between semantics by optimizing a measured conflict factor. Extensive experiments show that ELDiff outperforms existing training-based and training-free T2I diffusion models on SD v1.4, SD v2.1, SDXL, SD v3.5, and Qwen-Image, without requiring additional inference-time manipulations. Notably, ELDiff can be seamlessly integrated into the existing training pipeline of T2I diffusion models. Code can be found at this https URL.

[CV-270] GIM-ENDO: A Multimodal Endoscopic Image and Video Dataset for Gastric Intestinal Metaplasia Morphology and Pathology

Link: https://arxiv.org/abs/2606.20919
Authors: Mojgan Forootan, Mahziar Setayeshfar, Ali Darvishi, Mohammad Tashakoripour, Hamidreza Bolhasani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gastric intestinal metaplasia (GIM) is a precursor lesion to gastric dysplasia and adenocarcinoma whose early detection is crucial for intervening in the carcinogenesis cascade. Artificial intelligence (AI) holds considerable promise for real-time endoscopic detection and characterization of GIM. However, development of reliable AI models has been constrained by the absence of publicly available, histopathologically validated datasets that combine detailed endoscopic annotations, histological subtype (complete and incomplete), standardized grading systems, and normal mucosal patterns. GIM-ENDO was designed to fill this gap. The dataset comprises demographic data, endoscopic findings, histopathological results, and H. pylori status acquired using the Olympus EVIS X1 system with white-light endoscopy (WLE) and image-enhanced endoscopy (IEE), including narrow-band imaging (NBI) and magnifying NBI (M-NBI), along with images and video clips from 24 patients (22 GIM-positive, 2 normal controls). Annotations cover six primary IEE endoscopic signs – light blue crest (LBC), marginal turbid band (MTB), white opaque substance (WOS), TV pattern (Fusion), atrophy, and map-like erythema (MLE) – plus two additional endoscopic findings (AHP and GA) recorded where present. GIM subtypes (complete and incomplete) are annotated for all GIM-positive cases; OLGA and OLGIM staging are provided where complete histological sampling was available. The dataset is publicly accessible at this https URL. For the latest updates and further information regarding this dataset, readers are referred to the DataBioX website: this https URL. A short version of this work has been submitted to MICCAI 2026 Open Data Track.

[CV-271] PROTON: Prototype-Based Test-Time Online OOD Detection for Medical VLMs

Link: https://arxiv.org/abs/2606.20913
Authors: Abhijit Das, Nichula Wasalathilaka, Yifan Lu, Adinath Dukre, Dwarikanath Mahapatra, Shadab Khan, Imran Razzak
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Medical vision-language models (VLMs) enable zero-shot clinical image classification, yet reliably detecting out-of-distribution (OOD) inputs at deployment remains an open problem. No static scoring method works across all shift types: Maximum Concept Matching (MCM) on FLAIR achieves 76.4% AUROC for far-OOD but only 42.4% for covariate shifts such as ultra-wide-field fundus images, effectively random. We trace this to a structural mismatch: covariate-shifted inputs are indistinguishable from in-distribution samples in softmax space, yet occupy distinct regions in the VLM embedding space. To exploit this untapped signal, we propose PROTON (PROtotype-based Test-time ONline OOD detection), a lightweight post-hoc module that maintains an online prototype bank from high-confidence test predictions and adaptively fuses prototype distance with MCM scoring via stream-level variance statistics, requiring no model modification, training data, or prompt engineering. On the ophthalmology benchmark FLAIR + FIVES, PROTON improves MCM by +23.9 AUROC on covariate shift, +8.8 on semantic shift, and +8.1 on far-OOD, making it the only zero-shot method to improve all three without hierarchical prompts or labeled data. Code is available at this https URL, and the project page is available at this https URL.
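To make the scoring idea in the abstract above concrete, here is a minimal sketch of Maximum Concept Matching (the maximum softmax probability over temperature-scaled image-text cosine similarities) combined with a distance to stored prototypes. The helper names, the toy embeddings, and especially the fixed `weight` are illustrative assumptions; the abstract says PROTON fuses the two signals adaptively from stream-level variance statistics, and that rule is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mcm_score(image_emb, text_embs, temperature=1.0):
    """MCM: max softmax probability over scaled image-text similarities."""
    sims = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    return max(exps) / sum(exps)

def prototype_distance(image_emb, prototypes):
    """Smallest cosine distance to any stored prototype (hypothetical helper)."""
    return min(1.0 - cosine(image_emb, p) for p in prototypes)

def fused_score(image_emb, text_embs, prototypes, weight=0.5):
    """Higher means more in-distribution-like.

    ASSUMPTION: a fixed linear weight stands in for the adaptive,
    variance-based fusion described in the abstract.
    """
    mcm = mcm_score(image_emb, text_embs)
    dist = prototype_distance(image_emb, prototypes)
    return (1.0 - weight) * mcm - weight * dist

# Toy 2-D embeddings: two class prompts and one prototype near class 0.
texts = [[1.0, 0.0], [0.0, 1.0]]
protos = [[1.0, 0.0]]
in_dist = fused_score([1.0, 0.0], texts, protos)
shifted = fused_score([1.0, 1.0], texts, protos)
```

In this toy setup the sample aligned with a prompt and a prototype scores higher than the ambiguous one, which is the behaviour a detector thresholds on.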

[CV-272] BELDE: Building a Large-scale Earth-observation Land-cover Dataset for Europe

Link: https://arxiv.org/abs/2606.20909
Authors: Ümit Mert Çağlar, Alptekin Temizel
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Earth observation imagery plays a critical role in environmental monitoring, urban planning, disaster assessment, and climate analysis. While multi-spectral sensors are increasingly available, true-color (RGB) imagery remains widely used due to the power, cost, and deployment constraints of many satellite and aerial platforms. However, existing land-cover segmentation datasets are often limited in geographic coverage, scale, or public accessibility. To bridge this gap, we introduce BELDE (Building a Large-scale Earth-observation Land-cover Dataset for Europe), a publicly available dataset tailored for RGB-based remote sensing semantic segmentation. Constructed from Sentinel-2 true-color images and ESA WorldCover data annotations, BELDE contains 1,088,385 curated image-segmentation map pairs spanning Europe with 7 land-cover classes at 10 m spatial resolution, making it one of the largest publicly available RGB land-cover segmentation datasets for Earth observation. To facilitate cross-region generalization studies, we additionally introduce BELDE-K (16,607 pairs) covering the Republic of Korea and BELDE-CA-NV (88,155 pairs) covering California and Nevada in the United States. We establish baseline results using multiple semantic segmentation architectures and evaluate both in-domain and cross-domain performance. Models trained on BELDE achieve an F1 score of 83.0% on the European test set, while performance decreases to 66.4% on BELDE-CA-NV and 58.3% on BELDE-K, highlighting the challenges posed by out-of-distribution geographic domain shift. By providing a continental-scale RGB segmentation and evaluation benchmark, BELDE supports the development of robust and transferable Earth observation models. The dataset and benchmark resources will be publicly released. 

[CV-273] Go-with-the-Track: Video Compositing and Motion Control with Point Tracking SIGGRAPH2026

Link: https://arxiv.org/abs/2606.20891
Authors: Koichi Namekata, Yash Kant, Zhizheng Liu, Ryan D Burgert, Yuancheng Xu, Kuan Heng Lin, Emmett Steven, Julien Philip, Li Ma, Andrea Vedaldi, Paul Debevec, Ning Yu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: SIGGRAPH 2026, Project page: this https URL

Abstract:Filmmaking demands precise motion control and reference image compositing – capabilities that existing methods treat separately. Point-track-conditioned image-to-video models restrict content insertion to the first frame, while reference-to-video models lack fine-grained spatial-temporal control over how reference content integrates across frames. We present Go-with-the-Track, which unifies both capabilities by jointly conditioning on multiple reference images and reference-anchored point-tracks – extending conventional point-tracks to explicitly establish correspondences between generated frames and reference images, thus enabling precise compositing and motion control throughout the video. To achieve this, we introduce spatially-aware point-track embeddings that encode the full sequence of point-track coordinates using a coordinate-wise MLP followed by temporal pooling. This representation captures the spatial characteristics of each point-track (serving as a unique identifier), while the embedding similarity correlates directly with spatial proximity, enhancing the model’s ability to distinguish and associate point-tracks. We inject these point-track embeddings into a video diffusion transformer via a lightweight adapter, resolving the pixel-to-patch resolution mismatch while avoiding the substantial motion detail loss inherent in naive point-track subsampling. We use a hybrid training strategy to train jointly on dynamic, static, and synthetic scene video datasets to boost motion controllability. Experiments demonstrate that Go-with-the-Track achieves superior motion and reference control in a single model and enables new capabilities: multi-reference conditioned video generation with point-track driven compositing, as well as camera control for both static and dynamic scenes. 
Project Page: this https URL

[CV-274] Fine-grained Human Motion Understanding with Language Models

Link: https://arxiv.org/abs/2606.20888
Authors: Thomas Markhorst, Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we propose an LLM-based model for fine-grained human motion understanding that represents motion as a sequence of skeletal poses with explicit timestamps for each pose. Each pose encodes body joint positions and is temporally grounded with timestamp tokens, allowing the model to reason about motion order, duration, and rhythm. To study what supervision is needed for motion-language reasoning, we construct a diverse training mixture spanning pose captioning, pose question answering, motion captioning, and motion question answering. Our ablations show that the primary gains come from the diversity of pose- and motion-level supervision, while staged training provides a smaller additional benefit. Unlike previous works that rely on ground-truth 3D motion capture, our approach supports both 2D and 3D skeletal motion representations through a unified pose encoder, and can optionally incorporate video to provide contextual information. Extensive experiments on BABEL-QA, HuMMan-QA, CompMo, NTU-RGB+D, and QEVD-Coach demonstrate that our method achieves state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of explicit temporal encoding and diverse pose- and motion-level supervision for fine-grained human motion understanding. Notably, even when using only 2D skeletal input, our approach surpasses previous 3D-based methods.

[CV-275] Toward Parking Spot Occupancy Recognition: A Self-Supervised Approach

Link: https://arxiv.org/abs/2606.20886
Authors: Luan Marko Kujavski, Rayson Laroca, Paulo Lisboa de Almeida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)

Abstract:As urban areas expand, automatic monitoring of parking lots becomes essential for efficient and sustainable cities. This work proposes a self-supervised approach for parking spot occupancy recognition that requires no labeled samples from the target parking lot. Building upon a self-supervised transfer learning fine-tuning protocol, the proposed training strategy consists of two self-supervised stages: first on unlabeled generic data and then on unlabeled target-specific data, followed by supervised fine-tuning using only generic parking lot labels. We adopt SimCLR with a ResNet-50 encoder and evaluate the method under a leave-one-out cross-environment protocol on three public datasets: PKLot, CNRPark-EXT, and PLds. We also introduce a two-stage deployment strategy in which a Strong General Model is initially deployed, followed by a Specialized Model that incorporates unlabeled images collected during the first N days of deployment in a self-supervised manner. Experimental results show that the Strong General Model alone outperforms supervised and self-supervised baselines, achieving an average accuracy of 97.2%, which further improves to 97.8% with the proposed two-stage strategy. These results demonstrate that self-supervised learning enables a scalable and label-efficient solution for real-world parking occupancy monitoring. Our trained models and source code are publicly available at this https URL.

[CV-276] FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation ICML2026

Link: https://arxiv.org/abs/2606.20867
Authors: Duc Minh Nguyen, Nghiem Tuong Diep, Binh Gia Nguyen, Trong-Bao Ho, Doanh Le, Tan Q. Nguyen, Thien-Loc Ha, Nhiem Tran, Bao Thach, Nhat X. Tran, Tuan A. Tran, Artur Habuda, Philip Lund Møller, Tran Nguyen Le, Daniel Sonntag, Matthias Niepert, Khoa D. Doan, Vu Duong, Hung Ngo, Minh N. Vu, Duy M. H. Nguyen, An Thai Le, Ngo Anh Vien
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026. Project page: this https URL

Abstract:Vision-Language-Action (VLA) models enable general-purpose robotic control via large-scale multimodal pretraining, yet their effectiveness under few-shot imitation learning remains limited. We conduct a systematic stress test of state-of-the-art VLA models and show that performance degrades sharply as demonstrations are reduced, revealing a key weakness of existing adaptation strategies. To address this, we introduce FOCA, a future-oriented conditioning framework for data-efficient VLA adaptation. FOCA combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations, enabling long-horizon reasoning in latent space without pixel-level prediction. This formulation naturally supports action-free co-training with synthetic videos from video world models and can be interpreted as learning a future-conditioned value-like representation. Extensive experiments demonstrate FOCA achieves 95.7% success with 20 demonstrations on LIBERO, improves 7-12% on RoboCasa, and delivers up to 26% absolute gains on real robots, establishing a new state of the art in few-shot VLA adaptation.

[CV-277] Stochastic Signed Distance Processes

Link: https://arxiv.org/abs/2606.20856
Authors: Hiroki Sakuma, Masatoshi Okutomi
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Multi-view surface reconstruction is a core problem in computer vision. One prominent line of work represents the surface implicitly as a signed distance field (SDF), optimizing it based on the photometric loss between rendered and observed pixel colors. These approaches typically employ SDF-based volume rendering to obtain a differentiable relaxation of discontinuous visibility along rays, thereby reducing reliance on silhouette supervision. In this paper, we reformulate SDF-based volume rendering as probabilistic surface rendering, where each pixel color is modeled as a mixture distribution induced by the random first ray-surface intersection. To this end, we introduce Stochastic Signed Distance Processes (SSDP), which model the SDF along each ray as a stochastic process, inducing a first-passage-time distribution for each ray. We then derive the first-passage probability for each sampling interval based on Bayesian filtering, together with its practical approximation for parallel rendering. We further show that NeuS, an existing SDF-based volume rendering method, arises as a special case of our formulation. Experiments on the DTU and MobileBrick datasets demonstrate that our method outperforms baselines in both surface reconstruction and uncertainty quantification, supporting the effectiveness of our first-passage formulation. Our code is available at this https URL.

[CV-278] Translating Inference-Time Control to Radiology Vision-Language Models: Activation Steering for Pneumonia Classification on Chest X-rays

Link: https://arxiv.org/abs/2606.20852
Authors: Eduardo Moreno Judice de Mattos Farina, Mateus A. Esmeraldo, Felipe Akio Matsuoka, Paulo Eduardo de Aguiar Kuriki, Felipe Campos Kitamura
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Inference-time engineering can alter model behavior without fine-tuning. However, its utility for improving diagnostic performance in medical vision-language models (VLMs) remains unclear. We aim to evaluate whether Contrastive Activation Addition (CAA) can improve pneumonia classification in chest radiograph VLMs without updating model weights. Three frozen chest radiograph VLMs (MedGemma-4B-IT, NV-Reason-CXR-3B, and CheXOne-3B) were evaluated on the public Kermany pneumonia test set. Classification was based on the logits of the tokens Yes and No under a binary prompt. Steering vectors included a 30-pair answer-bias control, a 30-pair pneumonia text contrast, and an image-conditioned contrast derived from 30 pneumonia and 30 normal development images. A deterministic 200-image development set was used for layer and scale selection (100 images) and threshold calibration (100 images). Performance was assessed using ROC-AUC, PR-AUC, F1 score, threshold analyses, reverse-vector controls, random-vector controls, and conditional bootstrap confidence intervals. Fixed-threshold F1 improvements were frequently observed but did not consistently indicate improved diagnostic performance. For MedGemma-4B-IT. NV-Reason-CXR-3B showed the strongest benefit: calibrated F1 improved from 0.7692 in the zero-shot setting to 0.8619 with pneumonia-text steering and to 0.8727 with image-conditioned steering. For CheXOne-3B, pneumonia-text steering increased calibrated F1 from 0.8528 to 0.8666, although the confidence interval crossed zero. On this public pneumonia benchmark, CAA substantially altered prediction score distributions and operating characteristics without fine-tuning. Meaningful performance gains were observed in one of three evaluated VLMs, suggesting that activation steering may serve as a lightweight approach for adapting medical VLM behavior.
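The two mechanisms named in the abstract above can be sketched in a few lines: scoring a binary prompt from the Yes and No token logits, and building a contrastive steering vector as the difference of mean activations that is then added to a hidden state with a scale. Everything below uses toy lists in place of real model activations; the function names, the scale value, and the choice of layer are illustrative assumptions, not the authors' code.

```python
import math

def yes_probability(logit_yes, logit_no):
    """Two-way softmax over the Yes/No logits, i.e. sigmoid(yes - no)."""
    return 1.0 / (1.0 + math.exp(logit_no - logit_yes))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def steering_vector(pos_activations, neg_activations):
    """Mean activation on positive examples minus mean on negative examples."""
    mean_pos = mean_vector(pos_activations)
    mean_neg = mean_vector(neg_activations)
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def steer(hidden, vector, scale):
    """Add the scaled steering vector to one hidden state (weights stay frozen)."""
    return [h + scale * v for h, v in zip(hidden, vector)]

# Toy activations standing in for one layer's hidden states.
pneumonia_acts = [[1.0, 2.0], [3.0, 4.0]]
normal_acts = [[0.0, 0.0], [2.0, 2.0]]
vec = steering_vector(pneumonia_acts, normal_acts)
steered = steer([0.0, 0.0], vec, scale=2.0)
```

In the study, the layer and scale are selected on a development split and the decision threshold on the Yes probability is calibrated separately, which is why the abstract reports fixed-threshold and calibrated F1 as distinct quantities.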

[CV-279] From Uncertainty to Stability and Fidelity: Guiding Sparse-View 3D Gaussian Splatting with Fisher Information

Link: https://arxiv.org/abs/2606.20842
Authors: Junbao Zhou, Qingshan Xu, Yuan Zhou, Xiaolong Shen, Beier Zhu, Kesen Zhao, Yiming Zeng, Chen Bai, Cheng Lu, Hanwang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising technique for novel view synthesis. However, 3DGS requires dense input views to achieve high-quality rendering. In sparse-view scenarios, 3DGS is prone to overfitting, resulting in noticeable artifacts and degraded rendering quality. Previous methods attempt to address this issue by introducing additional priors (e.g., depth priors) or integrating regularization techniques (e.g., Dropout). However, these methods are often applied without principled guidance. In particular, prior-based augmentation typically samples novel viewpoints randomly, while Dropout-based regularization randomly removes Gaussians. The compounded randomness introduces uncertainty and instability, limiting the fidelity of novel view synthesis. In this paper, we propose a novel method for sparse-view 3DGS that incorporates Fisher Information to quantitatively guide the utilization of geometric priors and regularization. Specifically, our method comprises two key components: (1) Stereo augmentation with Fisher Information. By leveraging Fisher Information, we actively select the most informative supporting views and use depth priors to curate reliable pseudo ground truths, which reduces randomness in augmentation and improves stability and rendering fidelity; (2) Uncertainty-aware regularization. We reduce the instability of Dropout-based regularization by using Fisher Information to quantitatively measure the uncertainty of each 3D Gaussian and adaptively adjust its removal probability, leading to more stable and effective regularization. With these two components, our method effectively mitigates overfitting and improves the stability of optimization in sparse-view 3DGS, resulting in superior rendering fidelity. Extensive experiments show that our method achieves state-of-the-art performance in sparse-view novel view synthesis benchmarks.

[CV-280] NeoLoc-68: End-to-end 68-point neonatal facial landmark localisation in neonatal clinical environments

Link: https://arxiv.org/abs/2606.20823
Authors: Abdullah Bin-Obaid, Maria M. Cobo, Rebeccah Slater, Lionel Tarassenko, Mauricio Villarroel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages, 6 figures, journal paper

Abstract:Facial landmark localisation is a prerequisite for developing automated, non-contact neonatal pain assessment methods. Clinicians use pain scales to judge the severity of pain, many of which rely on facial expression. However, facial landmark detectors trained on adult faces perform poorly in neonatal clinical environments due to frequent occlusions caused by medical equipment, varied head poses, and challenging imaging conditions, including motion blur triggered by sudden pain-related movements. We propose an end-to-end facial landmark detector capable of predicting 68 landmarks on neonatal faces in clinical environments. We combined 37,459 single-face images from 11 public datasets, standardised to 68-point markup, with 1,123 manually annotated frames from a neonatal research dataset (totalling over 76,000 landmarks). A YOLO-based keypoint model was adapted to regress the facial landmarks, initialised with weights from a pretrained neonatal face detector. On public datasets, our proposed model achieved state-of-the-art performance: Normalised Mean Error (NME) = 5.37, Failure Rate (FR) = 12.5%, Area Under the Cumulative Error Curve (AUC) at AUC0.08 = 38.00% and AUC0.1 = 48.70%. On the clinical neonatal test set, before fine-tuning, the model achieved the lowest Detection Failure Rate (DFR) = 5.3% among all baselines and showed strong generalisation. After fine-tuning, performance improved further to NME = 6.36, FR = 22.30%, DFR = 1.77%, AUC0.08 = 29.24% and AUC0.1 = 40.25%. To the best of our knowledge, this represents the first end-to-end 68-point neonatal facial landmark detection model. With further dataset expansion and refinement, it could support downstream tasks in neonatal health monitoring and pain-related facial analysis.
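For readers unfamiliar with the metrics quoted in the abstract above, the sketch below shows how Normalised Mean Error (NME) and Failure Rate (FR) are commonly computed for landmark detectors. The normalising distance (for example inter-ocular distance or face-box size) and the 0.08 failure threshold are assumptions chosen for illustration; the abstract does not state which the authors use.

```python
import math

def nme(pred, gt, norm_dist):
    """Mean Euclidean landmark error divided by a normalising distance."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / (len(errors) * norm_dist)

def failure_rate(nme_values, threshold=0.08):
    """Fraction of images whose NME exceeds the threshold (assumed 0.08)."""
    failures = sum(1 for v in nme_values if v > threshold)
    return failures / len(nme_values)

# Toy example with two landmarks instead of 68.
pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
image_nme = nme(pred, gt, norm_dist=10.0)
fr = failure_rate([0.05, image_nme])
```

Values such as NME = 5.37 are typically this ratio expressed as a percentage, and the AUC figures summarise the cumulative distribution of per-image NME up to the stated cut-off.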

[CV-281] GroundShot: Visually Consistent Multi-Shot Long Video Generation via Entity-Grounded Shot Scheduling

Link: https://arxiv.org/abs/2606.20799
Authors: Yixuan Lai, Tianjia Shao, Kun Zhou, Weijia Dou, Siyu Zhu, Jingdong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generating visually consistent multi-shot videos remains an open challenge. As videos span more shots, inconsistencies can accumulate across shots, causing entities that reappear across shots – characters, objects, and locations – to drift away from how they first appear. We observe that viewers judge consistency by comparing each later appearance of an entity with its first clear appearance; the visual quality of this initial appearance sets the consistency ceiling for all that follows. Motivated by this, we present GroundShot, a training-free, model-agnostic agentic framework for entity-grounded multi-shot generation. GroundShot builds an entity-level visual memory online from accepted generated shots: it schedules shots’ generation order by their expected usefulness as entity references, grounds entities from generated videos, verifies their reliability before adding them to memory, and retrieves suitable entity references from memory before each shot is generated. To evaluate this entity-centered view of consistency, we further introduce GroundBench, a diagnostic benchmark that measures consistency at the entity level while isolating controlled challenge dimensions. Experiments show that GroundShot improves multi-shot consistency over existing methods while requiring no additional training or model modification.

[CV-282] World Action Models: A Survey

Link: https://arxiv.org/abs/2606.20781
Authors: Qiuhong Shen, Shihua Zhang, Yue Liao, Qi Li, Zhenxiong Tan, Shizun Wang, Shuicheng Yan, Xinchao Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 57 pages, 6 figures

Abstract:World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at this https URL.

[CV-283] TriMotion: Modality-Agnostic Camera Control for Video Generation ECCV

Link: https://arxiv.org/abs/2606.20774
Authors: Seunghyun Shin, Jifei Song, Wooseok Jeon, Hae-Gon Jeon, Jiankang Deng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted at ECCV

Abstract:Camera motion control is essential for directing viewpoint changes in generative systems. However, existing methods typically condition the generation process on a single specific modality, such as explicit pose trajectories or reference videos, limiting their ability to support heterogeneous user inputs. To address this limitation, we present TriMotion, a modality-agnostic framework for camera-controlled video generation that maps video, pose, and text inputs describing the same camera trajectory into a shared motion embedding space. Learning such a space requires synchronized supervision across modalities. Therefore, we build the Motion Triplet Dataset by extending a Multi-Cam Video Dataset with geometry-grounded motion descriptions derived from camera extrinsics. We further introduce a latent motion consistency objective that leverages the motion embedding space to encourage the generated video to follow the target camera trajectory directly in latent space, avoiding the cost of pixel-space decoding. Extensive experiments show that TriMotion generates high-quality videos that accurately follow the target camera trajectories across all three modalities. Beyond standard generation, the shared motion embedding space also enables flexible applications such as sequential motion composition and cross-modal motion interpolation.

[CV-284] UniSLAD: A Unified Framework for Structural and Logical Industrial Visual Anomaly Detection

Link: https://arxiv.org/abs/2606.20768
Authors: Changyi Li, Chao Yang, Yu Xiao, Kari Tammi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: This work has been accepted for publication in the Proceedings of the 2026 IEEE International Conference on Automation Science and Engineering (CASE)

Abstract:Visual anomaly detection is a fundamental task in industrial automation. While existing approaches have achieved notable progress in identifying structural defects, the detection of logical anomalies remains relatively underexplored. In practice, structural and logical anomalies frequently co-occur in industrial workflows. Therefore, a solution capable of detecting both structural and logical anomalies is crucial for advancing comprehensive anomaly detection research. To address this limitation, we propose a unified framework, termed UniSLAD, which jointly addresses logical and structural anomalies without additional training, enabling a practical solution for dynamic industrial environments. First, we introduce a dual-feature extractor that synergistically integrates a Convolutional Neural Network (CNN) backbone for local texture perception with a Transformer backbone for global contextual reasoning, yielding richer and more comprehensive representations. Building on this foundation, we design dual-granularity feature representation modules. At the patch level, memory banks enhanced by the Mahalanobis Transform (MT) preserve representative features and support more discriminative anomaly scoring. At the image level, distribution maps are aggregated using Lower-Upper Mean (LUM) and Power Mean Pooling (PMP), yielding a more robust global representation than conventional average pooling. Extensive experiments on two industrial benchmarks demonstrate that UniSLAD achieves competitive performance in comprehensive anomaly detection, scoring 99.4% and 93.1%, respectively. Furthermore, ablation studies verify the individual contributions and effectiveness of each proposed component.
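
For intuition on the image-level aggregation, the sketch below implements a generic power mean over a flattened anomaly score map and compares it with plain average pooling. The abstract does not give the exact PMP or LUM formulas, so the function name `power_mean_pool`, the exponent `p`, and the toy score map are assumptions for illustration only, not the authors' implementation.

```python
from typing import Sequence


def power_mean_pool(scores: Sequence[float], p: float = 3.0) -> float:
    """Generalized (power) mean of non-negative scores.

    p = 1 gives the arithmetic mean; larger p moves the result toward the maximum.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if p <= 0:
        raise ValueError("this sketch only supports p > 0")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)


# Hypothetical flattened anomaly map: mostly normal patches plus one strong outlier.
score_map = [0.1] * 15 + [0.9]

avg = power_mean_pool(score_map, p=1.0)  # equivalent to average pooling
pmp = power_mean_pool(score_map, p=3.0)  # gives more weight to the anomalous patch
print(f"average pooling: {avg:.3f}, power mean (p=3): {pmp:.3f}")
```

With a single anomalous patch, the power mean sits well above the plain average while staying below the maximum, which is one plausible reading of why such pooling could be more robust than average pooling.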

[CV-285] One Image is All You Need: Agentic One-Shot Image Generation via Text-Based World Models for Long-Tail Spatial Perception

Link: https://arxiv.org/abs/2606.20764
Authors: Keqin Zeng, Shuting Su, Shihao Lin, Ziyue Li, Rui Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Reliable spatial decision automation, such as autonomous driving and maritime surveillance, critically depends on robust visual perception. However, real-world spatiotemporal data exhibits severe heterogeneity, often manifesting as extreme long-tail distributions for safety-critical scenarios. This data scarcity induces dataset shift that degrades detection performance and poses safety risks. While synthetic data generation offers a potential solution, existing generative approaches, such as diffusion models and Generative Adversarial Networks (GANs), often lack explicit spatial grounding and structural constraints, resulting in spatial and physical inconsistencies in generated scenes. To address these challenges, we introduce WMGen-v1, an agentic text-based world model framework for long-tail spatial data generation. WMGen-v1 employs a Large Vision-Language Model (LVLM) to construct a structured scene representation from a single reference image, while a Large Language Model (LLM) performs guidance-based scene expansion under physical plausibility and commonsense constraints. Subsequently, conditioned on the structured semantic representations produced by this reasoning process, a diffusion model generates diverse and physically grounded long-tail training data. Experiments on internal industrial datasets, ROADWork, and LaRS benchmarks demonstrate that WMGen-v1 outperforms baseline approaches. Notably, detectors trained solely on WMGen-v1 synthetic data approach real-only performance on aggregate dataset-level metrics, highlighting its potential to alleviate long-tail data scarcity for downstream spatial perception.

[CV-286] Mirage: a Clean-Label Backdoor against LiDAR 3D Object Detection

Link: https://arxiv.org/abs/2606.20752
Authors: Ziba Parsons, Ang Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Deep neural network-based LiDAR 3D object detection (3DOD) serves as a critical perception component in safety-critical autonomous systems. However, recent studies have revealed its vulnerability to backdoor attacks. Existing attacks typically require white-box access or label modification and focus on geometric attacks such as object disappearance or bounding-box manipulation. In this paper, we present Mirage, a black-box and clean-label backdoor attack against deep neural network-based LiDAR 3DOD. Mirage injects a small number of label-consistent poisoning samples into the training set, causing the model to learn a malicious association between a trigger pattern and an attacker-chosen target class while preserving normal training semantics. As a result, the compromised model behaves normally on benign inputs yet systematically misclassifies triggered objects as the target class during deployment. We evaluate Mirage on multiple state-of-the-art LiDAR 3DOD models and benchmark datasets. Experimental results show that Mirage achieves a 73% misclassification success rate with a poisoning rate of only 0.5%, while maintaining detection performance close to that of benign models.

[CV-287] An approach with Visual and Tabular Mamba to multimodal medical data using Mixed Fusion

Link: https://arxiv.org/abs/2606.20738
Authors: Matheus B. Rocha, Gustavo B. Dettogni, Renato A. Krohling
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages. Accepted to the 36th Brazilian Conference on Intelligent Systems

Abstract:This article presents a complementary approach for integrating multimodal medical data in cancer classification, based on state space models represented by the Mamba architecture. To this end, a mixed multimodal fusion architecture, called Mixed Fusion, was developed and employed to enhance the interpretability of the decision-making process. The proposed approach explores two variants of Mamba: one dedicated to visual processing, responsible for classifying the lesion image and generating probabilities associated with the target classes, and another focused on tabular processing, which uses these probabilities together with clinical and/or sociodemographic data to produce the final diagnosis. The experiments were conducted on two medical datasets: PAD-UFES-20, composed of clinical images and information associated with skin lesions, and NDB-UFES, consisting of histopathological images and sociodemographic data related to oral cancer. Compared with Transformer-based approaches, the results indicate slightly lower balanced accuracy on PAD-UFES-20 and superior performance on NDB-UFES. Additionally, substantial gains were observed in the recall metric. Furthermore, the adoption of the Mixed Fusion architecture enables the application of the Shapley Additive Explanations (SHAP) method, increasing the interpretability of the results. These findings indicate that Mamba-based models constitute a suitable alternative for multimodal classification in medical data, especially in scenarios in which sensitivity is a relevant requirement.

[CV-288] REKEY: Metadata-Grounded Visual-Key Regeneration for Contamination-Resilient VQA Evaluation

Link: https://arxiv.org/abs/2606.20736
Authors: Tengjie Lin, Yutao Sun, Jingwei Ni, Shuhan Ge, Hao-Xuan Ma, Yanting Miao, Wangyue Lu, Mingshuai Chen, Tiancheng Zhao, Jianwei Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Static visual question answering (VQA) benchmarks age quickly: Once the items leak into training corpora, scores can reflect memorization rather than genuine visual ability, thus obscuring real progress. Rebuilding high-quality benchmarks such as VBench requires substantial human annotation, yet each static release can quickly become another leaked artifact. We propose ReKey, a live benchmark protocol that randomly regenerates the answer-bearing local detail, or visual key, in real images at evaluation time. Using human-validated edit slots, ReKey samples fresh instances with new answers, construction-grounded labels, and controlled visual-search difficulty. On VBench, the ReKey regenerated benchmark reveals a sharp score gap across eight frontier vision-language models (VLMs): The original items score 9.5–18.8 percentage points higher than the regenerated variants. By making the visual key renewable, ReKey keeps evaluation fresh as models and training data evolve.

[CV-289] Robust Zero-Shot Generalization for Open-Vocabulary Action Recognition via Task Arithmetic

Link: https://arxiv.org/abs/2606.20734
Authors: Francesca Morandi, Omayma Moussadek, Federico Venturini, Mauro Suardi, Alessandro Banzatti, Francesco Cannarile, Angelo Porrello, Simone Calderara
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the 22nd International Conference on Advanced Video and Signal-Based Surveillance (AVSS)

Abstract:Open Vocabulary Action Recognition (OVAR) enables the recognition of novel actions by leveraging vision-language representations, overcoming the limitations of traditional closed-set approaches. However, achieving robust performance in real-world scenarios typically requires domain-specific fine-tuning, which is often costly and raises privacy and regulatory concerns. In this work, we propose an alternative paradigm that bypasses target-domain training and recombines knowledge from existing datasets and models. Leveraging model merging and task arithmetic, we extract and combine task vectors from models fine-tuned on diverse public OVAR datasets. We show that, in out-of-distribution settings, the resulting merged model achieves better zero-shot generalization than the pre-trained base model. Code is available at this https URL.
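
The general idea of task arithmetic can be sketched in a few lines: a task vector is the element-wise difference between fine-tuned and base weights, and a merged model adds a scaled sum of task vectors back onto the base. The weight layout, the `scale` coefficient, and the toy values below are assumptions for illustration; the abstract does not specify the paper's merging recipe.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge(base: Weights, vectors: List[Weights], scale: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors back onto the base weights."""
    merged: Weights = {}
    for name, base_vals in base.items():
        summed = [sum(v[name][i] for v in vectors) for i in range(len(base_vals))]
        merged[name] = [b + scale * s for b, s in zip(base_vals, summed)]
    return merged


# Toy weights standing in for a base model and two fine-tuned models (hypothetical values).
base = {"proj": [1.0, 2.0]}
ft_a = {"proj": [1.5, 2.0]}
ft_b = {"proj": [1.0, 1.0]}

merged = merge(base, [task_vector(ft_a, base), task_vector(ft_b, base)], scale=0.5)
print(merged)
```

Because the merge only needs existing checkpoints, no target-domain data is touched, which matches the motivation stated in the abstract.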

[CV-290] XmoPipe: A Pipeline for Large-Scale In-the-Wild Human Motion Dataset Construction

Link: https://arxiv.org/abs/2606.20731
Authors: Nathan Salazar, Emmanuel Dellandréa, Mathieu Lefort, Alexandre Meyer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, presented at CASAXR 2026

Abstract:Large-scale human motion datasets are essential for training robust motion models for analysis, synthesis, and understanding. While marker-based motion capture provides precise data, it is costly and limited in scale and diversity. Recent advances in monocular motion capture and video-language understanding open the way to extract plausible motion from unconstrained online videos. We present a scalable pipeline for constructing in-the-wild human motion datasets. From a few keywords, the system retrieves videos, extracts 3D body and facial motion, and generates high-level textual descriptions. The pipeline is flexible, enabling targeted collection of various motions, multi-person interactions, or expressive behaviors. We demonstrate its quality by training motion reconstruction and motion generation models, showing performance comparable to models trained on traditional motion capture datasets and strong cross-dataset generalization.

[CV-291] How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding

Link: https://arxiv.org/abs/2606.20726
Authors: Yixian Tian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce a compact empirical model that quantifies how answer accuracy degrades as a function of frame budget $B$ and temporal distance $D$ in long video understanding – analyzing performance when recalling content from $D$ seconds in the past using a fraction $B$ of total frames. Long-form models operate under strict budgets, yet no prior framework predicts how accuracy degrades as $B$ shrinks and events recede. We fit a weighted least-squares model on ~155,000 binary predictions across ten models and three sampling strategies, deriving a law where logit-accuracy scales linearly in log-budget with a distance-dependent exponent that decays log-linearly with distance. This budget exponent $\alpha(D)$ captures the marginal value of extra frames at distance $D$. The law achieves a cell-level weighted $R^2$ of 0.05 to 0.75 across models. Notably, budget effectiveness at $D = 1000$ s differs by $\approx 7.4\times$ between the best streaming and base models. STREAMINGVLM achieves $\alpha(1000) = 1.26$ (95% CI: [1.06, 1.58]), meaning a tenfold budget increase substantially improves long-distance accuracy, while the best Qwen3-VL base model reaches only $\alpha(1000) = 0.17$ (CI: [0.04, 0.34]). In accuracy space, a $10\times$ budget increase at $D = 1000$ s yields +29 percentage points for STREAMINGVLM versus +4 pp for the base model. Sampling strategies show model-dependent trade-offs: random sampling yields higher base sensitivity but steeper distance decay. We demonstrate how $\alpha(D)$ enables principled budget allocation, including a model-ranking reversal at long distance, and propose it as a diagnostic metric for streaming video models.
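
One way to read the reported exponents: if logit-accuracy is linear in the log of the budget with slope alpha(D), then a tenfold budget increase shifts the logit by alpha(D). The sketch below turns that into an accuracy gain. It assumes a base-10 logarithm for the budget, a natural-log logit, and a 50% starting accuracy; none of these is stated in the abstract, and the helper names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def accuracy_after_budget_change(acc: float, alpha: float, budget_ratio: float) -> float:
    """Accuracy after multiplying the frame budget by budget_ratio.

    Assumes logit(acc) = c + alpha * log10(B), so the logit shifts by
    alpha * log10(budget_ratio). The base-10 log is an assumption.
    """
    return sigmoid(logit(acc) + alpha * math.log10(budget_ratio))


start = 0.5  # assumed starting accuracy, not taken from the paper
for name, alpha in [("STREAMINGVLM", 1.26), ("Qwen3-VL base", 0.17)]:
    gain = accuracy_after_budget_change(start, alpha, 10.0) - start
    print(f"{name}: alpha(1000) = {alpha}, gain = {100 * gain:.1f} pp")

# Ratio of the two exponents, which the abstract reports as roughly 7.4x.
print(f"exponent ratio: {1.26 / 0.17:.1f}")
```

Under these assumptions the gains come out at about 28 and 4 percentage points, in the same range as the reported +29 pp and +4 pp; the exact figures depend on the true starting accuracy in each cell.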

[CV-292] D2HDMap: Non-visible Driveline Map Prior for Online Vectorized HD Map Prediction

Link: https://arxiv.org/abs/2606.20725
Authors: Seojun Shon, Chikao Tsuchiya, Dhaval Bhanderi, David Ilstrup, Hsinmin Cheng, Christopher Ostafew
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 5 tables, to appear in “IEEE Intelligent Vehicles Symposium (IV) 2026 Proceedings”

Abstract:Accurate, up-to-date representations of road structures are critical for the safe operation of autonomous vehicles. Existing systems rely either on costly, maintenance-heavy high-definition (HD) maps, which compromise safety when outdated, or on purely sensor-based online mapping, which struggles with long-range reliability and occlusion. Systems that incorporate map prior information into online mapping seek to overcome the drawbacks of both approaches by combining them. We propose ‘Driveline To HD Map’ (D2HDMap), an online mapping system that injects a lightweight, non-visible driveline prior to guide the estimation of visible road structures such as lane dividers, road boundaries and crosswalks. This prior requires less effort to create and update than the full HD map priors used in other approaches. We also show that training with such a prior can improve generalization at inference time when no prior is available. Ablation studies conducted on the nuScenes and Argoverse 2 datasets demonstrate that models trained using a driveline prior largely retain performance even when priors are not available. On a geographically disjoint split, D2HDMap achieves 44.8 mAP, surpassing the recent state of the art. Additionally, noise-aware training substantially increases robustness to realistic localization error.

[CV-293] Evaluation of Medical Vision Language Models HuluMed and MedGemma and general purpose chatbots Gemma 3 ChatGPT Plus and Claude Pro on real previously unseen wound images

Link: https://arxiv.org/abs/2606.20723
Authors: Yunzhe Xue, Mohammed Saim Ahmed Quadri, Neal Panse, Justin W. Ady, Usman Roshan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Chronic wound assessment remains a clinically challenging task that requires accurate interpretation of wound morphology, tissue composition, vascular characteristics, and infection risk. Recent advances in Vision-Language Models (VLMs) have introduced the possibility of automated multimodal wound analysis through image understanding combined with clinical reasoning. This study evaluates the performance of several general-purpose and medically specialized open-source and proprietary VLMs for clinical wound assessment using an expanded, curated dataset of 20 clinically diverse wounds spanning vascular, surgical, ischemic, venous, lymphedema, and amputation-related etiologies. Six VLMs were evaluated using a structured twelve-question clinical framework covering wound classification, infection risk, vascular intervention recommendations, debridement urgency, wound therapy selection, and advanced management planning. Across 20 wound cases and 240 clinician-graded wound-analysis decisions, ChatGPT achieved the highest overall performance with 174/240 correct responses (72.50%), followed by Claude with 149/240 (62.08%). Among the open-source and medically specialized models, HuluMed achieved the strongest performance with 96/240 correct responses (40.00%), followed by Gemma 3 (81/240, 33.75%), MedGemma 4B (62/240, 25.83%), and MedGemma 27B (42/240, 17.50%). The findings suggest that frontier general-purpose multimodal systems currently demonstrate substantially stronger wound-analysis performance than medically specialized alternatives, highlighting the continued importance of broad multimodal reasoning capabilities alongside domain-specific medical knowledge. Although current VLMs demonstrate promising potential for clinical decision support, substantial limitations remain in advanced wound-management reasoning, procedural planning, and autonomous clinical reliability.
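
The reported percentages follow directly from the counts in the abstract (20 wound cases with 12 questions each, for 240 graded decisions). A short check, using only numbers stated above:

```python
# Counts of correct responses as reported in the abstract.
TOTAL = 20 * 12  # 20 wound cases x 12 questions = 240 graded decisions
correct = {
    "ChatGPT": 174,
    "Claude": 149,
    "HuluMed": 96,
    "Gemma 3": 81,
    "MedGemma 4B": 62,
    "MedGemma 27B": 42,
}

# Accuracy in percent for each model.
accuracy = {model: 100.0 * n / TOTAL for model, n in correct.items()}
for model, pct in sorted(accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{model:<13} {correct[model]:>3}/{TOTAL}  {pct:.2f}%")
```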

[CV-294] MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

Link: https://arxiv.org/abs/2606.20717
Authors: Xuelong Dai, Jianyu Ma, Boyang Ma, Biwei Yan, Yijun Yang, Yue Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a trusted web platform where the evaluator acts solely as an unprivileged third party, such as a merchant or advertiser, controlling only a semantically legitimate, spatially constrained region, such as an ad slot, a sponsored card, or a localized widget. Operating under these realistic constraints, we propose MIRAGE, a novel visual indirect prompt injection framework for targeted next-action hijacking. Our approach leverages diffusion models to generate perceptually benign adversarial images strictly confined to the attacker-controlled boundaries permitted by the trusted service provider. To maximize attack efficacy within such a restrictive setting, we introduce a robust optimization technique combining curvature-aware adversarial diffusion guidance with sparse, dark-pixel residual perturbations. Comprehensive evaluations against prominent MLLM web agent frameworks, specifically SeeAct and OpenClaw, empirically demonstrate the potency, realism, and stealth of our proposed MIRAGE.

[CV-295] CDER-SME: A Cross-Device Event-RGB Micro-Expression Dataset under Multi-Level Stress Induction

Link: https://arxiv.org/abs/2606.20715
Authors: Jingting Li, Hui Sha, Su-Jing Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Micro-expression recognition (MER) in realistic scenarios demands high temporal sensitivity and ecological validity, yet existing benchmarks are largely constrained to laboratory-controlled settings and rigid hardware-coupled sensing. We introduce CDER-SME, a cross-device Event-RGB dataset collected under a multi-level stress induction framework (cognitive and social) to elicit spontaneous emotional leakage. To enable reproducible acquisition with independent, decoupled sensors, we provide a hardware-agnostic alignment pipeline for temporal synchronization and landmark-guided spatial registration. CDER-SME adopts a three-tier structure with 92 subjects and 1,963 expert-annotated samples (Action Units and emotions), including 790 Event-RGB pairs and 210 high-fidelity aligned pairs. We further report a reproducible multimodal baseline, where cross-modal fusion improves performance over single-modality counterparts, supporting the complementarity of event dynamics and RGB cues. By removing the need for coaxial calibration, CDER-SME offers a practical benchmark for cross-device alignment and deployable Event-RGB MER in real-world affective intelligence.

[CV-296] Video2Code: Generating Interactive Webpages from UI Videos via Action-Aware Revisit

Link: https://arxiv.org/abs/2606.20711
Authors: Mingde Xu, Zhen Yang, Yan Wang, Yu Wang, Xijun Liu, Zijun Dou, Wenyi Hong, Xiaotao Gu, Bin Xu, Jie Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 31 pages, 21 figures

Abstract:UI videos provide a natural input for generating interactive webpages, as they capture both webpage appearance and action-triggered state transitions. However, directly applying video-capable vision-language models to this task remains insufficient. Existing models typically rely on sparse sampling or compressed temporal representations, which may miss short action boundaries and break the state-action-state transitions needed to implement webpage behavior. We formulate UI video-to-code generation as executable state-transition recovery from interaction videos, and identify this failure mode as state-transition misalignment. We introduce Video2Code, an action-aware video-to-code approach for recovering executable UI state transitions. Rather than allocating the visual budget uniformly across the video, Video2Code first performs coarse video understanding to locate action-critical regions, then invokes a temporal clipping tool to revisit these regions at higher temporal resolution before generating HTML/CSS/JavaScript code. We instantiate Video2Code with action-aligned video-code supervision and evaluate it under both visual and functional criteria. Experiments show that Video2Code substantially strengthens the underlying open-source model for UI video-to-code generation, improving functional correctness over direct video observation, especially on dense multi-step interactions.
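
The contrast between a uniform visual budget and an action-aware revisit can be sketched as a frame-timestamp sampler: sample the whole video coarsely, then resample the action-critical windows more densely. In the paper those windows come from the model's own coarse pass and a temporal clipping tool; here they are passed in by hand, and the function names, frame rates, and window values are all hypothetical.

```python
from typing import List, Tuple


def uniform_times(start: float, end: float, fps: float) -> List[float]:
    """Evenly spaced timestamps (in seconds) from start to end at the given rate."""
    n = int((end - start) * fps)
    return [round(start + i / fps, 3) for i in range(n + 1)]


def coarse_then_revisit(
    duration: float,
    action_windows: List[Tuple[float, float]],
    coarse_fps: float = 1.0,
    dense_fps: float = 10.0,
) -> List[float]:
    """Coarse pass over the full video plus a dense pass over each action window."""
    times = set(uniform_times(0.0, duration, coarse_fps))
    for start, end in action_windows:
        times.update(uniform_times(start, end, dense_fps))
    return sorted(times)


# Hypothetical 5-second UI recording with one short click interaction at 2.0-2.5 s.
times = coarse_then_revisit(5.0, [(2.0, 2.5)])
print(len(times), times)
```

The dense samples land only inside the action window, so short state transitions are less likely to fall between sampled frames than under uniform sampling with the same overall budget.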

[CV-297] TeleStyle V2: Beyond Content-Preserving Style Transfer with Self-Distillation and Distribution-Matching-Distillation

Link: https://arxiv.org/abs/2606.20709
Authors: Shiwen Zhang, Yifan Xu, Haibin Huang, Chi Zhang, Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Given a content reference and a style reference, content-preserving style transfer requires the model to generate stylized outputs with content and style consistency. We previously introduced TeleStyle V1 to tackle this problem. However, TeleStyle V1 is trained with photorealistic content references and artistic style references, which leaves it unable to handle artistic content references and realistic style references in most cases. In this paper, we design a Self-Distillation data synthesis strategy to construct such triplets from TeleStyle V1. Trained on these self-distilled triplets, TeleStyle V2 supports content-style reference pairs of the forms Realistic-and-Realistic (RnR), Realistic-and-Stylized (RnS), Stylized-and-Realistic (SnR), and Stylized-and-Stylized (SnS). In addition, we find that Distribution Matching Distillation (DMD) preserves the general text-guided image editing capability of the foundation model and fixes the content consistency degradation caused by the supervised fine-tuning (SFT) process. In quantitative evaluations, our TeleStyleV2-QIE-2509-DMD performs at least on par with Qwen-Image-Edit-2509-DMD, demonstrating strong general image editing skills beyond content-preserving style transfer. We also observed a content/style reference order confusion problem in TeleStyle V1 and introduce a prompt enhancer to solve it. TeleStyle V2 uses Qwen-Image-Edit’s VLM encoder, Qwen2.5-VL-7B, to generate the content prompt and style prompt for free. TeleStyle V2 achieves style transfer performance comparable to that of the state-of-the-art commercial model gemini-3-pro-image-preview.

[CV-298] GEOPHYS: The Geometry of Physical Plausibility

Link: https://arxiv.org/abs/2606.20707
Authors: Christian Internò, Alexander Pondaven, Habon Issa, Fabio Pizzati, Francesco Pinto, Markus Olhofer, Ivan Laptev, Philip Torr, Eero P. Simoncelli, Barbara Hammer, David Klindt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encoders. In aggregate, we call them GEOPHYS. First, we show that these signals correlate with human EEG responses to two forms of object-permanence violations. Second, GEOPHYS robustly discriminates physically implausible videos from realistic ones, achieving state-of-the-art physics-violation detection: 98.3% on LikePhys and 93.3% on IntPhys2, whereas V-JEPA 2, GPT-4o, Gemini, and twelve modern video diffusion models perform near chance. Third, used as a best-of-N verifier for physical alignment during video generation, GEOPHYS lifts MAGI-1 24B from 50.01% to 64.50% on PhysicsIQ at 1.5x lower wall-clock and 4.65x lower memory than the V-JEPA 2 world-model verifier. Ultimately, GEOPHYS demonstrates that physical plausibility in videos can be assessed by leveraging the emergent geometric properties of temporal features extracted from image encoders.

[CV-299] MotionPyramid: Hierarchical Motion Representation and Residual Interfaces

Link: https://arxiv.org/abs/2606.20705
Authors: Gao Zhu, Zaishuo Xia, Yubei Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:We ask whether the representational hierarchy seen in perception, from local primitives such as edges to higher level structures such as parts and objects, can be established for motion. In humanoid control, low level actions specify immediate motor commands, while meaningful behavior is organized over longer temporal scales, including contacts, gait fragments, balance recovery, reaching, and whole body skills. We introduce MotionPyramid, a hierarchical action representation that learns such structure from motion data. Starting from a motion tracking teacher, it trains a recursive stack of latent decoders: low level latents decode to immediate full body motor commands, while higher level latents unfold through lower levels into temporally extended motion programs. After pretraining, the hierarchy is frozen and reused by downstream reinforcement learning policies as a family of action interfaces at different control resolutions. Experiments show the learned levels form a motion hierarchy: coarser interfaces improve early learning and motion regularity by constraining exploration to structured segments, while finer interfaces preserve feedback control and final task precision. Representation probes show the hierarchy supports traversal, interpolation, transition, and qualitative composition, exposing editable control handles across temporal scales. Finally, we introduce Residual Interfaces, letting a downstream policy maintain coarse, segment level, and frame level residual commands through the frozen hierarchy. Analogous to residual or skip connections in deep networks, this allows coarse motion programs and fine residual corrections to coexist within one controller. MotionPyramid shows that motion, like perception, can be organized into a reusable multi level representation, providing structured abstraction without sacrificing controllability.

[CV-300] Robust Image-Driven Phenotyping of Ovarian Tumor Cells using Optimized Dynamic Features in Hyperbolic Channels

Link: https://arxiv.org/abs/2606.20703
Authors: Hong-Fei Li, Xi-Lin Gao, Yi-Juan Xiang, Shu-Song Huang, Yi-lin Wang, Chun-Dong Xue, Zhuo Yang, Yong-Jiang Li, Xu-Qu Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 10 figures, 9 tables

Abstract:Label-free, image-based cellular mechanophenotyping in microfluidic devices provides a high-throughput method for single-cell profiling. However, while complex microchannels (e.g., hyperbolic geometries) reveal transient deformation dynamics under continuous extensional stress, the resulting high-dimensional feature spaces are highly susceptible to hydrodynamic artifacts. Flow rate variations often distort discriminative boundaries, linking feature distributions to fluid conditions rather than intrinsic biology. To overcome this, we introduce a stability-guided analytical framework that decouples flow-induced noise from authentic mechanobiological signatures. We tracked the morphodynamic, kinematic, and intracellular optical-density trajectories of healthy and malignant ovarian cells to build a 93-dimensional feature space. Using a cross-flow screening strategy based on structural consistency and statistical persistence, we isolated robust descriptors, creating task-adapted subsets (20 features for binary classification; 25 for cancer subtyping). Variance-attribution analysis confirmed the neutralization of flow-conditioned artifacts; notably, flow-associated variance in the primary principal component fell from 69.9% to 9.3% in the subtyping task. We also found that macroscopic binary discrimination depends on bulk kinematic transitions, while clonal subtyping requires localized intracellular optical heterogeneity. These optimized subsets maintained diagnostic fidelity across multiple machine learning architectures and restricted sampling conditions. This framework establishes a robust, flow-independent foundation for continuous dynamic phenotyping.

[CV-301] Beyond Templates: Revisiting Zero-Shot Remote Sensing through Meta-Prompting

Link: https://arxiv.org/abs/2606.20702
Authors: Eirini Baltzi, Dionysis Christopoulos, Sotiris Spanos, Valsamis Ntouskos, Konstantinos Karantzalos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language models (VLMs) have sparked growing interest in zero-shot Earth Observation (EO) downstream tasks, with further gains enabled by remote-sensing-adapted models. We examine this setting across 17 VLM variants and 12 remote sensing (RS) datasets under Meta-Prompting for Visual Recognition (MPVR), and show that zero-shot performance remains highly sensitive to textual design choices, from the meta-prompts used to guide the LLM in generating class descriptions to the descriptions themselves. We explore why semantically rich LLM-generated class descriptions do not translate into consistent gains over simple domain-adapted CLIP-style descriptions. While LLM descriptions are more semantically expressive, they can also introduce noise in the text embedding space, reducing robustness in downstream tasks. We support this observation through a text log-likelihood analysis in the whitened CLIP feature space, comparing LLM-generated and template-based descriptions. Building on this finding, we study query embedding calibration and show that lightweight calibration of the query space consistently yields strong improvements in zero-shot classification and retrieval. Overall, our results provide practical insight into the trade-off between semantic richness and robustness, and identify embedding calibration as a simple and effective tool for improving zero-shot remote sensing performance.

[CV-302] AEF-Econ: Toward Plug-and-Play Socioeconomic Foundation Embeddings from AlphaEarth for Urban Remote Sensing

Link: https://arxiv.org/abs/2606.20697
Authors: Shuyang Hou, Ziqi Liu, Haoyue Jiao, Lutong Xie, Yaxian Qing, Xiaopu Zhang, Qingyang Xu, Zhangyan Xu, Xuefeng Guan, Huayi Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:AlphaEarth Foundations (AEF) unify global remote sensing foundation embeddings through multimodal self-supervised learning, but their pretraining focuses on physical land-surface signals, limiting plug-and-play use in socioeconomic tasks. We integrate seven heterogeneous data streams across 36 Chinese cities over eight years - AEF embeddings, population, nighttime lights, remote sensing indices, points of interest (POIs), urban morphology, and cross-lingual text - and construct CHN-Econ, a socioeconomic benchmark with 16 labels in three categories. We conduct 31 controlled experiments along five axes: fusion architecture, self-supervised objective, text integration, embedding dimensionality, and normalization. Used alone as a linear probe, AEF achieves R² values of only 0.301 for cross-region and 0.160 for cross-tier evaluation. The five-axis ablated backbone improves these scores to 0.832 and 0.671, respectively, but reveals that low-dimensional semantic streams are consistently suppressed by high-dimensional streams under shared reconstruction. To address this bottleneck, we propose Capacity-Adaptive Reconstruction (CAR), replacing shared reconstruction with per-stream decoders and stream-level losses to mitigate inter-stream capacity competition. CAR further raises cross-region and cross-tier R² to 0.848 and 0.693, and restores collapsed labels from negative R² to a stable range. Using CAR, we infer 14.4 million pixels across 36 cities and eight years and release AEF-Econ, including 128d and 64d compressed versions. Self-diagnostics and case studies show that AEF-Econ captures cross-city hierarchies and intra-urban spatial organization under unsupervised settings, providing a socioeconomic remote sensing foundation embedding complementary to AEF physical embeddings.

[CV-303] Spatio-Temporal Wildfire Spread Prediction in Canada using a Video Swin-Hybrid-U-Net and Satellite Imagery

Link: https://arxiv.org/abs/2606.20693
Authors: Maulik Srivastava, Esha Saha, Hao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 4 figures. Preprint submitted to the International Journal of Wildland Fire

Click to view abstract

Abstract:Background: Wildfires in Canada present increasing threats to ecosystems, communities, and infrastructure, demanding accurate forecasting tools to aid mitigation efforts. Existing models often lack scalability or fail to capture temporal dynamics effectively. Aims: This study aims to develop a deep learning framework tailored to Canadian wildfire spread prediction that captures spatio-temporal patterns in environmental data. Methods: We propose a U-Net architecture integrating a Video Swin Transformer encoder with a convolutional decoder to model three-day sequences of meteorological and environmental variables. Data are exclusively sourced from public repositories via Google Earth Engine, ensuring transparency and scalability. The model is trained and tested on a curated dataset of major Canadian wildfire events from 2014 to 2023. Key results: Our approach achieves strong predictive performance by effectively leveraging spatio-temporal attention to forecast next-day fire incidence maps. Conclusions: The model successfully captures complex wildfire dynamics unique to Canada’s landscape and temporal variability. Implications: This framework paves the way for advanced spatio-temporal wildfire forecasting research and operational applications using publicly accessible datasets.

[CV-304] NeoJaundice-AI: Smartphone-Based Neonatal Jaundice Detection Using Dual-Input Deep Learning and Synthetic Augmentation

Link: https://arxiv.org/abs/2606.20689
Authors: Rahul Patel, Nirjala Jarpula
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 10 figures, 8 tables. IEEE conference format

Click to view abstract

Abstract:Neonatal jaundice (hyperbilirubinemia) is one of the most common conditions affecting newborns worldwide, with India alone recording roughly 15 million cases per year. Early detection is critical, yet standard diagnosis requires blood tests that are often impractical in rural clinics where laboratory facilities are limited. This paper presents NeoJaundice-AI, a smartphone-based screening system that uses photographs of a baby’s skin and sclera (eye white) to estimate jaundice severity and predict serum bilirubin levels in under three seconds without requiring internet connectivity. The proposed system is built on a dual-branch EfficientNet-B0 architecture that independently processes skin and sclera images. Deep features are fused with handcrafted YCbCr color statistics to jointly perform four-class severity classification and continuous bilirubin regression. A key contribution is a synthetic jaundice generation method that simulates bilirubin-induced yellowing through controlled YCbCr channel modifications on normal neonatal skin images. This approach addresses data scarcity, particularly for severe jaundice cases and darker Indian skin tones (Fitzpatrick Types IV to VI). In addition, a skin-tone normalization module improves prediction consistency across diverse neonatal complexions. Experimental results demonstrate an overall classification accuracy of 91.8 percent, a clinical sensitivity of 93.5 percent, and a bilirubin mean absolute error of 1.4 mg/dL. After INT8 quantization and ONNX conversion, the model size is reduced to 8.3 MB while maintaining inference times below three seconds on standard Android devices. To the best of our knowledge, this is the first India-focused neonatal jaundice AI system that combines multimodal image fusion, skin-tone adaptation, synthetic data augmentation, and fully offline mobile deployment within a single framework.

[CV-305] ARGUSTRACK: A Multi-View Annotation System for Multi-Object Tracking

Link: https://arxiv.org/abs/2606.20687
Authors: Hao Vo, Duc Nguyen, Ngan Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multi-Camera Multi-Target (MCMT) tracking has emerged as a critical capability for applications ranging from autonomous driving to animal behavior monitoring. While recent advances have yielded sophisticated tracking algorithms, the availability of annotated multi-view data remains a significant bottleneck. Existing annotation tools predominantly support single-camera workflows or rely on LiDAR sensors, making cross-view labeling tedious and impractical for camera-only setups. We present ARGUSTRACK, a multi-camera annotation system that addresses these limitations by enabling annotators to work directly on a bird’s-eye-view (BEV) plane. Given calibrated camera parameters, a single ground-plane annotation is automatically projected into 2D bounding boxes across all relevant views, inherently ensuring identity consistency without manual cross-view alignment. To further accelerate the labeling process, ARGUSTRACK incorporates two complementary mechanisms: a Temporal Aware module that propagates annotations from preceding frames to initialize new ones, requiring only minor positional adjustments; and a Multi-camera Semi-annotation module that leverages off-the-shelf 2D detectors combined with foot-point estimation to automatically generate candidate BEV positions for annotator verification. We evaluate ARGUSTRACK through a pilot study on multi-camera broiler tracking and demonstrate that it substantially reduces annotation time compared to conventional single-camera labeling workflows.

[CV-306] Shear-Free Viewport Magnification for 360-Degree via Spherical Mobius Boosts

Link: https://arxiv.org/abs/2606.20684
Authors: Boyang Li, Hezhao Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Viewport-adaptive 360-degree imaging seeks to allocate a fixed sampling budget to the region a viewer is likely to observe. Existing view-biased projections increase viewport resolution through non-conformal warps, which can introduce anisotropic stretching and shear. We formulate spherical Mobius boosts as exact conformal maps for fixed-budget viewport magnification. The continuous spherical warp has quasiconformal dilatation K = 1, reallocating samples toward a target direction while preserving local angles. On a SUN360 saliency audit with 72 panoramas and 216 paired viewport targets, C1 Mobius boosting improves viewport PSNR over optimized offset cubemap on all paired cases, with case-level median gain +3.26 dB, image-level median gain +3.23 dB, and panorama-level bootstrap 95% CI [+3.15, +3.33] dB. Pareto analysis shows that this is not a free global-quality improvement: C1 trades full-sphere WS-PSNR for shear-free viewport fidelity. Prediction-error and filtering studies identify the operating envelope: strong boosts are useful for accurately targeted viewports, while large target uncertainty calls for weaker boosts or fallback. These results position Mobius boosting as a geometric primitive for prediction-conditioned foveated 360-degree resampling rather than a universal encode-once layout.
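The conformal reallocation this abstract describes can be illustrated with a textbook Möbius scaling of the sphere: stereographically project from the point opposite the target, scale the plane, and project back. This is only a sketch under assumptions. The function name `mobius_boost` and the `scale` parameter are hypothetical, and the paper's "C1" variant and resampling details are not reproduced here.

```python
import numpy as np

def mobius_boost(points, target, scale):
    """Conformal boost of unit vectors (N, 3) toward or away from `target`.

    Illustrative only: stereographic projection from the antipode of
    `target`, uniform scaling of the plane, then inverse projection.
    """
    p = np.asarray(points, dtype=float)
    t = np.asarray(target, dtype=float)
    t = t / np.linalg.norm(t)
    c = p @ t                         # cosine of the angle to the target
    q = p - c[:, None] * t            # component perpendicular to the target
    out = np.tile(-t, (p.shape[0], 1))  # the antipode is a fixed point
    ok = c > -1.0 + 1e-12             # every point except the antipode
    # Plane coordinate has length tan(theta / 2); scale it uniformly.
    w = scale * q[ok] / (1.0 + c[ok])[:, None]
    r2 = np.sum(w * w, axis=1)[:, None]
    # Inverse stereographic projection back onto the unit sphere.
    out[ok] = (2.0 * w + (1.0 - r2) * t) / (1.0 + r2)
    return out

pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
boosted = mobius_boost(pts, target=[0.0, 0.0, 1.0], scale=2.0)
```

With `scale > 1`, directions near the target are pushed outward, so the neighbourhood of the target occupies more of the sphere. A resampler would typically apply the inverse map (`1 / scale`) to output sample directions to look up source pixels.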

[CV-307] Open Annotations and Synthetic Data for Field Localisation in Indian Bank Cheques

Link: https://arxiv.org/abs/2606.20682
Authors: Jaganadh Gopinadhan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Automated cheque processing requires localising key fields (date, legal amount, IFSC code, account number, signature, and payee name) before any recognition step. The IDRBT Cheque Image Dataset is, to our knowledge, the only public collection of Indian bank cheques, but it ships without field annotations and with no stated licence, so its redistribution terms are unclear. We address both limitations. First, we release six-field bounding-box annotations for all 112 cheques in the dataset, distributed annotations-only and keyed to the original filenames so that the IDRBT redistribution terms are respected. Second, we release 295 fully redistributable synthetic cheque images produced by a cut-paste pipeline that composites annotated field regions from real cheques onto content-erased, bank-specific canvas templates; because patches are pasted at their source coordinates, annotations carry forward unchanged. Third, we provide a ResNet-50 direct-regression baseline that predicts all six fields in a single forward pass, and use it for a controlled test of the synthetic data. The test is sobering: because cheque layouts are rigid, a no-learning baseline that simply predicts each field’s mean training box already reaches 0.691 mean IoU and 80% accuracy at IoU = 0.5, and once seed variance and training compute are accounted for, the cut-paste synthetic data yields no measurable improvement over real data alone (an equal-compute real-only model matches or beats the synthetic-augmented model on every aggregate metric). We report this negative result in full, since it cautions against assuming appearance-only augmentation helps fixed-layout documents and points instead to layout-varying synthesis. The annotations and synthetic images are released as reusable resources on the Hugging Face Hub under permissive licences.
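The no-learning baseline mentioned in this abstract (predict each field's mean training box) is simple enough to sketch. The version below is an illustration, not the paper's code. The `(x1, y1, x2, y2)` box format, the array shapes, and the reading of "accuracy" as the fraction of boxes with IoU of at least 0.5 are assumptions, and the names `iou` and `mean_box_baseline` are hypothetical.

```python
import numpy as np

def iou(a, b):
    """IoU of boxes in (x1, y1, x2, y2) format; inputs broadcast over (..., 4)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    x1 = np.maximum(a[..., 0], b[..., 0])
    y1 = np.maximum(a[..., 1], b[..., 1])
    x2 = np.minimum(a[..., 2], b[..., 2])
    y2 = np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter)

def mean_box_baseline(train_boxes, test_boxes, thresh=0.5):
    """Boxes have shape (num_images, num_fields, 4).

    Returns (mean IoU, fraction of test boxes with IoU >= thresh) when every
    test image is assigned the per-field mean of the training boxes.
    """
    mean_box = np.mean(np.asarray(train_boxes, dtype=float), axis=0)
    scores = iou(mean_box[None], test_boxes)  # (num_test, num_fields)
    return float(scores.mean()), float((scores >= thresh).mean())

# Toy data: one field, two training cheques, two test cheques.
train = [[[0, 0, 10, 10]], [[2, 0, 12, 10]]]
test = [[[1, 0, 11, 10]], [[6, 0, 16, 10]]]
mean_iou, acc = mean_box_baseline(train, test)
```

On rigid layouts the training mean already overlaps most test boxes, which is why a baseline like this is a useful floor before crediting any learned model or augmentation.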

[CV-308] A UAV-Based Multi-Modal Vision System for Automated Sideslope Deformation Monitoring and Hazard Detection

Link: https://arxiv.org/abs/2606.20681
Authors: Jingfeng Zhang, Yi Li, Xianchong Liang, Huan Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 14 figures

Click to view abstract

Abstract:Slope hazards constitute a major safety threat to expressway infrastructure, and their evolution is typically manifested as slow surface deformation. Conventional manual inspection suffers from low efficiency and inadequate operational safety, especially on severely deteriorated slopes. Accordingly, there is an urgent need for an automated, high-precision solution capable of large-area slope observation and analysis. This study aims to develop a highly automated workflow for slope hazard detection using Unmanned Aerial Vehicle (UAV)-borne Light Detection and Ranging (LiDAR). The proposed workflow consists of a shared data-acquisition and ground-surface extraction stage, a single-observation hazard-screening branch based on RandLA-Net, and a multi-epoch deformation-monitoring branch based on grid-wise elevation differencing. To validate the effectiveness of the proposed system, we conducted multiple UAV-borne LiDAR data-acquisition flights in real expressway slope environments. The results show that the workflow can extract usable ground-surface point clouds under vegetation cover, identify potential hazard zones from single-observation point clouds, and quantify centimeter-level elevation changes using multi-epoch grid differencing. This study establishes an end-to-end UAV-borne LiDAR-based workflow for slope inspection and demonstrates its feasibility through controlled experiments, field tests, and simulation-based validation, thereby providing an implementable solution for automated slope-hazard monitoring and intelligent early warning.
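Grid-wise elevation differencing, as named in this abstract, can be sketched by rasterising each epoch's ground points into cells and subtracting the per-cell mean heights. This is a minimal illustration under assumptions: the use of the mean as the cell statistic, the cell size, and the function names are hypothetical, and the paper's ground-extraction and registration steps are not covered.

```python
import numpy as np

def grid_mean_elevation(points, cell, origin, shape):
    """Mean z per grid cell for points (N, 3); empty cells are NaN."""
    pts = np.asarray(points, dtype=float)
    ix = np.floor((pts[:, 0] - origin[0]) / cell).astype(int)
    iy = np.floor((pts[:, 1] - origin[1]) / cell).astype(int)
    inside = (ix >= 0) & (ix < shape[0]) & (iy >= 0) & (iy < shape[1])
    flat = ix[inside] * shape[1] + iy[inside]
    n = shape[0] * shape[1]
    sums = np.bincount(flat, weights=pts[inside, 2], minlength=n)
    counts = np.bincount(flat, minlength=n)
    mean = np.full(n, np.nan)
    filled = counts > 0
    mean[filled] = sums[filled] / counts[filled]
    return mean.reshape(shape)

def elevation_change(epoch_a, epoch_b, cell, origin, shape):
    """Per-cell elevation change (epoch_b minus epoch_a); NaN where either is empty."""
    return (grid_mean_elevation(epoch_b, cell, origin, shape)
            - grid_mean_elevation(epoch_a, cell, origin, shape))

# Toy example: two epochs over a 3 x 1 grid of 1 m cells.
epoch_a = [[0.2, 0.2, 10.0], [0.8, 0.4, 10.2], [1.5, 0.5, 5.0]]
epoch_b = [[0.5, 0.5, 10.05], [1.2, 0.3, 5.0]]
diff = elevation_change(epoch_a, epoch_b, cell=1.0, origin=(0.0, 0.0), shape=(3, 1))
```

Cells that are empty in either epoch stay NaN, so missing coverage is not mistaken for zero deformation.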

[CV-309] Beyond ROC-AUC: Operating-Point Performance Reporting for Biometric Verification

Link: https://arxiv.org/abs/2606.20680
Authors: Ajan Ahmed, Masudul H. Imtiaz
Categories: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:A biometric verifier is often deployed with a strict false match budget, so only a narrow, low false match rate (FMR) slice of the score range is used. A reporting standard for this setting already exists. ISO/IEC 19795-1 asks for error rates at stated operating points, for the detection error tradeoff (DET) curve as the view of the trade-off between FMR and the false non-match rate (FNMR), and for an interval of uncertainty on every value. In practice, a single area under the receiver operating characteristic curve (ROC-AUC), the equal error rate (EER), or a verification accuracy is still reported as the resolution, which is a threshold-independent summary that the standard does not endorse. The full ROC-AUC averages the true match rate (TMR) with equal weight over the whole FMR range from 0 to 1, so almost all of its weight is placed where the system is never operated; low-FMR behavior can then be hidden, and the order of two systems can even be reversed. The guideline is revisited in this paper and tested against seven pretrained matchers across four modalities, face, voice, iris, and fingerprint, each reported with bootstrap confidence intervals and paired bootstrap tests. A system that looks stronger on full ROC-AUC is shown to be significantly worse at FMR = 10^-3. For face, a higher full AUC was obtained by FaceNet, whereas a higher TMR at FMR = 10^-3 was obtained by ArcFace, and both gaps were significant with non-overlapping intervals. Hence, the DET curve and the FNMR at a fixed FMR are re-iterated in this paper as the primary report, with ROC-AUC and EER retained as supplementary context.
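The contrast this abstract draws between a threshold-free summary and an operating-point report can be made concrete with a small sketch. The helper names `tmr_at_fmr` and `roc_auc` and the toy scores are illustrative assumptions, not the paper's code. Higher scores are assumed to mean "more likely a match".

```python
import numpy as np

def tmr_at_fmr(genuine, impostor, fmr_target):
    """True match rate at the strictest threshold with FMR <= fmr_target."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.sort(np.asarray(impostor, dtype=float))[::-1]  # descending
    # At most k impostor comparisons may be accepted.
    k = int(np.floor(fmr_target * impostor.size))
    # Accept only scores strictly above the (k + 1)-th largest impostor score.
    threshold = impostor[k] if k < impostor.size else -np.inf
    return float(np.mean(genuine > threshold))

def roc_auc(genuine, impostor):
    """ROC-AUC as the probability that a genuine score beats an impostor score."""
    g = np.asarray(genuine, dtype=float)[:, None]
    i = np.asarray(impostor, dtype=float)[None, :]
    return float(np.mean((g > i) + 0.5 * (g == i)))

genuine = [0.9, 0.8, 0.7, 0.2]
impostor = [0.6, 0.5, 0.4, 0.3, 0.1]
auc = roc_auc(genuine, impostor)                # averages over all thresholds
tmr_strict = tmr_at_fmr(genuine, impostor, 0.0)  # no false matches allowed
```

The AUC pools behaviour across every threshold, while `tmr_at_fmr` looks only at the region a deployed verifier would actually use. A real report would also attach bootstrap intervals to each value, as the abstract recommends.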

[CV-310] MemoryVAM: Integrating Memory into Video Action Model for Robot Manipulation

Link: https://arxiv.org/abs/2606.20679
Authors: Yuxin Jiang, Chang Yu, Yunuo Chen, Xiang Feng, Yin Yang, Nishank Gite, Chenfanfu Jiang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Video-world-model policies learn action-relevant representations by predicting future observations. However, they condition on only a short observation window, which renders long-horizon manipulation non-Markovian when the correct action depends on earlier events that are no longer visible. We present MemoryVAM, an episodic memory mechanism for video-world-model policies. We employ a Recap-Cue (RC) module, in which a Perceiver-based Recap Compressor maps per-frame CLIP embeddings into compact memory tokens, and a lightweight Cue Gate estimates task completion from memory and language. These tokens are injected into both the video backbone and the action decoder, aligning policy imagination with episode progress and conditioning actions on history. Our model trains the memory module with video prediction, a delta-reconstruction auxiliary loss, and episode-boundary supervision, requiring no per-frame progress labels. The same mechanism applies to UNet and Diffusion Transformer (DiT) backbones by changing only the cross-attention injection interface. On LIBERO-Mem, our model improves average success from 5% to 42.5%. On real robots, it achieves 78.3% success on counting tasks, 80.0% on spatial recall, and 75.0% on sequential tracking. Project page: this https URL

[CV-311] Democratizing and accelerating AI-driven pathology research through agentic intelligence

Link: https://arxiv.org/abs/2606.20677
Authors: Jiabo Ma, Cheng Jin, Yihui Wang, Hao Jiang, Ling Liang, Yingxue Xu, Junlin Hou, Zhengrui Guo, Zhengyu Zhang, Yifei Xia, Hongyi Wang, Fengtao Zhou, Zhe Xu, Huajun Zhou, Jiarui Ouyang, Qian Zeng, On Ki Tang, Eunhyang Park, Carolyn Glass, Ronald Cheong Kin Chan, Li Liang, Hao Chen
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 4 figures

Click to view abstract

Abstract:Computational pathology has advanced rapidly with the emergence of foundation models, yet widespread adoption remains limited by substantial technical complexity and programming requirements. Here we present PathLab, an autonomous agentic framework that translates natural-language research objectives into executable and validated computational pathology workflows through the structured composition of domain-specific skills and tools. By organizing workflow generation around reusable methodological modules, including data preprocessing, model development, evaluation and interpretation, PathLab enables studies to be specified at the level of scientific intent rather than implementation details. We evaluated PathLab across 12 public datasets spanning four representative task families: region-of-interest classification, whole-slide image classification, segmentation and survival prediction. Across all task categories, PathLab achieved non-inferior performance relative to expert implementations, while consistently enforcing semantic validity of user prompts and proactively rejecting incompatible workflow specifications prior to execution. In controlled user studies, PathLab substantially reduced the time required to generate executable analytical pipelines and enabled domain experts without programming experience to independently design, execute and evaluate computational pathology studies. Together, these results establish PathLab as a reliable interface between biomedical intent and computational execution, enabling computational pathology studies to be designed at the level of scientific questions rather than programming expertise. By lowering technical barriers to advanced AI methodologies, PathLab provides a foundation for the broader democratization of computational pathology.

[CV-312] NeuroShield: A Device-Agnostic Foundation Model for EEG Authentication

Link: https://arxiv.org/abs/2606.20673
Authors: Matin Fallahi, Patricia Arias-Cabarcos, Thorsten Strufe
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:A central challenge in EEG authentication is that models are typically tied to the acquisition settings in which they are trained. In particular, variations in headset hardware, channel layout, and signal duration create heterogeneous recordings that existing models are not designed to handle, causing each new headset or dataset to be treated as a separate model-development problem. This fragmentation limits multi-dataset learning, hinders knowledge transfer, and reduces model reusability. To address this limitation, we present NeuroShield, a reusable foundation model for EEG authentication that learns identity-discriminative embeddings from variable-channel and variable-length EEG recordings through a dual-stage transformer architecture. We pretrain NeuroShield on three public EEG datasets comprising 15,762 subjects and 28,116 sessions, and evaluate transfer on two unseen downstream datasets. Our evaluations show that, after fine-tuning, NeuroShield reduces equal error rate by 0.44–8.06 percentage points relative to the state of the art. NeuroShield further generalizes to segments longer than those seen during training and operates across channel layouts not encountered during pretraining. These results establish NeuroShield as a reusable and adaptable EEG identity encoder across heterogeneous recording settings. We release NeuroShield as open source to support reproducibility and community adoption.

[CV-313] A Projection-Based Surrogate Gradient Interpretation for Neural Codec Wrappers

Link: https://arxiv.org/abs/2606.20671
Authors: Esteban Pesnel, Julien Le Tanou, Michael Ropert, Aline Roumy (COMPACT), Thomas Maugey (COMPACT)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Neural wrappers are learned pre- and post-processing networks designed to enhance the performance of conventional video codecs. Although these approaches can significantly improve compression efficiency, training them remains challenging due to the non-differentiability of video codecs, which arises from the multiple discrete decisions involved in the encoding process. Surrogate gradients have recently emerged as an effective solution for enabling end-to-end learning with conventional codecs. They offer two main advantages: they avoid training an additional network to mimic the codec, and they can improve compression performance. In particular, the recently proposed SCALED method, which leverages the true compression error, has shown strong results for training neural pre-processors such as downscalers. However, this SCALED gradient was originally introduced as a reparameterization trick, which limits its interpretability. In this paper, we show that this surrogate gradient can be interpreted as a first-order local approximation of the video codec, providing insight into its effectiveness. We further demonstrate that it is effective not only for learning downscaling operations, but also for the more challenging task of full neural wrapping with pre- and post-processing networks. Finally, we show that the approach generalizes well across different video codecs, quality factors, and tasks, including multiple downscaling ratios, yielding BD-Rate (PSNR) reductions of up to -23.59% on x264 and -20.07% on VVenC relative to standard resampling baselines.

[CV-314] SPARC: A Multi-Agent System for Electrical Circuit Question Answering

Link: https://arxiv.org/abs/2606.20643
Authors: Mushtari Sadia, Zhenning Yang, Umme Habiba Lamia, Nishat Shawrin, Ang Chen, Amrita Roy Chowdhury
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Electrical circuit diagram QA tasks require complex mathematical reasoning, which remains challenging for multimodal LLMs. We present SPARC, a multi-agent system that answers questions over circuit diagrams by grounding reasoning in executable physics-based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design. It achieves 83% accuracy, with up to a 58% absolute improvement over baselines, while enabling systematic error diagnosis.

[CV-315] A Viscosity Semigroup Framework for Stable Image Reconstruction

Link: https://arxiv.org/abs/2606.20620
Authors: Arina Oberoi
Categories: Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA)
Comments:

Click to view abstract

Abstract:Starting from the axiomatic formulation of scale-space theory, we develop a viscosity-solution framework for multiscale image representations arising from degenerate elliptic-parabolic partial differential equations. Rather than introducing a new semigroup theory, we work within the standard viscosity-solution setting, using comparison principles to obtain well-posedness, uniqueness, and contraction in the supremum norm. This perspective is used to motivate a hybrid reconstruction operator in which a learned inverse map is followed by a nonlinear diffusion evolution. At the continuous level, the diffusion operator satisfies non-expansiveness, which provides stability for the reconstruction process; this framework is then evaluated on a CT-based mesothelioma classification task, where it attains an AUC of 0.875 with negligible variation across epochs, while the baseline model acquires AUC values from 0.49 to 0.80 without a clear convergence pattern. These observations are consistent with the stabilizing role suggested by the discussed viscosity theory.

[CV-316] CourseBlueprint: A Structured Pipeline for Adaptive Pedagogical Video Generation Grounded in Course Corpora

Link: https://arxiv.org/abs/2606.20608
Authors: Md Zabirul Islam, Md Motaleb Hossen Manik, Ge Wang
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Generative text-to-video systems can produce visually fluent educational clips, but they rarely encode the pedagogical content knowledge (PCK) needed for effective instruction, including prerequisite-aware sequencing, learner-adaptive depth, and sustained cognitive engagement. We present CourseBlueprint, a course-grounded pipeline for adaptive pedagogical video generation. Given a topic and learner persona, the system generates a structured teaching blueprint in a single forward pass over an undergraduate biomedical-imaging corpus (BMED 2300; twenty-three lectures, 1,116 slides). Instead of ad-hoc prompt chaining, the pipeline uses typed intermediate representations with validation: a scaffolding module builds a stage-labeled prerequisite concept graph with deterministic cycle removal, an adaptive controller assigns per-concept style specifications, and an engagement generator produces narration following a fixed hook-retrieval-core-analogy-forward contract. A deterministic slide-image override further grounds the rendered video by reusing instructor slides whenever retrieval confidence is high. We also release a reusable benchmark corpus and an evaluation harness combining repeated LLM-judge scoring with regex-grounded objective metrics. In a five-topic ablation, removing the engagement contract reduces the engagement score from 5.00 to 1.20, the adaptive score from 4.80 to 3.40, Flesch readability from 38.0 to 19.8, and analogy and retrieval-prompt counts to near zero. The slide-image override converts a 0/9 corpus-grounding failure into 9/10 successful slide matches on the same topic. These results show that pedagogical video quality depends less on surface fluency than on explicit, typed instructional contracts that make scaffolding, adaptation, engagement, and grounding auditable.

[CV-317] TIDY: Thermal Infrared Image Denoising via Wavelet Domain Entropy and Directional Stripe Index

Link: https://arxiv.org/abs/2606.19813
Authors: Tai Hyoung Rhee, Dong-Guw Lee, Ayoung Kim
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Thermal infrared (TIR) imaging has been a popular choice for field robotics due to its robust perception capability under low-light visual degradation, but it suffers from severe stochastic and fixed-pattern noise that breaks downstream estimation. This noise is intensified indoors due to low thermal contrast and uniform temperature distributions, contributing to the relative lack of indoor TIR deployments. Existing TIR denoising methods exhibit a poor accuracy-efficiency tradeoff, either too slow for online deployment required in robotics or insufficiently robust to severe degradation, while typically being trained on synthetic noise. Addressing these problems, we propose TIDY, a lightweight wavelet-domain denoiser trained on real clean-noisy TIR data. By reformulating TIR denoising in the wavelet domain, TIDY explicitly disentangles noise from structural content, enabling targeted suppression with reduced spatial complexity, significantly improving inference speed over prior methods (~34Hz). TIDY introduces two new metrics, Wavelet Entropy and Wavelet Directional Stripe Index, as complementary loss terms to explicitly suppress stochastic noise and stripe artifacts. Across severe indoor corruption and zero-shot settings, TIDY improves robustness and yields consistent gains in downstream robotics tasks including thermal inertial odometry and monocular depth estimation. Code and dataset are available at: this https URL

[CV-318] A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

Link: https://arxiv.org/abs/2510.23798
Authors: Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring this debris, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.

[CV-319] Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework

Link: https://arxiv.org/abs/2508.02521
Authors: Andrea Di Pierno (1), Luca Guarnera (2), Dario Allegra (2), Sebastiano Battiato (2) ((1) IMT School of Advanced Studies, (2) University of Catania)
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA’s robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at this https URL.
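Confidence-based rejection for open-set classification can be illustrated with a minimal sketch. It assumes the rejection rule is a threshold on the maximum softmax probability. The abstract does not say how LAVA computes confidence, so the function name `predict_with_rejection`, the threshold value, and the `-1` "unknown" label are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_with_rejection(logits, threshold=0.9, unknown=-1):
    """Return the argmax class, or `unknown` when confidence is below threshold.

    Hypothetical rule: confidence = maximum softmax probability.
    """
    probs = softmax(np.asarray(logits, dtype=np.float64))
    labels = probs.argmax(axis=-1)
    confident = probs.max(axis=-1) >= threshold
    return np.where(confident, labels, unknown)

# First sample is clearly class 0; second is ambiguous and gets rejected.
print(predict_with_rejection([[8.0, 0.0, 0.0], [1.0, 1.2, 0.9]]))
```

Raising the threshold trades coverage for reliability: more inputs are flagged as unknown, but the accepted predictions are more likely to be correct.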

[CV-320] Automatic Vehicle Detection using DETR: A Transformer-Based Approach for Navigating Treacherous Roads

Link: https://arxiv.org/abs/2502.17843
Authors: Istiaq Ahmed Fahad, Abdullah Ibne Hanif Arean, Nazmus Sakib Ahmed, Mahmudul Hasan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automatic Vehicle Detection (AVD) in diverse driving environments presents unique challenges due to varying lighting conditions, road types, and vehicle types. Traditional methods, such as YOLO and Faster R-CNN, often struggle to cope with these complexities. As computer vision evolves, combining Convolutional Neural Networks (CNNs) with Transformer-based approaches offers promising opportunities for improving detection accuracy and efficiency. This study is the first to experiment with Detection Transformer (DETR) for automatic vehicle detection in complex and varied settings. We employ a Collaborative Hybrid Assignments Training scheme, Co-DETR, to enhance feature learning and attention mechanisms in DETR. By leveraging versatile label assignment strategies and introducing multiple parallel auxiliary heads, we provide more effective supervision during training and extract positive coordinates to boost training efficiency. Through extensive experiments on DETR variants and YOLO models, conducted using the BadODD dataset, we demonstrate the advantages of our approach. Our method achieves superior results and improved accuracy in diverse conditions, making it practical for real-world deployment. This work significantly advances autonomous navigation technology and opens new research avenues in object detection for autonomous vehicles. By integrating the strengths of CNNs and Transformers, we highlight the potential of DETR for robust and efficient vehicle detection in challenging driving environments.

[CV-321] PHAST-Net: Attention-Guided Physics-Informed Network for Unified Estimation of Ideal Time-Frequency Representations

Link: https://arxiv.org/abs/2606.23665
Authors: James M. Cozens, Simon J. Godsill
Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce PHAST-Net, an attention-guided, physics-informed network for unified estimation of Ideal Time-Frequency Representations (ITFRs), spanning spectral, tempo-based, metrical, and harmonic representations such as Spectrograms, Tempograms, and Metrograms. PHAST-Net learns an application-general mapping from a constellation of wavelet transforms, the proposed Continuous Log-frequency Adaptive Wavelet Transform (CLAWT), to high-resolution, cross-term-suppressed time-frequency (T-F) representations. The proposed constellation of CLAWTs is selected through Cohen’s class kernel analysis to maximise curvature coverage in a logarithmic-frequency T-F plane tailored to harmonic signal structure. PHAST-Net further incorporates a proposed physics-informed auxiliary reprojection loss designed to reconstruct the idealised observed CLAWT constellation from the predicted ITFR and the corresponding Cohen’s class kernels during training. This auxiliary objective promotes transform consistency and energy conservation, mitigates pathological target sparsity, and enhances optimisation stability. Attention layers further promote effective cross-term suppression across the input constellation. The log-frequency formulation also enables Harmonic PHAST-Net, which estimates a Harmonic ITFR that isolates fundamental structure, supporting robust fundamental-only representations for speech and music, such as derived fundamental Tempograms and Metrograms. We further introduce Spline-PHAST-Net, which parameterises detected and associated T-F ridges as continuous spline trajectories, enabling arbitrary-grid re-rendering and signal reconstruction. Trained on an effectively unbounded procedurally generated dataset, PHAST-Net demonstrates improved accuracy over established approaches, providing a unified framework for high-resolution, cross-term-robust analysis of speech, music, and broader nonstationary signals.

[CV-322] NGPS: Structure-Preserving Self-Supervised Denoising via Neighbor-Guided Patch Sampling ECCV2026

Link: https://arxiv.org/abs/2606.23200
Authors: Jaehyun Cho, YoungJoon Yoo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The 19th European Conference on Computer Vision: ECCV 2026

Abstract:Neighboring-slice self-supervised denoising is attractive for volumetric medical imaging, yet inter-slice misalignment breaks anatomical correspondence and often yields ghosting and blurred margins when adjacent slices are used naively as targets. We propose Neighbor-Guided Patch Sampling (NGPS), a lightweight framework that constructs neighboring supervision under local inter-slice misalignment without explicit registration. To avoid learning from misleading targets, prior methods commonly mask discrepant regions, but this stabilizes training at the cost of leaving a non-trivial portion of neighboring evidence unexploited, particularly around high-frequency anatomical boundaries. NGPS addresses this by decoupling structure matching from signal retrieval: for each masked location, it searches a local neighborhood for structurally similar candidate patches using a simple guide image (e.g., fast bilateral filtering), while retrieving the supervision signal directly from the raw noisy neighbor at the matched coordinates. By matching on a noise-attenuated guide while retrieving raw values from neighboring slices, NGPS constructs local pseudo targets without a learned registration module. Across the evaluated CT and synthetic-Rician MRI settings, NGPS improves fidelity and structure-sensitive metrics. Code is available at this https URL .
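The decoupling of structure matching from signal retrieval can be sketched for a single pixel as follows. This is an illustrative reading of the abstract, not the released implementation: the guides are assumed to be precomputed (for example by bilateral filtering), matching uses a plain sum of squared differences, and the function name `neighbor_guided_target` and the `patch` and `search` parameters are hypothetical.

```python
import numpy as np

def neighbor_guided_target(guide_cur, guide_nbr, raw_nbr, r, c, patch=3, search=2):
    """Pick a pseudo target for pixel (r, c) of the current slice.

    Matching is done on noise-attenuated guides; the returned value is read
    from the raw noisy neighbor slice at the matched coordinates.
    """
    half = patch // 2
    pad = half + search
    # Reflect-pad the guides so patches near the border stay in bounds.
    gc = np.pad(guide_cur, pad, mode="reflect")
    gn = np.pad(guide_nbr, pad, mode="reflect")
    rr, cc = r + pad, c + pad
    ref = gc[rr - half:rr + half + 1, cc - half:cc + half + 1]

    best_cost, best_off = np.inf, (0, 0)
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            cand = gn[rr + dr - half:rr + dr + half + 1,
                      cc + dc - half:cc + dc + half + 1]
            cost = float(((ref - cand) ** 2).sum())  # SSD on the guides
            if cost < best_cost:
                best_cost, best_off = cost, (dr, dc)

    # Clamp the matched coordinates to the image, then read the RAW neighbor.
    mr = min(max(r + best_off[0], 0), raw_nbr.shape[0] - 1)
    mc = min(max(c + best_off[1], 0), raw_nbr.shape[1] - 1)
    return raw_nbr[mr, mc], best_off

# Toy check: the neighbor slice is the current slice shifted by (+1, -1).
rng = np.random.default_rng(0)
guide_cur = rng.random((16, 16))
guide_nbr = np.roll(guide_cur, shift=(1, -1), axis=(0, 1))
raw_nbr = guide_nbr + 0.05 * rng.standard_normal(guide_nbr.shape)
value, offset = neighbor_guided_target(guide_cur, guide_nbr, raw_nbr, 8, 8)
print(offset, value)
```

Because the target value comes from the raw neighbor rather than the filtered guide, the noise in the supervision signal is left untouched, which is the property neighboring-slice self-supervision relies on.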

[CV-323] IViT: A Novel Interpretable Visual Transformer for Skin Disease Detection

Link: https://arxiv.org/abs/2606.22892
Authors: Haibiao Li, Di Lin, Xue Jiang, Weiwei Wu, Yanxi Li, Yugang Chi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The clinical diagnosis of skin diseases is susceptible to interference from inter-class similarity of skin lesions, and over-reliance on clinicians’ experience easily leads to subjective bias. Although existing deep learning aided diagnosis methods achieve competitive accuracy, they suffer from the black-box opacity of Vision Transformer (ViT) and poor adaptability to medical few-shot scenarios. Moreover, mainstream explainable algorithms generally face the bottleneck of significant accuracy degradation when improving interpretability. This paper proposes an interpretable ViT (IViT) constrained by Quadratic Programming (QP). The introduced pre-trained transfer learning adapts to few-shot feature extraction. A discrete QP feature selection framework is constructed to screen generic and discriminative features consistent with clinical diagnostic logic. A multi-objective loss function is designed to reduce feature redundancy and optimize activation distribution while preserving classification performance. Experimental results on six standard skin disease datasets show that IViT achieves an accuracy of 93.80%, only 0.21% lower than the baseline, with feature redundancy reduced by 29.5%. Its core activation regions are consistent with clinically concerned lesion areas. The proposed model balances accuracy and interpretability, providing a reliable solution for the clinical deployment of few-shot intelligent skin disease diagnosis.

[CV-324] Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset

Link: https://arxiv.org/abs/2606.22382
Authors: Yosuke Yamagishi, Atsushi Takamatsu, Mototsugu Sato, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages

Abstract:Purpose: To evaluate whether large language model (LLM)-assisted label cleaning can identify label-report discordance in CT-RATE, a large-scale public chest CT dataset. Materials and Methods: After report-level deduplication, 24,446 unique radiology reports were identified. Twelve reports were excluded from the primary GPT-5.4 analysis because of Microsoft Azure AI Foundry content-safety filtering, leaving 24,434 reports and 439,812 label instances across 18 abnormality categories. GPT-5.4-derived binary labels were generated from report text using structured JSON output and compared with existing CT-RATE labels. Discordant instances were adjudicated by radiologists. In addition, 100 randomly sampled reports were manually annotated to compare CT-RATE labels, individual LLM-derived labels, and multi-LLM majority-vote labels against radiologist-annotated reference labels. Results: Overall agreement between GPT-5.4-derived and CT-RATE labels was 96.4%, with Cohen’s kappa of 0.884. Lymphadenopathy showed the lowest agreement and kappa. In discordance review, radiologist adjudication supported GPT-5.4-derived labels in 72 of 97 (74.2%) general discordant instances and 91 of 99 (91.9%) targeted lymphadenopathy discordant instances. Against radiologist-annotated reference labels, multi-LLM majority-vote labels achieved the highest label-macro-averaged F1 score and Cohen’s kappa. Conclusion: LLM-assisted label cleaning identified clinically meaningful label-report discordance in CT-RATE and may support scalable quality improvement of public imaging datasets. The cleaned dataset will be made publicly available to support future research. 
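The agreement statistics quoted above can be reproduced in form with a short sketch. It computes Cohen's kappa from two label vectors and checks that the counts stated in the abstract are internally consistent. The toy label vectors are invented for illustration and are not drawn from CT-RATE.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same items."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    p_observed = float(np.mean(a == b))
    # Chance agreement: product of the raters' marginal label frequencies.
    p_chance = float(sum(np.mean(a == k) * np.mean(b == k) for k in labels))
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Toy example (invented labels): 8 of 10 items agree.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
rater_2 = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
print(round(cohens_kappa(rater_1, rater_2), 4))

# Consistency checks on the figures quoted in the abstract.
assert 24_446 - 12 == 24_434            # reports left after filtering
assert 24_434 * 18 == 439_812           # reports x abnormality categories
assert round(100 * 72 / 97, 1) == 74.2  # general discordant instances
assert round(100 * 91 / 99, 1) == 91.9  # lymphadenopathy discordant instances
```

Kappa is lower than raw agreement whenever the labels are imbalanced, which is why the abstract reports both 96.4% agreement and a kappa of 0.884.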

[CV-325] ZeroGVC: Zero-Shot Generative Video Compression with Autoregressive Diffusion Priors

Link: https://arxiv.org/abs/2606.22371
Authors: Yixin Gao, Xiaohan Pan, Lin Liu, Xin Li, Zhibo Chen, Qi Tian
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent generative video compression methods leverage powerful generative priors to achieve perceptually pleasing reconstructions. However, most existing approaches require additional training to adapt generative models to produce realistic reconstructions from compact representations. In this paper, we propose ZeroGVC, a zero-shot generative video compression framework that leverages pretrained autoregressive diffusion priors for low-delay video reconstruction. ZeroGVC encodes the first frame of each group of pictures (GOP) with an image codec and represents subsequent P-frames through Codebook-Guided Autoregressive Latent Compression. This design is motivated by our observation that the compression scheme of denoising diffusion codebook models is effective in few-step consistency sampling. By selecting compact combinations of reproducible codebook noise vectors, ZeroGVC steers the latent denoising trajectory toward the target P-frame while allowing the decoder to reproduce the same trajectory in only a few denoising steps. In addition, we design an optional bidirectional reference mode that mitigates error propagation by leveraging the next I-frame context without introducing any additional bitrate overhead. Extensive experiments on standard video compression benchmarks demonstrate that ZeroGVC achieves superior perceptual reconstruction quality at ultra-low bitrates without any additional training.

[CV-326] Specificity- and Calibration-Aware Breast Ultrasound Segmentation via Entropy-Guided Boundary Supervision

Link: https://arxiv.org/abs/2606.22308
Authors: Manar Alsaid, Mandip Shrestha, Mohammad Abbas
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 figures, 15 pages, International Conference on Bioinformatics and Biomedicine (BIBM) 2026 at Dallas

Abstract:Lesion segmentation in breast ultrasound involves two related challenges. In images with lesions, speckle noise, low tissue contrast, and posterior acoustic shadowing cause boundary leakage and incomplete contour delineation. In images without lesions, those same artifacts generate false-positive activations in regions resembling solid lesion tissue. This study addresses both failure modes through a single modification to the training objective. Rather than weighting every boundary pixel equally, the proposed loss scales contour penalties by per-pixel predictive entropy and the ground-truth boundary map, concentrating gradient emphasis on lesion margin locations where the network remains uncertain. The loss was evaluated on the BUSI dataset through a controlled ablation against two baselines: a model without boundary supervision and a model with uniformly weighted boundary binary cross-entropy. Across 97 lesion-containing test images, mean Dice scores were statistically indistinguishable between the proposed method and the no-boundary baseline (0.7624 versus 0.7616, paired Wilcoxon p = 0.27), confirming that lesion segmentation quality is preserved. The primary effect appears in specificity. False-positive activations on 20 no-lesion test images fell from 14 of 20 and 19 of 20 for the two baselines to 5 of 20 with the proposed approach (McNemar p = 0.012 and 0.0005). Non-overlapping Wilson 95% confidence intervals confirm the difference is both statistically significant and practically substantial. A post-hoc spatial temperature scaling step further reduced expected calibration error from 0.0201 to 0.0095 without altering segmentation masks. Entropy-guided boundary supervision and spatial calibration thus function as complementary training-level and inference-level refinements that improve specificity and probability reliability within a U-Net framework.
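The entropy-guided weighting described above can be sketched in NumPy. This is a hedged illustration rather than the authors' loss: the normalization by boundary-pixel count, the base-2 entropy scaling, and the function name `entropy_weighted_boundary_bce` are assumptions. In an autodiff framework the weight would typically be treated as a constant with respect to the gradient, which is also an assumption here.

```python
import numpy as np

def entropy_weighted_boundary_bce(prob, target, boundary, eps=1e-7):
    """Boundary BCE where each pixel is weighted by its predictive entropy.

    prob:     predicted foreground probability per pixel
    target:   binary ground-truth mask
    boundary: binary ground-truth boundary map (1 on lesion margins)
    """
    p = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(target, dtype=np.float64)
    b = np.asarray(boundary, dtype=np.float64)

    bce = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    # Binary entropy scaled to [0, 1]; it peaks where the network is unsure.
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)) / np.log(2.0)

    weight = entropy * b  # zero away from the boundary
    return float((weight * bce).sum() / (b.sum() + eps))

# One boundary pixel with target 1: an uncertain prediction is penalized far
# more than a confident correct one.
print(entropy_weighted_boundary_bce([[0.5]], [[1]], [[1]]))
print(entropy_weighted_boundary_bce([[0.99]], [[1]], [[1]]))
```

Compared with uniformly weighted boundary BCE, this weighting leaves confidently resolved margin pixels almost untouched and concentrates the penalty on uncertain ones.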

[CV-327] Delta-Diffusion: Modeling Longitudinal Brain Amyloid-PET Trajectories via Conditional Poisson Diffusion Bridge

Link: https://arxiv.org/abs/2606.22216
Authors: Yongheng Sun, Minhui Yu, Mengqi Wu, Maureen Kohi, Mingxia Liu
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While longitudinal brain PET imaging is the gold standard for quantifying the spatiotemporal accumulation of Beta-amyloid, its widespread clinical utility is constrained by high operational costs and cumulative radiation risks. Recent deep generative models show promise in longitudinal image synthesis; however, they often fail to capture subtle pathological progression due to identity drift and a persistent bias toward trivially replicating baseline signal intensities rather than modeling temporal transition. To this end, we propose Delta-Diffusion, a novel progression-aware framework that redefines longitudinal PET synthesis as a conditional Poisson Diffusion Bridge (PDB) process. Unlike standard diffusion models that start from Gaussian noise, our PDB formulation is mathematically anchored to the subject’s baseline PET, effectively transforming the generative task into a conditional distribution transition of the amyloid trajectory. To handle the heteroscedastic nature of PET imaging, we introduce a physically-grounded Poisson perturbation within a Diffusion Transformer (DiT). This architecture uses adaptive scale-shift modulation to precisely calibrate the synthesis with the elapsed clinical interval and structural MRI context. A volume-of-interest balanced objective is designed to emphasize sparse, high-risk regions of amyloid accumulation. Validated on two cohorts with 542 subjects, Delta-Diffusion demonstrates superior performance in capturing longitudinal variations in amyloid deposition compared to state-of-the-art methods, offering a robust computational framework for tracking disease progression.

[CV-328] Scaling up fine-grained intracranial vessel annotations in computed tomography angiography

Link: https://arxiv.org/abs/2606.21756
Authors: Chu-Hsuan Lin, Alberto Mario Ceballos-Arroyo, Jisoo Kim, Shrikanth M. Yadav, Huaizu Jiang, Lei Qin, Geoffrey S. Young
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 8 figures

Abstract:In this work, we present SemanticVessel, a dataset for fine-grained brain vessel segmentation in computed tomography angiography scans. Based on the detailed contrast provided by dynamic 4D-CTA scans, we generate segmentation traces for arteries and veins. We then use intensity-guided region growing to obtain segmentations of the majority of vascular territories in the human brain, which are refined and annotated with 20 unique arterial classes by an expert radiologist. Unlike existing datasets, where minor arteries are discarded as background content, we merge these minor arteries into a generic arterial class. Due to the multiple-phase acquisition of dynamic 4D-CTA, labels for a single phase can be re-used for other phases in the same series, greatly increasing the size of our dataset with no additional annotation cost. The results show that models trained with the additional generic artery class produce better fine-grained segmentations across the board. We will make our code, annotation GUI, and model weights available to the scientific community. Code, weights, and data will be made available on this https URL

[CV-329] Configurable Algorithms for Histopathologic Cancer Detection on Quantum Hardware

Link: https://arxiv.org/abs/2606.21752
Authors: Nandika Goyal, Glen Uehara, Andreas Spanias
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
Comments:

Abstract:Histopathologic cancer detection is challenging due to tissue variability, staining differences, and subtle visual distinctions between disease classes. We propose two quantum algorithms for this task: a configurable dual-gradient CSWAP circuit (DG-CSWAP) that computes multi-directional edge responses in a single execution via per-pixel local Ry encoding, and a hardware-efficient destructive swap circuit (DG-DST) natively matched to quantum processing unit (QPU) gate sets at substantially lower circuit complexity. We prove algebraic equivalence between DG-CSWAP and DG-DST, enabling a two-circuit QPU validation strategy. A three-stage NISQ mitigation pipeline, including readout error correction, bias subtraction, and slope regression, reduces single-pixel hardware MSE by ~8x. Validated on five quantum processors via Amazon Braket, the method achieves inter-platform Pearson r ~ 0.93-0.94 across all local-simulator pairs. Compared to a prior Quantum Fourier Transform (QFT) based amplitude-encoding baseline requiring 12-qubit global state preparation and a three-model ensemble (85.55% on PatchCamelyon), the proposed method uses shot-based measurements, executes on real quantum hardware, and achieves 79.80% accuracy with a single ResNet-50. A Lite configuration delivers a 17x preprocessing speedup at a 2.59% accuracy cost. To the best of our knowledge, this is the first quantum hardware implementation study with noise mitigation for histopathologic image classification.

[CV-330] Adaptive Beam Selection for Efficient Scanning Probe Tomography ICASSP-2026

Link: https://arxiv.org/abs/2606.21713
Authors: San Dinh, Zichao Wendy Di, Matt Menickelly
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint for ICASSP-2026 paper

Abstract:In X-ray tomography, reconstruction quality generally improves with larger numbers of projections. However, more projections increase experiment costs, acquisition time and the radiation dose imparted to the sample. One mitigation to these trade-offs is to adopt a sequential design of experiments, in which each subsequent measurement is determined as a function of previously acquired data in order to maximize information gain. In practice, a widely used heuristic to maximize information is to align beams with the edges of the sample. A key challenge, however, is that the true sample is unknown, so identifying edge-aligned beams typically requires reconstructing the sample based on available measurements. This work proposes a novel sequential design method that identifies edge-aligned measurements directly from the sinogram, bypassing any reconstruction, thereby improving computational efficiency and reducing the experimental design’s susceptibility to reconstruction errors. Our method dynamically selects the next set of measurement beams by maximizing an acquisition function that balances exploration and exploitation over the domain of all possible measurements, improving reconstruction quality while reducing measurement redundancy.

[CV-331] PaaF: Raising the perceived quality of INR-Based Image Compression

Link: https://arxiv.org/abs/2606.21655
Authors: Lorenzo Catania, Dario Allegra
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Implicit Neural Representations (INRs) have recently emerged as a promising paradigm for image compression, offering a fundamentally different approach from traditional and learned codecs. Nevertheless, INR-based methods for image compression suffer from long encoding times and a consistent performance gap in classic quality metrics such as PSNR. In this work, we explore the potential of purely INR-based compression methods and we propose PaaF (Picture as a Function), a novel INR-based image codec that introduces improved architectural design, adaptive quantization, and an efficient entropy coding scheme. These components are designed to enhance rate-distortion performance while preserving the simplicity and parallelizability of INR-based decoding. Experimental results demonstrate consistent improvements over existing INR-based methods in both quantitative metrics and perceptual quality. These findings highlight the potential of INR-based approaches and contribute to narrowing the gap between functional representations and more established compression paradigms.

[CV-332] Deep Unrolled Networks in Representation Space Applied to MRI Reconstruction

Link: https://arxiv.org/abs/2606.21602
Authors: Efe Ilıcak, Baris Imre, Chloé Najac, Ruben van den Broek, Beatrice Lena, Andrew Webb, Marius Staring
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Deep unrolled networks (DUNs) integrate physical forward models with learned regularization in cascaded network architectures, achieving exceptional performance in inverse problems while maintaining interpretability. While most DUNs operate in the object domain (e.g., image space), recent variants explored representation spaces for improved information flow. However, these methods rely on heuristic methods for data consistency (DC), sacrificing fidelity with measurements. In this work, we introduce DUNE (Deep Unrolled Networks in rEpresentation space), a framework that maintains exact adherence to physical measurements while operating in learned representation spaces. By deriving the DC gradient via the chain rule and implementing it through the Vector-Jacobian Product (VJP), we enable exact backpropagation of measurement residuals into the representation space. This formulation supports diverse architectural backbones, including pre-trained encoders to guide the iterative process. We assess DUNE against state-of-the-art baselines on accelerated MRI reconstruction tasks, demonstrating that exact VJP-based gradients yield superior reconstruction quality and structural fidelity across both single-channel portable low-field and multi-channel clinical high-field MRI acquisitions. The code will be available upon publication at this https URL.
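The chain-rule step mentioned in the abstract can be written out under simple assumptions. The symbols are introduced here for illustration and are not taken from the paper: $z$ is the learned representation, $D$ a differentiable decoder mapping $z$ to an image $x$, $A$ the linear forward operator, and $y$ the measurements, with all quantities treated as real-valued for simplicity.

```latex
\begin{aligned}
x &= D(z), \\
f(z) &= \tfrac{1}{2}\,\lVert A\,D(z) - y \rVert_2^2, \\
\nabla_z f(z) &= J_D(z)^{\top} A^{\top}\bigl(A\,D(z) - y\bigr),
\qquad J_D(z) = \frac{\partial D(z)}{\partial z}.
\end{aligned}
```

The product $J_D(z)^{\top} v$ with $v = A^{\top}(A\,D(z) - y)$ is a vector-Jacobian product, so reverse-mode automatic differentiation can evaluate it without ever forming the Jacobian $J_D(z)$ explicitly. In this reading, the measurement residual is first mapped back to image space by $A^{\top}$ and then pulled into the representation space by the decoder's VJP.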

[CV-333] Unsupervised Susceptibility Distortion Correction of EPI without Calibration Scans via Image Translation-Based Registration

Link: https://arxiv.org/abs/2606.21588
Authors: Wooseung Kim, Sung-Hong Park
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Functional magnetic resonance imaging (fMRI) utilizes echo-planar imaging (EPI) to capture blood-oxygen-level-dependent (BOLD) signals with high temporal resolution. However, EPI is inherently sensitive to magnetic field inhomogeneities, resulting in susceptibility-induced geometric distortions along the phase-encoding (PE) direction. To correct these distortions, conventional approaches rely on additional calibration scans, such as field maps or reverse PE acquisitions, which are not always available in practice. To overcome this limitation, we propose SACRED, a calibration scan-free susceptibility distortion correction framework that corrects geometric distortions via image translation-based registration using only a routinely acquired anatomical T1-weighted (T1w) image and a unidirectional PE BOLD image. SACRED employs an invertible neural network as the image translation backbone to bridge the contrast gap between BOLD and T1w images while enforcing structural consistency through a modality independent neighborhood descriptor. This design enables the use of a mono-contrast similarity objective to train the registration network in an unsupervised manner without requiring distortion-corrected BOLD images. In addition, we incorporate test-time adaptation (TTA) to further enhance performance on out-of-distribution (OOD) data at inference time. SACRED was evaluated on one in-distribution (ID) dataset and two OOD datasets, and was compared with representative fMRI distortion correction methods. The results demonstrate that SACRED significantly outperforms competing methods on both ID and OOD datasets, exhibiting robustness to scanner and population shifts, partly enabled by TTA. The code will be made publicly available upon acceptance.

[CV-334] A Skin-Tone-Aware Dual-Representation Remote Photoplethysmography Framework for Contactless Respiratory Rate Estimation

Link: https://arxiv.org/abs/2606.21511
Authors: Trishna Saikia, Anup Kumar Gupta, Puneet Gupta, Pasi Liljeberg
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures, 7 tables. Keywords: respiratory rate estimation, remote photoplethysmography (rPPG), skin-tone awareness, dual-representation learning, contrastive learning, RR-rPPG dataset, COHFACE

Abstract:Respiratory rate is a vital indicator of pulmonary and cardiovascular health, yet conventional methods for estimating respiratory rate are often intrusive due to their contact-based nature. Remote photoplethysmography offers a promising non-contact alternative and has been widely used for heart rate estimation; however, its potential for respiratory rate estimation remains underexplored. Existing methods typically adapt green and chrominance-based projections originally designed for heart rate estimation, which only partially capture respiratory dynamics. Most prior work focuses on the Eulerian representation with fixed or empirically selected RGB projections. To address these gaps, we propose a skin-tone-aware dynamic RGB signal projection that captures respiratory information. To mitigate the sensitivity of the Lagrangian representation to non-respiratory motion, we introduce a denoising network for motion-based remote photoplethysmography signals. We further design a phase-independent contrastive loss that enables Eulerian and Lagrangian representations to collaboratively learn respiratory rate information. We also introduce RR-rPPG, a respiratory-rate facial video dataset with Indian demographic representation. We evaluate the method on RR-rPPG and the publicly available COHFACE dataset, where it consistently outperforms comparison methods and achieves up to a 42.1% reduction in mean absolute error across the evaluated settings. The proposed framework demonstrates the effectiveness of jointly leveraging skin-tone-aware Eulerian and denoised Lagrangian representations for contactless respiratory rate estimation from facial videos. In addition, RR-rPPG contributes a diverse benchmark resource for future research in remote respiratory monitoring. The code and dataset will be made publicly available upon paper acceptance.

[CV-335] 2D Versus 3D Diffusion for In Silico Training of Interventional X-ray AI Models

Link: https://arxiv.org/abs/2606.21414
Authors: Sampath Rapuri, Jeremy Ko, Benjamin D. Killeen, Russell H. Taylor, Mathias Unberath
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The ability to synthesize realistic X-ray images has catalyzed the development of AI models for X-ray image-guided procedures, which otherwise suffer from a lack of available annotated data. Prior work has demonstrated the effectiveness of mechanistic simulation of digitally reconstructed radiographs (DRRs) as a training data source for a myriad of tasks, including segmentation and anatomical landmark detection, with comparable or superior performance to real data training. However, mechanistic DRR synthesis still relies on the availability of annotated high-resolution anatomical models. Deriving these from CT images of real patients or specimens imposes an undesirable bottleneck on data quantity and variability. In this work, we explore two methods for synthesizing training data: (1) a 3D conditional latent diffusion model that generates CT volumes to use as inputs for mechanistic DRR generation without real 3D anatomical models, and (2) a view-conditioned 2D diffusion model that produces synthetic X-rays. In controlled experiments, we demonstrate that synthetic 2D diffusion-based X-rays can be used to train an anatomical landmark detection model that generalizes to real X-ray images with performance rivaling that of a model trained on real X-ray images. Thus, we provide preliminary evidence that synthetic, 2D diffusion-based training data can substitute for real X-ray data, identifying a promising avenue towards generating large, diverse datasets for training robust AI models in interventional X-ray imaging.

[CV-336] Non-line-of-sight imaging with arbitrary relay surface geometries via 3D Gaussian Transient Rendering

Link: https://arxiv.org/abs/2606.21270
Authors: Yi Wang, Ziyu Zhan, Yuran Wang, Hao Wang, Qiang Liu, Zuoqiang Shi, Lingyun Qiu, Xing Fu
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Imaging objects hidden outside the direct line of sight expands the effective field of view and is critical for applications such as autonomous driving and robotic perception. Despite impressive progress in time-of-flight (ToF)-based non-line-of-sight (NLOS) imaging, real-world deployment remains challenging because practical measurements are often collected over spatially limited, arbitrarily shaped relay regions, conditions that violate the planar-wall and dense-sampling assumptions made by most existing methods. To address these limitations, we propose a LOS-guided NLOS imaging pipeline that imposes no geometric assumptions on the relay surface and naturally supports both confocal and non-confocal configurations. Our method represents the hidden scene using 3D Gaussian primitives and couples them with an efficient, differentiable transient rendering model, enabling end-to-end optimization directly from measured transients. We validate our approach on real-world measurements from both a public dataset and a custom-built capture system. Across settings, our method achieves state-of-the-art reconstruction fidelity under spatially limited, sparsely sampled conditions, and significantly outperforms existing methods on complex, arbitrary relay surface geometries.

[CV-337] Anatomically Consistent TMJ Disc Segmentation via Semantic Anchoring and Clinical Priors

Link: https://arxiv.org/abs/2606.21177
Authors: Dayun Ju, Chanyoung Kim, Sunyoung Jung, Hyo-Jung Jung, Chena Lee, Younjung Park, Seong Jae Hwang
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 10 pages, 3 figures

Abstract:Segmenting the temporomandibular joint (TMJ) disc from MRI is essential for accurate diagnosis of internal derangement, yet it remains unreliable in practice due to its small size, low contrast, and morphological variability. Existing methods, primarily adapted from general segmentation architectures, often produce fragmented or anatomically inconsistent masks, leading to unstable measurements of disc position and shape for downstream diagnosis. To address these challenges, we propose TISC, a TMJ disc segmentation framework that integrates semantic anchoring with clinical metadata-guided boundary refinement. The framework first establishes robust disc localization in the foundation model feature space via a Prototypical Semantic Anchoring (PSA) module that aggregates adjacent-slice MedDINOv3 features and derives a prototype-driven similarity map. It then performs targeted boundary refinement through a Clinical-Metadata Point Refinement (C-MPR) module, with point-wise predictions modulated by Mouth Open Limitation (MOL), a clinical indicator associated with disc displacement without reduction. On a large-scale cohort of 2,488 PD MRI volumes from 1,300 patients, our method achieves up to a 4.96 Dice improvement over strong baselines across diverse architectures, delivering more anatomically coherent and clinically reliable TMJ disc segmentation.

[CV-338] MoECodec: Image Compression for joint human and machine perception via Mixture-of-Experts

Link: https://arxiv.org/abs/2606.21033
Authors: Jiancheng Zhao, Xiang Ji, Yifan Zhan, Zunian Wan, Yinqiang Zheng
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image compression for machines calls for a unified codec that serves multiple downstream vision tasks. Existing approaches either adopt task-specific end-to-end designs, raising parameter and deployment overhead, or rely on transfer-based adaptations that remain externally attached and depend on heuristic task design. A key limitation shared by both lines of work is their largely static computation pattern, which applies similar transformations across tokens even though different image regions differ markedly in semantic importance and complexity for machine perception. We propose MoECodec, a token-aware image compression framework that supports multiple downstream tasks within a single model. MoECodec replaces the FFN layers in a transformer-based compression model with token-wise Mixture-of-Experts (MoE) layers, enabling dynamic, token-level computation conditioned on the input content and task objective. To make MoE effective in a compression model, we introduce a stable routing strategy that combines expert-choice routing with spatial total variation regularization to encourage spatially coherent assignments, and we propose a lightweight expert architecture, Group Shuffle MLP (GShMLP), to control parameter growth. Extensive experiments show consistent improvement over baselines on both conventional image reconstruction and machine tasks.

[CV-339] FlowCodec: One-Step Flow Prior for Generative Image Compression

Link: https://arxiv.org/abs/2606.21030
Authors: Yinhuan Huang, Hao Cao, Pu chen, Wenqi Guo, Zhijin Qin
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion-based image compression methods, leveraging powerful generative priors, have demonstrated remarkable perceptual quality at ultra-low bitrates. However, adapting modern generative models to image compression often relies on carefully engineered conditioning or auxiliary branches, together with substantial retraining, and these costs grow as the models scale. This motivates an open question: Can stronger generative priors be integrated into compression through a simpler, more extensible design? To answer this, we propose FlowCodec, a streamlined framework that plugs pretrained large-scale text-to-image priors (e.g., Qwen-image-2512 and FLUX.1-dev) into ultra-low-bitrate codecs. FlowCodec decomposes the pipeline into two decoupled stages: (1) Latent Compression, which maps clean latents to bitrate-constrained noisy latents; and (2) Latent Transport, which leverages the pretrained prior to refine the noisy latents toward the clean ones in a single step. Notably, FlowCodec requires neither additional conditioning signals nor auxiliary networks. Furthermore, with lightweight adaptation, it can flexibly support multiple bitrates while keeping the number of trainable parameters below 0.54% of the generative backbone. Experiments show that FlowCodec preserves high visual quality at bitrates below 0.05 bits per pixel. The Qwen-image variant significantly outperforms existing methods in terms of LPIPS and DISTS, while both variants deliver higher PSNR and clearly faster encoding than existing one-step diffusion-based methods, with the FLUX variant also maintaining competitive decoding speed.

Artificial Intelligence

[AI-0] CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

Link: https://arxiv.org/abs/2606.23680
Authors: Sikai Li, Shuning Li, Zhenyu Wei, Yunchao Yao, Chenran Li, Mingyu Ding
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Humanoid loco-manipulation is often simplified into a stop-and-go process: walking to an object, stopping to manipulate it, and then resuming locomotion. It also commonly relies on low degree-of-freedom (DoF) end effectors that behave like an open-close grasp primitive. We introduce CoorDex, a learning pipeline that converts high-dimensional body and dexterous hand control into coordinated latent residual control, enabling high-DoF dexterous loco-manipulation on the move. Starting from simulated whole-body and hand demonstrations, CoorDex trains privileged motion tracking teachers for the humanoid body and dexterous hand, distills them into proprioception-conditioned latent priors, and uses the frozen priors as the action space for downstream residual reinforcement learning. A coordinated latent residual policy composes these priors through shared task context and separate body-hand residual heads, preserving natural whole-body motion while improving finger-level contact reliability. CoorDex enables a Unitree G1 humanoid with a 20-DoF WUJI hand to execute dexterous manipulation while in motion, including non-stop bottle grasping and carrying, fridge door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction all fail under the same reward budget, while the latent-prior interface and coordinated residual structure make high-dimensional contact-rich loco-manipulation trainable. Project Page: this https URL

[AI-1] Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

Link: https://arxiv.org/abs/2606.23676
Authors: Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.

[AI-2] PsyBridge: A Hybrid Intelligent Framework for Multi-Dimensional Mental Health Assessment and Decision Support

Link: https://arxiv.org/abs/2606.23673
Authors: Sunil Wanjari, Manish Thakre, Aayushi Asole, Sharwari Raut, Kwabena Adu-Duodu, Yinhao Li, Stanly Wilson
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Mental health assessment commonly relies on isolated screening instruments or data-driven models that often lack interpretability and multi-dimensional integration. Existing approaches frequently focus on individual indicators such as depression or anxiety while providing limited support for comprehensive and explainable decision-making. To address this limitation, this study proposes PsyBridge, a hybrid intelligent decision-support framework designed for multi-dimensional mental health assessment through the integration of clinically validated screening tools, cognitive evaluation, and personality profiling within a unified architecture. The proposed framework incorporates PHQ-9 and GAD-7 assessments alongside cognitive and behavioural indicators using a modular design and a weighted aggregation mechanism to generate interpretable mental health risk classifications and recommendations. To evaluate the framework, a semi-synthetic dataset consisting of 500 patient profiles representing varying severity levels was constructed based on clinically grounded score distributions. Experimental results demonstrate that PsyBridge achieves an overall accuracy of 0.84, outperforming standalone PHQ-9 and GAD-7 assessments while improving precision, recall, and F1-score. Sensitivity analysis and ablation studies further indicate that integrating cognitive and personality components contributes to more stable classification performance and reduces inconsistencies in moderate-risk prediction. The findings suggest that PsyBridge provides a scalable and interpretable approach for AI-assisted mental health decision support, particularly within digital healthcare and telehealth environments.

[AI-3] Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles

Link: https://arxiv.org/abs/2606.23672
Authors: Prateek Agnihotri, Sanchit Jain, Prabhat Agnihotri, Aditya Prasad, Shubham Jain
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 4 figures, 2 tables. 7th Place Solution for the NVIDIA Nemotron Model Reasoning Challenge (Kaggle)

Abstract:This paper presents our algorithmic innovations for the NVIDIA Nemotron Model Reasoning Challenge, focusing on Bit Manipulation Puzzles. In this task, the objective is to discover a hidden logical rule transforming input binary strings to outputs, then apply it to unseen inputs. Large Language Models (LLMs) notoriously struggle here; traditional methods force them to simulate complex boolean logic and arithmetic, leading to hallucinations. Furthermore, the search space of bitwise operations (combinations of shifts, rotations, and logic gates) suffers from a severe combinatorial explosion. To overcome this computational intractability, we present a novel approach that abandons arithmetic logic entirely in favor of string similarity, structured search, and autonomous error recovery. Our core contributions are: 1. Bases and Truth Table Formulation: We reframe logic-gate deduction into a base-selection task, leveraging string similarity (minimal bit flips) to isolate primitive transformations (“bases”) and deduce truth tables without complex arithmetic. 2. Backtracking DFS and Error Recovery: We formalize a search process that tests candidate bases, detects logical collisions across examples, and backtracks upon failure to perform robust error recovery. 3. Bit Tokenization and Interactive Reasoning SFT: We force the tokenizer to encode binary strings as individual single-bit tokens. We use dynamic masking to simulate external oracle feedback, training the model to hypothesize, self-evaluate, and backtrack natively. Evaluated on bit manipulation puzzles, our approach achieved over 96% validation accuracy. This represents the highest performance in this category, driving our 7th Place overall finish in the contest.
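
The base-selection idea in the abstract above (scoring candidate primitive transformations by how few bit flips separate them from the observed output) can be illustrated with a small sketch. This is not the authors' implementation: the candidate set (identity, rotations, zero-filling shifts), the function names, and the tie-breaking rule are all assumptions made for illustration, and the backtracking search and truth-table deduction are left out.

```python
from typing import Dict, List, Tuple


def rotate_left(bits: str, k: int) -> str:
    """Rotate a bit string left by k positions."""
    k %= len(bits)
    return bits[k:] + bits[:k]


def shift_left(bits: str, k: int) -> str:
    """Shift left by k positions, filling with zeros on the right."""
    k = min(k, len(bits))
    return bits[k:] + "0" * k


def shift_right(bits: str, k: int) -> str:
    """Shift right by k positions, filling with zeros on the left."""
    k = min(k, len(bits))
    return "0" * k + bits[: len(bits) - k]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_bases(bits: str) -> Dict[str, str]:
    """Hypothetical candidate set: identity plus all rotations and shifts."""
    bases = {"identity": bits}
    for k in range(1, len(bits)):
        bases[f"rotl{k}"] = rotate_left(bits, k)
        bases[f"shl{k}"] = shift_left(bits, k)
        bases[f"shr{k}"] = shift_right(bits, k)
    return bases


def rank_bases(examples: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank bases by total bit flips over all (input, output) examples."""
    totals: Dict[str, int] = {}
    for inp, out in examples:
        for name, value in candidate_bases(inp).items():
            totals[name] = totals.get(name, 0) + hamming(value, out)
    # Fewest flips first; ties are broken alphabetically (an arbitrary choice).
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


# Two examples generated by a hidden "rotate left by 3" rule.
EXAMPLES = [("10110010", "10010101"), ("01100001", "00001011")]
```

Calling `rank_bases(EXAMPLES)[0]` returns `("rotl3", 0)`: the rotation by three explains both examples with zero flips, while every other candidate needs at least one.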

[AI-4] TailorMind: Towards Preference-Aligned Multimodal Content Generation

Link: https://arxiv.org/abs/2606.23643
Authors: Hengji Zhou, Ye Liu, Yufeng Liu, Si Wu, Lianghao Xia, Liqiang Nie
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 13 figures, 6 tables. Code available at this https URL

Abstract:Personalized content systems depend on available UGC and struggle when suitable content is absent, delayed, or costly to create. Although multimodal generators can synthesize content on demand, how to translate behavioral traces into generation-ready preferences remains underexplored. We study personalized multimodal content generation: creating user-tailored multimodal content without existing item pools or waiting for matching UGC. We propose TailorMind, linking collaborative preference modeling with controllable multimodal generation. TailorMind enriches sparse user histories via hypergraph collaborative filtering and optimizes textual profiles with ranking-error feedback and textual gradient descent. Retrieval-augmented style control grounds outputs in authentic UGC patterns, while cross-modal cohesion reflection reduces semantic drift. We construct TailorBench, a benchmark from three mainstream platforms evaluated along five dimensions: coherence, novelty, aesthetic, hallucination, profiling. Experiments show that TailorMind achieves competitive or stronger coherence, improves novelty and aesthetic quality over representative generation baselines and ground-truth UGC, demonstrating advantages over retrieving available content or comparable UGC, while achieving up to 29% Recall gains in reranking. Our code is released at: this https URL.

[AI-5] Learning Process Rewards via Success Visitation Matching for Efficient RL

Link: https://arxiv.org/abs/2606.23640
Authors: Raymond Tsao, Andrew Wagenmaker, Sergey Levine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments:

Abstract:In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: a reward of 0 is given everywhere except when the task is completed, when a reward of +1 is given. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes, while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion, and, we show, provably achieves this without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning performance on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.

[AI-6] AI Exposure Scores: what they measure, what they miss, and what comes next

Link: https://arxiv.org/abs/2606.23633
Authors: Campbell Lund, Thomas Euyang, Zanele Munyikwa, Marzieh Fadaee
Categories: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: 19 pages, 4 figures

Abstract:A set of exposure scores calculated in 2023 has become a central empirical input to the future-of-work debate. Produced by Eloundou et al. (2023) and referred to here as the "GPTs are GPTs" scores, they define exposure as the share of occupational tasks a large language model can assist with. This work is a genuine methodological contribution, but as the scores travel from the time and place they were produced, the limitations the authors named do not always travel with them. Two gaps have widened as a result. The first is structural, between what static exposure scores measure and what policy questions actually require. Taking the diffusion of these scores as a case study, we show how their temporal, geographic, and ontological limitations compound in policy-facing analyses, and we survey five families of research responding to these limits: dynamic and benchmark-based measures, ensemble methods, task-framework extensions, worker-centered metrics, and adoption and usage data. The second gap is the one we argue needs more attention: the coordination between researchers and policymakers. Policy-relevant work that asks who is harmed, who benefits, how, and when continues to reference the static "GPTs are GPTs" scores without engaging with the methodological updates that would let these questions be answered more reliably. We then ask what additional steps towards navigating uncertainty remain: ex-post frameworks and the deliberate, political work of reimagining what futures are worth building towards. Closing the research-policy gap is a shared task: policymakers must widen their evidence base, engage workers as epistemic partners, and shift from prediction to preparedness; researchers must build data infrastructure, adopt participatory methods, and write with policymakers in mind. Better measurement matters, but it will not close the second gap alone.

[AI-7] AI-driven Optimisation of Quality of Recovery (QoR) in Remote Patient Monitoring

Link: https://arxiv.org/abs/2606.23631
Authors: Yansong Liu, Li-Hsi (Sonny) Lin, Pramit Khetrapal, Ronnie Stafford, John Kelly, Ivana Drobnjak
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Remote patient monitoring depends on patient-reported data to capture the subjective dimension of recovery that devices cannot measure. The Quality of Recovery (QoR-15) survey is the gold-standard instrument for this purpose. It was designed and validated for occasional in-hospital assessment, yet remote monitoring now administers it to patients daily. In our own post-surgical deployment, only 55% of patients submitted the survey on more than 14 of the 30 monitoring days. We developed QoR-compact, a five-item daily input for the RPM prediction pathway. Setting a deployment-driven target of one-third of the daily items, we exhaustively evaluated all 3,003 five-question subsets of the QoR-15 and tested whether the best of them matches the full instrument in predicting near-term postoperative recovery severity. QoR-compact achieves a mean AUC-ROC of 0.968 (95% CI 0.915-0.988), statistically comparable to the 0.964 baseline obtained with one-third of the items. Patient-level backtesting indicates that it tracks readmission events as faithfully as the full form. Its five items span the physical and psychological axes of recovery: Q3 (feeling rested), Q9 (feeling comfortable and in control), Q10 (general well-being), Q12 (severe pain), and Q14 (feeling worried or anxious). The QoR-15 remains the gold-standard measure of recovery; QoR-compact complements it as a shorter daily input designed for prediction. This parity provides the basis for a prospective study of whether a lighter daily input is, in turn, completed more consistently. External validation on larger cohorts is required before clinical use.
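
The figure of 3,003 subsets in the abstract above is the binomial coefficient C(15, 5), and the exhaustive search it describes can be sketched in a few lines. The sketch below is illustrative only: `toy_score` is a hypothetical placeholder for whatever cross-validated metric the authors actually computed (the abstract reports AUC-ROC), and the function names are assumptions.

```python
from itertools import combinations
from math import comb
from typing import Callable, List, Tuple

N_ITEMS = 15       # questions in the QoR-15 survey
SUBSET_SIZE = 5    # one-third of the items

# C(15, 5) = 3003 candidate five-question subsets.
EXPECTED_COUNT = comb(N_ITEMS, SUBSET_SIZE)


def all_subsets(n_items: int = N_ITEMS, k: int = SUBSET_SIZE) -> List[Tuple[int, ...]]:
    """Enumerate every k-item subset of questions numbered 1..n_items."""
    return list(combinations(range(1, n_items + 1), k))


def best_subset(score: Callable[[Tuple[int, ...]], float],
                n_items: int = N_ITEMS, k: int = SUBSET_SIZE) -> Tuple[int, ...]:
    """Return the subset that maximizes the supplied scoring function."""
    return max(combinations(range(1, n_items + 1), k), key=score)


def toy_score(subset: Tuple[int, ...]) -> float:
    """Hypothetical stand-in for a model-based metric such as AUC-ROC.

    It only counts overlap with the items reported in the abstract, so it
    demonstrates the search loop and says nothing about real predictive value.
    """
    reported = {3, 9, 10, 12, 14}
    return float(len(reported & set(subset)))
```

With the placeholder score, `best_subset(toy_score)` returns `(3, 9, 10, 12, 14)`. In a real study the scoring function would train and validate a model per subset, which stays tractable because there are only 3,003 candidates.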

[AI-8] DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

Link: https://arxiv.org/abs/2606.23626
Authors: Yuanming Yang, Guoqing Ma, Bo Wang, Yuan Zhang, Wei Tang, Chenyi Li, Haoyang Huang, Nan Duan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Can representations learned for image generation also support the evaluation of generated images? We study text-to-image reward prediction as a downstream task of generative representation learning. To this end, we introduce DiT-Reward, which converts a pretrained text-to-image Diffusion Transformer into a reward model by processing near-clean image latents and aggregating text-conditioned image representations across transformer layers. Under the same training data mixture as HPSv3, DiT-Reward outperforms HPSv3 on all four evaluated preference benchmarks, reaching 85.6% on HPDv2 and 77.6% on HPDv3. When the generative backbone is frozen, a lightweight learned head can still extract meaningful preference predictions from its representations. Probing across depth further reveals that downstream reward performance is strongest in the middle-to-late layers and benefits from combining representations across different stages. We also observe consistent positive scaling with generative backbone capacity. Finally, when used to optimize Stable Diffusion 3.5 Large with Flow-GRPO, DiT-Reward outperforms HPSv3 along the matched training trajectory, with particularly clear gains in realism. Direct latent scoring also achieves a 1.65x inference speedup over HPSv3 with comparable peak memory. These results show that pretrained generative DiTs provide transferable representations for reward modeling and policy optimization.

[AI-9] RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.23617
Authors: Ulas Berk Karli, Tesca Fitzgerald
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language-Action (VLA) models are commonly fine-tuned through passive imitation learning, where additional demonstrations are collected for tasks where the policy performs poorly. This approach incurs several downsides: it requires the robot to fail before data collection is triggered, provides little guidance about which states require supervision, and wastes demonstrator effort on redundant parts of the task where the policy already performs well. In this paper, we propose an active, continual learning paradigm for VLAs. We demonstrate that active, uncertainty-guided data collection leads to more efficient fine-tuning than when using passively-collected demonstrations. However, we also find that fine-tuning only on actively-collected recovery data leads to catastrophic forgetting. We evaluate techniques for continual learning, including replay-based data mixing and elastic weight consolidation, and identify tradeoffs between plasticity to uncertainty-guided recovery data and retention of previously learned behaviors. Overall, our work contributes an empirical study of active continual learning for autoregressive VLAs, establishing that uncertainty-guided recovery demonstrations can improve adaptation efficiency while also revealing open challenges when targeted new data is incorporated into large robot policies.

[AI-10] Causal Discovery in the Era of Agents

Link: https://arxiv.org/abs/2606.23608
Authors: Yujia Zheng, Vishal Verma, Mantej Gill, Haoyue Dai, Peter Spirtes, Kun Zhang
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE); Applications (stat.AP)
Comments: Platform is available at this http URL

Abstract:Recent attempts to combine large language models (LLMs) with causal discovery ask models to infer pairwise directions, propose graph structures, or inject language-model outputs as priors and constraints. These approaches promise faster analysis, but they also obscure whether causal evidence is supported by data and assumptions or by textual associations, prompt artifacts, and hallucinated mechanisms. We argue for a different role for agents in causal discovery. Agents should inspect data, retrieve context, explain method assumptions, and clarify graph outputs, but they should not supply edges, orientations, priors, constraints, or causal conclusions. We propose the principle that agents assist the workflow, while causal claims remain grounded in data, explicit assumptions, formal algorithms, diagnostics, and user or domain-expert decisions. We instantiate this principle in causal-learn+, an online platform that coordinates data analysis, preprocessing, method recommendation, expert-knowledge incorporation, formal discovery, and interpretation around the algorithmic ecosystem of causal-learn. A case study on Big Five personality data illustrates an agent-assisted causal discovery pipeline that does not turn language-model unreliability into causal evidence. The platform is available at this http URL.

[AI-11] Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

Link: https://arxiv.org/abs/2606.23607
Authors: Tianyi Li, Zhiqiang Shen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Linear mode connectivity (LMC) provides a promising foundation for understanding and merging independently trained neural networks, but existing methods typically optimize the interpolation path from only one model endpoint, limiting their scalability and effectiveness for large pretrained transformers. We propose a novel and scalable framework for extending LMC-based model merging to billion-parameter pretrained transformers. Our method applies properly parameterized functionality-preserving weight transformations to align functionally equivalent solutions, and introduces a dual learning procedure in which both models jointly learn their corresponding transformations toward a shared linear interpolation path. This bidirectional optimization substantially reduces interpolation barriers and enables more reliable merging across large-scale architectures. Empirically, we show that our approach achieves near-zero loss barriers on WikiText for language models with medium-sized parameters, representing, to our knowledge, the first demonstration of near-barrier-free linear connectivity at this scale. In the vision domain, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path, while modern billion-parameter LLMs exhibit only small loss barriers. These results suggest that properly resolving parameter symmetries enables large pretrained Transformers to be connected and merged through simple linear paths with substantially improved interpolation performance. Code: this https URL.
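
To make the terms "functionality-preserving weight transformation" and "loss barrier" in the abstract above concrete, here is a toy sketch on a two-unit ReLU network. It uses plain hidden-unit permutation, a well-known symmetry, as a stand-in; the paper's parameterized transformations and dual learning procedure are not reproduced here. The barrier definition (the largest excess of the interpolated loss over the linear interpolation of the endpoint losses) is the common one from the LMC literature, and all names and numbers are illustrative assumptions.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(params, x):
    """Two-layer network: W2 @ relu(W1 @ x + b1)."""
    W1, b1, W2 = params
    hidden = relu([z + b for z, b in zip(matvec(W1, x), b1)])
    return matvec(W2, hidden)


def permute_hidden(params, perm):
    """Reorder hidden units; the network function is unchanged."""
    W1, b1, W2 = params
    return ([W1[i] for i in perm],
            [b1[i] for i in perm],
            [[row[i] for i in perm] for row in W2])


def lerp(a, b, alpha):
    """Linear interpolation applied recursively to nested lists/tuples."""
    if isinstance(a, (list, tuple)):
        return type(a)(lerp(x, y, alpha) for x, y in zip(a, b))
    return (1 - alpha) * a + alpha * b


def mse(params, data):
    return sum((mlp(params, x)[0] - y) ** 2 for x, y in data) / len(data)


def loss_barrier(params_a, params_b, data, steps=11):
    """Max excess loss along the straight line between two parameter sets."""
    loss_a, loss_b = mse(params_a, data), mse(params_b, data)
    gaps = []
    for i in range(steps):
        alpha = i / (steps - 1)
        mid = lerp(params_a, params_b, alpha)
        gaps.append(mse(mid, data) - ((1 - alpha) * loss_a + alpha * loss_b))
    return max(gaps)


# Toy model computing |x| = relu(x) + relu(-x), and a permuted copy of it.
MODEL_A = ([[1.0], [-1.0]], [0.0, 0.0], [[1.0, 1.0]])
MODEL_B = permute_hidden(MODEL_A, [1, 0])
DATA = [([-1.0], 1.0), ([1.0], 1.0)]
```

`MODEL_A` and `MODEL_B` compute the same function, yet `loss_barrier(MODEL_A, MODEL_B, DATA)` is 1.0 because the midpoint weights cancel out. Undoing the permutation with `permute_hidden(MODEL_B, [1, 0])` recovers `MODEL_A` and removes the barrier, which is the intuition behind resolving parameter symmetries before merging.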

[AI-12] Against Proxy Optimization

Link: https://arxiv.org/abs/2606.23597
Authors: Sven Neth
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:I discuss conditions under which maximizing a proxy utility function is harmful and suggest this poses problems for applying decision theory.

[AI-13] SPIRAL: Learning to Search and Aggregate

Link: https://arxiv.org/abs/2606.23595
Authors: Jubayer Ibn Hamid, Ifdita Hasan Orney, Michael Y. Li, Omar Shaikh, Yoonho Lee, Dorsa Sadigh, Chelsea Finn, Noah Goodman
Categories: Artificial Intelligence (cs.AI)
Comments: Ongoing Work

Abstract:Language model reasoning can be substantially improved at test time via scaffolds that scale inference compute across different primitives: sequential reasoning within a trace, independently sampled parallel traces, and aggregation of multiple reasoning traces into a final response. During post-training, however, language models are optimized only for sequential reasoning within a single trace. We introduce Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL), a framework in which a language model is trained to use all three primitives as part of a unified inference compute pipeline. Concretely, the language model first samples a set of independent traces in parallel, each produced through sequential chain-of-thought reasoning, and then generates a final aggregation trace conditioned on those traces; all components are optimized end-to-end against the reward of the final aggregated response. To train this system, SPIRAL uses set reinforcement learning to teach models to produce a set of traces that are collectively useful for an aggregator and standard reinforcement learning to teach models to aggregate the set into improved final responses. Our experiments on reasoning tasks show that SPIRAL effectively scales with inference compute, outperforming GRPO by up to 11× scaling efficiency and 15% higher performance when all three compute primitives are scaled.

[AI-14] The Topology of Ill-Posed Questions: Persistent Homology for Detection and Steering in LLMs

Link: https://arxiv.org/abs/2606.23590
Authors: Guangyu Jiang, Sizhe Tang, Mahdi Imani, Tian Lan
Subjects: Artificial Intelligence (cs.AI)

Abstract:Ill-posed questions, including ambiguous, underspecified, or contradictory queries, may admit no valid answer or multiple plausible answers, posing a challenge for large language models (LLMs). Existing approaches largely analyze ill-posedness through model outputs and often focus on specific subclasses. We investigate whether diverse sources of ill-posedness can be represented within a unified topology of LLM internal states and whether this structure can be used to steer response behavior. We model the contextual hidden states of prompt tokens at each transformer layer as a point cloud and characterize its geometry using finite zero-dimensional persistent homology. Each layer is summarized by three compact descriptors: mean finite lifetime, normalized lifetime entropy, and largest-lifetime concentration. Concatenating these descriptors across layers yields a topology representation of the question. We further introduce topology-conditioned activation steering, which retrieves topologically similar examples and constructs query-specific activation interventions that encourage source-aware clarification or abstention. Across three open-weight LLMs, topology features consistently outperform prompt-based and pooled-hidden-state baselines for ill-posedness classification, improving average accuracy from 67.4% to 78.9% on AmbigQA, from 79.9% to 88.5% on SituatedQA, and from 57.6% to 69.6% on CLAMBER 9-way classification. Topology-conditioned steering increases the average total acceptable response rate from 61.4% to 70.6% and grounded acceptable responses from 11.9% to 16.4%. These results show that persistent homology provides both an interpretable representation of ill-posedness and an effective mechanism for targeted response steering.
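
The per-layer descriptors named in the abstract can be illustrated with a small sketch. It relies on a standard fact: for a Vietoris-Rips filtration, the finite zero-dimensional lifetimes equal the edge lengths of a Euclidean minimum spanning tree. The function names and the exact formulas for the normalized entropy and the largest-lifetime concentration are assumptions made for illustration; the abstract does not define them, and this is not the authors' code.

```python
import math

def finite_h0_lifetimes(points):
    """Finite 0-dimensional persistence lifetimes of a Vietoris-Rips filtration.

    Every connected component is born at scale 0 and dies when it merges into
    another one, so the finite lifetimes equal the edge lengths of a Euclidean
    minimum spanning tree (built here with Prim's algorithm).
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    best = [math.dist(points[0], p) for p in points]  # cheapest link to the tree
    lifetimes = []
    for _ in range(n - 1):
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        lifetimes.append(best[j])
        in_tree[j] = True
        for i in range(n):
            best[i] = min(best[i], math.dist(points[j], points[i]))
    return lifetimes

def layer_descriptors(points):
    """Return (mean lifetime, normalized entropy, largest-lifetime share).

    The entropy is taken over lifetimes normalized to sum to 1 and divided by
    log(number of lifetimes); the share is max lifetime / total lifetime.
    Both definitions are assumptions, not taken from the paper.
    """
    life = finite_h0_lifetimes(points)
    total = sum(life)
    if total == 0.0:  # fewer than two distinct points
        return (0.0, 0.0, 0.0)
    probs = [v / total for v in life if v > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    norm_entropy = entropy / math.log(len(life)) if len(life) > 1 else 0.0
    return (total / len(life), norm_entropy, max(life) / total)
```

Given one point cloud of token hidden states per layer, a question-level feature vector in the spirit of the abstract would be the concatenation of `layer_descriptors(layer)` over all layers.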

[AI-15] A Generative Model for Closed-Loop Microsimulation of Signalized Intersections

Link: https://arxiv.org/abs/2606.23588
Authors: Yash Ranjan, Rahul Sengupta, Anand Rangarajan, Sanjay Ranka
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Traffic microsimulators rely on hand-crafted behavior models that reproduce aggregate flow but miss the heterogeneous interactions between vehicles at signalized intersections. Learned trajectory predictors capture richer interactions but are short-horizon and tend to be unstable when run in closed loop. We present Enactor, an actor-centric generative model for closed-loop intersection microsimulation. The model focuses on vehicles; pedestrians are included as context that can influence vehicle decisions but not predicted. Dynamic actors and lane polylines are encoded in polar coordinates referenced to the intersection center. A transformer with separate spatial and temporal attention blocks predicts a distribution over each actor’s next-step motion (s, α). Training uses a closed-loop curriculum so the model is exposed to its own predictions. We evaluate Enactor in two regimes. In a 4000-second simulation-in-the-loop test at two intersection geometries, Enactor controls every dynamic vehicle against a continuously refreshing actor set rather than the fixed cohort that learned trajectory predictors are usually evaluated against. It recovers the SUMO data generator’s speed and travel-time distributions with KL divergence over an order of magnitude lower than a recent transformer baseline on travel time, and substantially lower on speed (roughly 5× lower at Site 1), and reduces red-light violations relative to the same baseline by more than an order of magnitude. An ablation isolates the leader rear-bumper feature as the change with the largest effect on intersection-aware safety metrics. We also evaluate on real-world field data and apply the same architecture to naturalistic vehicle trajectories from a fish-eye camera at a signalized intersection and evaluate it on multi-horizon predictive tasks. Enactor outperforms a constant-velocity baseline at every horizon evaluated.

[AI-16] Solve for the Hyperparameter, Skip the Search: Kolmogorov-Optimal Scaling Laws for Spline Regression

Link: https://arxiv.org/abs/2606.23575
Authors: Yong Yi Bay, Kathleen A. Yearick
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 49 pages, 26 figures, 12 tables. Code: this https URL

Abstract:Hyperparameter tuning almost always means search: fit the model at every value on a grid, score each by cross-validation, and keep the winner. For spline regression that search is unnecessary. The optimal resolution can be solved for in closed form, to the accuracy an exhaustive search reaches, at a fraction of the compute. Three ingredients make this possible: classical approximation theory pins the squared bias to a known power of the resolution G, exactly the Kolmogorov n-width of the smoothness class; the basis dimension is an explicit polynomial in G; and leave-one-out error follows from a single fit via the PRESS identity. Balancing the two known curves gives the minimizer analytically. We extend this calculus to many coordinates by replacing ambient input dimension with interaction order, the number of active low-order components in an ANOVA decomposition, yielding a scaling law in which the optimal resolution and error are power functions of the effective density (sample size per active component), with input dimension absent from the exponent. The law becomes an algorithm. KORE (Kolmogorov-optimal Order-aware Resolution Estimation) fits two pilot resolutions, solves a leverage-calibrated 2x2 system for the bias and noise scales, and evaluates the closed-form plug-in resolution with a tiny leave-one-out certificate: about a dozen fits instead of a full grid sweep, with a consistency guarantee as the sample grows. Across additive and sparse pairwise targets up to 80 input dimensions, KORE matches exhaustive 3-fold cross-validation and the full classical ladder (GCV, Mallows’ Cp, AIC, BIC) while fitting roughly 8x fewer models; on 36 real tabular datasets it ranks first among 21 methods in accuracy per unit of compute, ahead of tuned boosters and kernel machines. When complexity lives in low interaction order, solving for the resolution beats searching for it.
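
The PRESS identity mentioned in the abstract can be checked numerically. For an ordinary least-squares fit, the leave-one-out residual at point i equals e_i / (1 - h_ii), where e_i is the in-sample residual and h_ii the leverage, so a single fit gives all n leave-one-out errors. The sketch below uses a straight-line fit as a stand-in for the spline bases in the paper; the function names are hypothetical and this is not the KORE algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

def press_residuals(xs, ys):
    """Leave-one-out residuals from a single fit: e_i / (1 - h_ii)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a, b = fit_line(xs, ys)
    out = []
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx  # diagonal of the hat matrix
        out.append((y - (a + b * x)) / (1.0 - leverage))
    return out

def brute_force_loo_residuals(xs, ys):
    """Reference computation: refit n times, leaving one point out each time."""
    out = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        out.append(ys[i] - (a + b * xs[i]))
    return out
```

The two functions agree up to floating-point error, which is why a leave-one-out score costs one fit per resolution rather than n.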

[AI-17] Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models

Link: https://arxiv.org/abs/2606.23567
Authors: Jiawei Xu, Minghui Liu, Aakriti Agrawal, Yifan Chen, Furong Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Masked diffusion language models decode by iteratively unmasking tokens, where the unmasking order defines an “order of thought” that strongly influences generation quality yet is typically chosen heuristically. We derive a tractable upper bound on the sequential decoding mismatch, measured by the Kullback-Leibler divergence and expressed in terms of the model’s pathwise log-likelihood, with tightness under sufficient model expressivity. This bound induces a dense self-aware reward over ordered trajectories, casting order selection as a principled policy optimization problem with a frozen denoiser. We instantiate this idea as Self-Aware Scheduling (SAS), which learns a lightweight order policy using Group Relative Policy Optimization and applies seamlessly to both any-order and semi-autoregressive decoding. On Sudoku with 1B MDM, SAS improves puzzle accuracy from 82.0% (best heuristic schedule) to 91.8%, and reaches 97.5% with second-stage fine-tuning along learned trajectories. On mathematical reasoning with LLaDA-8B, SAS improves pass@1 on GSM8K from 64% to 76% and on MBPP from 39.5% to 41%, consistently matching or exceeding heuristic schedules across generation lengths and block sizes. Project page: this https URL

[AI-18] SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

Link: https://arxiv.org/abs/2606.23537
Authors: Yizhang Zhu, Zhangyang Peng, Boyan Li, Yuyu Luo
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Text-to-SQL enables users to access relational databases via natural language, but real-world settings remain challenging due to coordinated reasoning over complex database environments. Existing systems often use multi-stage pipelines or reasoning models specialized for individual stages. However, fixed pipelines rely on predefined stage orders, limiting their adaptivity to query demands and intermediate evidence. Recent orchestration-based methods provide flexibility by composing specialized modules for each query, but typical plan-then-execute approaches still commit to a complete workflow before execution and cannot adapt to intermediate artifacts and feedback. In this paper, we propose SQLConductor, a step-wise orchestration learning framework for Text-to-SQL. SQLConductor formulates Text-to-SQL subtasks as specialized actions for workflow composition and trains a policy model to select the next action based on intermediate artifacts and feedback. To learn this policy, SQLConductor introduces Search-to-Policy Learning, which uses Monte Carlo Tree Search to explore candidate workflows and stability estimation to identify robust supervision. The policy model is trained with Stability-weighted Supervised Fine-tuning to prioritize high-quality orchestration patterns and further enhanced through Curriculum Reinforcement Learning. This transforms offline workflow search into a deployable policy for step-wise orchestration at inference time. Experiments on BIRD-Dev and out-of-distribution datasets show that SQLConductor achieves superior execution accuracy and strong generalization, reaching 73.2% EX on BIRD-Dev with a compact orchestration policy coordinating frozen larger action models, outperforming prior methods that directly train comparable or larger Text-to-SQL backbones. Further analyses show that the learned policy adapts orchestration to diverse query demands. 

[AI-19] POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation

Link: https://arxiv.org/abs/2606.23533
Authors: Hung Phan, Aniroop Naladala, Dubey Avanindra, Supryia Chinthavali, Lunga Dalton, Ali Jannesari
Subjects: Artificial Intelligence (cs.AI)
Comments: Data Science paper

Abstract:Recent large language models (LLMs) are good at general text generation, but it is still hard to use them for domain-specific data generation because the output must follow strict formatting and structural rules. Unlike open-ended tasks such as question answering or translation, domain-specific generation must be both semantically correct and compliant with existing guidelines and standards. In this work, we study the nationwide interoperability problem of utility power outage reports in the United States. In practice, outage reports need to be machine-readable (e.g., JSON or XML) and must strictly follow requirements from energy-sector regulatory bodies. To address this problem, we propose POTracker, an optimized LLM for power outage report generation. We fine-tune Qwen2.5-7B-Instruct using our proposed objective. The key contribution is a new loss function, POTrackerLoss, that considers both textual similarity and structural (tag) similarity between the generated report and the ground-truth report. We evaluate POTracker on a dataset of 1,000 power outage reports and compare it with five well-known fine-tuning methods and one rule-based XML conversion method. Results show that POTracker outperforms other fine-tuning approaches, improving overall accuracy by up to 51% and reaching 86.47% structural accuracy for generated power outage reports. In addition, we conduct a human study to assess the quality of the ground-truth standard reports, where domain experts assign the generated labels an average score of 4.03 on a 0–5 scale.

[AI-20] DVL-DeepONet: A Physics-Guided Operator Learning for Resilient Underwater Navigation

Link: https://arxiv.org/abs/2606.23502
Authors: Arup Kumar Sahoo, Itzik Klein
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures

Abstract:Autonomous Underwater Vehicles (AUVs) rely heavily on the fusion of inertial sensors and Doppler velocity logs (DVLs) for navigation. In standard autonomous navigation systems, the DVL measures four beam velocities, thereby enabling the estimation of the AUV velocity vector. However, during real-world missions, the DVL may receive noisy or incomplete beam measurements due to marine obstacles, seabed reflections, or environmental disturbances. Furthermore, some low-cost underwater platforms operate without inertial sensors to reduce system complexity and cost. In such cases, reliable estimation of the AUV velocity vector in real-world missing beam scenarios becomes challenging, leading to degraded navigation solutions. To circumvent these challenges and enable resilient underwater navigation, we propose DVL-DeepONet, a physics-guided deep neural operator framework along with three variants. The proposed models are designed to estimate DVL-based velocity information under multiple operational scenarios, including (i) noise-resilient estimation in coupled inertial/DVL measurements, (ii) DVL-only learning, and (iii) beam measurement recovery. By learning a nonlinear operator that maps temporal inertial/DVL observations directly to vehicle velocity while enforcing DVL measurement physics through a consistency constraint, the proposed approach enables robust velocity estimation even under degraded sensing conditions. The proposed framework is validated using real-world AUV experiments, comprising a cumulative path length of approximately 10,000 m. Experimental results demonstrate that the proposed DVL-DeepONet architectures outperform baseline model-based approaches and learning-based algorithms by 40%.

[AI-21] CADRE: Stable Parameter Efficient Adaptation of Medical Vision Language Models with Bounded Forgetting and Prior Drift

Link: https://arxiv.org/abs/2606.23487
Authors: Amrita Singh, Rishabh Jha
Subjects: Artificial Intelligence (cs.AI)

Abstract:Medical vision-language models (VLMs) such as BiomedCLIP generalize broadly, but adapting them to a clinical service is as much a safety problem as an accuracy one. Updating a deployed model for a new imaging modality can fail silently in two ways that harm patients: it can forget modalities it already handled (catastrophic forgetting), and it can drift from its trustworthy pretrained prior toward modality-specific shortcuts. We study parameter-efficient continual adaptation through these two properties rather than leaderboard accuracy, presenting CADRE: a frozen-backbone framework combining low-rank adaptation (LoRA) with an online, self-scaling, similarity-aware elastic weight consolidation term that bounds retained-competence loss, and an anchor-to-prior penalty bounding embedding drift from the frozen prior. Two short guarantees, a bound on total consolidation mass and a scale-invariance property, remove the scale-related sources of vanilla EWC’s order fragility. Using breast cancer across three maximally dissimilar modalities (histopathology, ultrasound, chest radiography) as a controlled cross-modality stress test, under a multi-seed, multi-order protocol with paired significance testing and training approximately 0.23% of parameters, CADRE attains the highest accuracy, SPQ, and backward transfer and the lowest forgetting among adapting methods, reducing forgetting roughly sevenfold versus the strongest regularized baseline (0.075 to 0.011; paired p=0.023) and achieving positive backward transfer where every baseline is negative. We frame these as stability properties aligned with clinical-safety desiderata, not a deployment guarantee; robustness to distribution shift and adversarial inputs is out of scope.

[AI-22] AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient, and Secure Interaction

Link: https://arxiv.org/abs/2606.23449
Authors: Shanhui Zhao, Jiacheng Liu, Guohong Liu, Jichao Yan, Jialei Ye, Yuhao Yang, Hao Wen, Shizuo Tian, Yizhen Yuan, Yuxuan Chen, Yunxin Liu, Ju Ren, Ya-Qin Zhang, Chao Huang, Yao Guo, Yuanchun Li
Subjects: Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments: 17 pages, 3 figures

Abstract:AI agents are driving a new software paradigm, with the ability to autonomously call tools, extract information, manage memory, and complete tasks that span applications and data sources. Most existing end-user operating systems, however, are designed for application-centric workflows and offer little native support for AI agents. This mismatch limits the wider adoption of agents and leads to execution overhead and safety risks when running agents on conventional systems. While the concept of agent-native operating systems is emerging, the research community lacks an open testbed to explore the architectural primitives desired for agent-mediated interaction. We present AOHP (Android Open Harness Project), an OS-level agent harness built on the Android Open Source Project (AOSP). The core design principle of AOHP is to treat agents as first-class OS actors, enabling adaptive user interfaces and agent-friendly runtime environments. AOHP preserves the mature Android software and hardware ecosystem while introducing three agent-oriented system mechanisms: personalized service composition, efficient agent interfaces, and secure information flow. Based on preliminary experiments on challenging tasks covering key capabilities of OS agents, AOHP shows clear advantages in task completion (+21.12% completion rate), execution cost (-51.55% token cost), and security-policy compliance.

[AI-23] What Does a Chemical Language Model Know About Molecules?

Link: https://arxiv.org/abs/2606.23443
Authors: Christian Kenneth, Etowah Adams, Liam Bai, Gerard JP van Westen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)

Abstract:Chemical language models (cLMs) are widely assumed to learn surface-level syntactic patterns rather than learning meaningful molecular semantics. Here, we apply sparse autoencoders (SAEs) to MolFormer, an encoder-only cLM, to mechanistically examine how molecular representations are built across layers. We discover that early layers rely on position-tracking latents to parse molecular grammar, while later layers encode atom-in-substructure and pharmacologically relevant features. Additionally, we show that non-canonical SMILES produce more disruptive representation shifts than invalid SMILES, driven by position-latent disruption propagating across layers. To support further exploration, we develop InterMol, an interactive visualizer for SAE activations on molecular strings and structures.

[AI-24] Cross-Architectural Mixture-of-Experts with Adaptive Soft Routing for Plant Leaf Disease Classification

Link: https://arxiv.org/abs/2606.23441
Authors: Phi-Hung Hoang, Thi-Thu-Hong Phan
Subjects: Artificial Intelligence (cs.AI)
Comments: 42 pages, 17 figures, 8 tables

Abstract:Plant leaf disease classification is crucial for crop protection and precision agriculture but remains challenging under complex backgrounds, illumination variations, and severe class imbalance. Moreover, single-architecture models often fail to effectively capture both local and global representations. To address these challenges, this study proposes an adaptive soft Mixture-of-Experts (MoE) framework with cross-architectural routing that integrates EfficientNet-B0, DenseNet-121, and Swin-Tiny to exploit complementary multi-scale, local, and global features. A soft gating mechanism dynamically assigns input-dependent expert weights, while a two-stage refinement training strategy improves optimization stability and generalization. Experiments on a highly imbalanced potato leaf disease dataset achieve 91.68% recall and 92.62% F1-score, surpassing the strongest individual expert by 5.91% and 5.03%, respectively. Additional evaluations on durian and sesame leaf disease datasets yield F1-scores of 94.03% and 97.04%, demonstrating robust cross-dataset generalization and the potential of the proposed framework for reliable real-world crop health monitoring.
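
As a rough illustration of soft gating, the sketch below blends the class probabilities of several experts using softmax weights computed from per-input gate logits. Mixing at the probability level and the function names are assumptions; the abstract does not say whether the framework fuses features or predictions, and the gate network that would produce the logits is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_moe_predict(expert_probs, gate_logits):
    """Blend per-expert class probabilities with input-dependent gate weights.

    expert_probs: one probability vector per expert (same number of classes).
    gate_logits:  one score per expert for this input (hypothetical gate output).
    """
    weights = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]
```

Because every expert receives a non-zero weight, this differs from hard top-k routing, where only the selected experts contribute.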

[AI-25] GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

Link: https://arxiv.org/abs/2606.23419
Authors: Jette Oberländer, Jan Finkbeiner, Catherine M. Schöfmann, Emre Neftci
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Autoregressive decoding with LLMs is primarily bottlenecked by GPU memory bandwidth, especially in edge-computing settings. While quantization is essential for mitigating this bottleneck, most existing methods treat inference as a uniform process and fail to account for the asymmetry between the compute-bound prefill stage and the memory-bound decoding stage. We propose GRINQH (GRaded INput-based Quantization Hierarchy), a weight-only post-training quantization framework that accelerates decoding by unifying quantization and sparsification. GRINQH leverages activation magnitudes as a proxy for computational importance to dynamically assign weight channels to different precision levels, enabling flexible average bit widths during decoding. Evaluated on Llama3 and Qwen3 models, GRINQH outperforms state-of-the-art fixed- and mixed-precision baselines at comparable 3- and 4-bit settings, even enabling effective 2-bit generation. We experimentally verify theoretical speedups by leveraging a hierarchical nested memory layout for multi-precision storage in a custom GPU kernel. Ultimately, GRINQH establishes a new state-of-the-art Pareto frontier for LLM generation, enabling a dynamic trade-off between generation quality and inference speed.

[AI-26] Digital Humanism and Evolutionary Design

Link: https://arxiv.org/abs/2606.23417
Authors: Wolfgang Höhl
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages, 5 figures

Abstract:This paper examines the two concepts of digital humanism and evolutionary design. The aim is to identify and highlight potential common structures, synergies, and challenges. How should and can technical systems be designed, and what implications does this have for the design of our environment? In light of the current debate surrounding artificial intelligence, this paper aims to serve as a preliminary study to help better understand the two concepts of digital humanism and evolutionary design within the context of human-centered technological development. Following a brief introduction, the two concepts of Digital Humanism and Evolutionary Design are presented and graphically visualized. The terms of freedom and responsibility in human decision-making, conviviality, and subjectivity are discussed, along with examples illustrating the distinction between human and artificial intelligence (Turing Test and Chinese Room). The various concepts of evolutionary design (e.g., co-evolutionary or sustainable software development, clean code, or green IT) and Gilbert Simondon’s concept of the “open machine” are introduced. The interdependencies between functional specialization and open technology development are highlighted. Both concepts share similar structures. In joint cooperation, they can lead to positive effects and mutual synergies. Significant differences lie in the areas of autonomy and determination in decision-making, as well as in genuine and simulated subjectivity. Open technology development is also currently suffering from the functional specialization of software and AI applications due to a purely market- and consumer-oriented approach. Even optimizations for energy efficiency in sustainable software development lead to greater specialization and thus also have a detrimental effect on open and quality-oriented technology development.

[AI-27] Detecting Malicious Agent Skills in the Wild using Attention

Link: https://arxiv.org/abs/2606.23416
Authors: Bacem Etteib, Daniele Lunghi, Tégawendé F. Bissyandé
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:LLM agents increasingly load skills, file-based packages of natural-language instructions written by third parties and distributed through marketplaces, that execute with the user’s privileges. A single malicious skill can exfiltrate data, hijack the agent, or persist as a supply-chain foothold, which turns the skill marketplace into a new attack surface for agentic systems. Prompt-injection defenses do not carry over to this setting. They rely on a boundary between trusted instructions and untrusted data, whereas a skill is itself a body of instructions, so an injected command sits among many legitimate ones and inherits their authority. We present Locate-and-Judge, a two-stage detector designed for this regime. A lightweight locator scores the structural spans of a skill by the instruction-following attention each span draws and retains only the top-K. A judge then examines the retained spans in detail. Concentrating the costly judgment on a few high-attention spans lets the detector audit an entire marketplace instead of a sample. Compared to direct LLM-based scanning, this approach offers an order-of-magnitude cost reduction, dramatically increasing its scalability at a small cost to recall, and it dominates keyword and regex baselines at comparable expense. Deployed at marketplace scale and at negligible cost, Locate-and-Judge flags skills with high precision, the majority of which we manually confirmed as malicious, surfacing dozens of live malicious skills, including several disguised as benign functionality and many that SkillSpector and Cisco Skill Scanner fail to detect. We release the resulting labeled dataset.

[AI-28] HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

Link: https://arxiv.org/abs/2606.23406
Authors: Yuval Domb, Hadar Sackstein, Tomer Solberg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. Across a suite of self-contained experiments (Table 1), HyperQuant outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. Beyond the LLM setting, HyperQuant quantizes the 19B-parameter LTX-2 DiT video model with no observable per-frame artifacts. End-to-end on an H100 at 4 bps, HyperQuant compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality. HyperQuant combines four known ideas into a single construction: (i) a per-tile Randomized Hadamard Transform that makes the per-coordinate distribution of weights and activations approximately Gaussian; (ii) quantization to a low-dimensional optimal lattice (E8, D4, A2, or Z); (iii) lossless bit-stripping and near-entropy-optimal variable-length Rice coding of the lattice indices; and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics. We further integrate the pipeline with 8-bit and 4-bit Tensor-Core MMA paths (fp8-e4m3, int8, nvfp4, mxfp4), and find that int8 beats fp8 on the post-RHT lattice output. Project page: this https URL
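
Ingredient (iii) relies on Rice coding, which can be sketched in a few lines. A Rice code with parameter k writes the quotient of a non-negative integer by 2^k in unary and the remainder in k fixed bits, so small values get short codewords. The zigzag mapping for signed indices and the bit-string representation are assumptions made for readability; the abstract does not describe how HyperQuant lays out its indices, and this is not its implementation.

```python
def zigzag(v):
    """Map signed ints to non-negative ints: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * v if v >= 0 else -2 * v - 1

def unzigzag(u):
    """Inverse of zigzag."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

def rice_encode(values, k):
    """Rice code with parameter k: unary quotient, then a k-bit remainder."""
    bits = []
    for v in values:
        u = zigzag(v)
        q, r = u >> k, u & ((1 << k) - 1)
        bits.append("1" * q + "0")  # unary part, terminated by a single 0
        if k:
            bits.append(format(r, "0{}b".format(k)))
    return "".join(bits)

def rice_decode(bits, k, count):
    """Decode `count` values from a bit string produced by rice_encode."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the terminating "0"
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((q << k) | r))
    return values
```

The code is lossless for any k; the choice of k only changes the length of the output, and it is shortest when k matches the typical magnitude of the values.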

[AI-29] Litmus: Zero-Label Code-Driven Metric Specification for Evaluating AI Systems

Link: https://arxiv.org/abs/2606.23403
Authors: Prajjwal Gupta, Prasang Gupta, Vishal Bhutani, Apoorva Sharma, Sumanth Chundru, Waqar Sarguroh, Kevin Paul
Subjects: Artificial Intelligence (cs.AI)
Comments: 22 pages, 4 figures

Abstract:As agentic LLM systems move from prototypes to deployment across increasingly diverse domains, evaluating them has become both more important and more difficult. The challenge is not only that individual metrics may be unreliable, but that evaluation goals are often left implicit. Without a clear account of what a system is expected to do, how it can fail, and which failures matter, metric choices become difficult to justify, interpret, or validate. We present Litmus, a zero-label system that designs evaluation and monitoring metrics for AI pipelines by eliciting evaluation intent from source code and targeted interrogation. Instead of assuming that the evaluation target is already known, Litmus first identifies what must be measured and why, then converts those answers into constraints for constructing a justified, per-stage metric portfolio. We evaluate Litmus on three real, code-defined AI pipelines - financial account grouping, scientific QA, and inherent risk assessment - against AutoMetrics and three DynamicRubric baselines. Litmus achieves the broadest or tied-broadest concern coverage, spans more pipeline stages, produces a near-zero-redundancy portfolio, and ranks first in validity against per-row quality labels on all three pipelines - decisively on scientific QA (Spearman ρ = 0.72 vs. less than 0.47 for every baseline), and within overlapping confidence intervals in relation to two components of the audit framework despite using no labels during metric design. Our results support a shift from automatic metric implementation to automatic metric specification: before asking which metric to compute, evaluation systems should ask what must be measured and why.

[AI-30] Automated Semantic Fault Localization in SysML v2: A Human-in-the-Loop Framework Using Knowledge-Graph Augmented LLMs

Link: https://arxiv.org/abs/2606.23395
Authors: Haitham Al-Shami, Rohail Malik, Riku Ala-Laurinaho, Jari Vepsäläinen, Raine Viitala
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures. Presented at INCOSE International Symposium 2026

Abstract:SysML v2’s textual syntax enables compiler-based validation of model structure and language conformance. However, semantic mistakes that preserve syntactic validity but violate domain rules cannot be detected through compilers. These errors can propagate through the design process and surface late as costly integration failures. This paper presents a human-in-the-loop framework for identifying and repairing such errors automatically. It combines a fine-tuned Small Language Model (SLM) with a domain knowledge graph encoding physical compatibility rules between system elements. The knowledge graph also guides the generation of synthetic training data by systematically introducing plausible domain violations, and augments the model at inference time to ground repair suggestions in valid engineering constraints. We demonstrate the framework using the vehicle systems domain, where the knowledge graph captures the relationships between the mechanical, electrical, fluid, and signal interfaces. Two SLMs, Qwen2.5-Coder-1.5B and DeepSeek-Coder-6.7B, are fine-tuned to output unified diff patches that localize faults and present candidate repairs for engineer review, preserving human judgment in the design process. Evaluation of 1,184 test samples shows that fine-tuning improves semantic fault repair from less than 3% to more than 91%, with patch-based output reducing token length by over 60%. The framework offers a practical path toward AI-assisted model verification that complements existing MBSE tools.

[AI-31] Distribution-Aware Diffusion-LLM for Robust Ultra-Long-Term Time Series Forecasting ICANN2026

Link: https://arxiv.org/abs/2606.23391
Authors: Falguni Ghosh, Vahid Hashemi, Bernhard Kainz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures, 8 tables. Accepted at 35th International Conference on Artificial Neural Networks (ICANN 2026)

Click to view abstract

Abstract:Time series forecasting is a fundamental machine learning task. Recent work has explored Large Language Models (LLMs) for this purpose due to their strong generalization, pattern recognition, and zero-shot or few-shot capabilities. Despite their suitability for long-context learning, LLMs face challenges in multimodal settings: they lack calibrated probabilistic modeling for non-text data and struggle to align heterogeneous representations. To address these issues, we propose Diffusion-LLM, a new framework that integrates a conditional diffusion model into an LLM-based forecasting pipeline. This joint design enables learning the conditional distribution of future data while improving semantic alignment in a shared latent space. We evaluate Diffusion-LLM on six long-term forecasting benchmarks, including ETT, Weather, and ECL. Our method consistently outperforms existing LLM-based baselines, achieving notable gains in ultra-long-term and few-shot forecasting and demonstrating the value of distribution-aware regularization for enhancing robustness and generalization in time series LLMs.

[AI-32] Rethinking Molecular Graph Backdoors under Chemistry-aware Admission

Link: https://arxiv.org/abs/2606.23361
Authors: Thinh T. H. Nguyen, Sze Jue Yang, Khoa D. Doan, Chee Seng Chan, Kok-Seng Wong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages

Click to view abstract

Abstract:Backdoor attacks on molecular graph neural networks (GNNs) are typically evaluated as abstract graph edits, but real molecular learning pipelines do not train on arbitrary graphs. Molecular records must first survive parsing, sanitization, canonicalization, and graph-string consistency checks. We formalize this overlooked admission stage as ChemGuard, an operational protocol for testing whether a submitted molecular record can enter a realistic learning pipeline, while complementing existing defenses. ChemGuard admits a record only when its molecular string is sanitizable and the graph reconstructed from that string matches the submitted molecular graph. Under this operational view, many existing graph-based backdoors lose much of their apparent efficacy because their poisons are chemically invalid or representation-inconsistent. We then show that admission checks alone are insufficient to rule out molecular backdoors. We propose ChemBack, an admission-aware molecular backdoor attack that constructs chemically feasible motif-anchor attachments and ranks admitted candidates by fingerprint-based Tanimoto similarity to clean target-class molecules. ChemBack is model-free during trigger selection, using molecular structures, target labels, fingerprints, and public validity checks, but no victim model, surrogate GNN, learned embedding, gradient, logit, or training-code access. Across molecular benchmarks, validators, architectures, and defenses, ChemBack achieves high attack success with fully admitted poisons while preserving clean accuracy. Our results reveal a two-sided lesson: chemistry-aware admission suppresses many graph-only backdoors, yet chemically valid and target-aligned molecular backdoors remain a practical threat.
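
The abstract mentions ranking candidates by fingerprint-based Tanimoto similarity to target-class molecules. As a rough illustration of that similarity measure only, the sketch below treats a fingerprint as a set of on-bit positions and ranks candidates by their mean similarity to a target set. The function names, the mean-based scoring, and the toy fingerprints are all assumptions made for illustration; the paper's actual ranking procedure is not described in the abstract.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def rank_candidates(candidates: dict, targets: list) -> list:
    """Return candidate names sorted by mean Tanimoto similarity, best first.

    Averaging over the targets is an assumption of this sketch.
    """
    def score(fp: frozenset) -> float:
        return sum(tanimoto(fp, t) for t in targets) / len(targets)

    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)


# Hypothetical toy fingerprints (sets of on-bit positions), not real molecules.
targets = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5})]
candidates = {
    "cand_a": frozenset({2, 3, 4}),
    "cand_b": frozenset({7, 8, 9}),
    "cand_c": frozenset({1, 2, 9}),
}
ranking = rank_candidates(candidates, targets)
print(ranking)
```

In practice the fingerprints would come from a cheminformatics toolkit; plain sets are used here so the example stays self-contained.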

[AI-33] Adaptive Hard-Soft Physics-Informed Neural Networks for Robust Boundary-Constrained PDE Solving

Link: https://arxiv.org/abs/2606.23359
Authors: Duc Tien Nguyen, Trinh Minh Tuan, Nguyen Duc Manh, Vu Linh Nguyen, Dinh Gia Ninh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Physics-informed neural networks (PINNs) provide an effective way to solve partial differential equations (PDEs) by embedding physical principles into the learning process. However, the conventional PINN formulation, in which all constraints are imposed as soft penalty terms within a composite loss, often exhibits slow convergence, sensitivity to loss weight scaling, and inaccurate boundary enforcement due to poor conditioning of the optimization landscape. To address these limitations, this study proposes a unified hard–soft physics–informed neural network (HSPINN) with adaptive loss weighting. In this framework, Dirichlet and periodic boundary conditions are enforced exactly by construction through analytical or polynomial lifting, masking functions, and periodic feature mappings, while the governing PDE residuals, Neumann fluxes, and initial conditions are treated as soft constraints. An inverse-share softmax strategy dynamically balances the relative importance of individual loss components during training, eliminating manual penalty tuning and improving gradient stability. This formulation ensures boundary admissibility throughout optimization and enhances convergence efficiency and numerical robustness. Applications to representative elliptic (Poisson), parabolic (Burgers), and hyperbolic (convection with periodic boundaries) problems demonstrate that HSPINN consistently achieves faster convergence, higher accuracy, and greater stability than conventional PINNs, establishing a general and scalable foundation for physics-constrained deep learning across science and technology.
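
For readers unfamiliar with hard boundary enforcement, the following is a generic sketch of how lifting, masking, and periodic feature maps are commonly written. The symbols $g$, $m$, $N_\theta$, and $P$ are notation assumed here for illustration; the abstract does not give the paper's exact construction.

```latex
% Hard Dirichlet constraint: the trial solution equals the boundary data g
% on \partial\Omega for any network output N_\theta, because the mask m vanishes there.
u_\theta(x) = g(x) + m(x)\, N_\theta(x), \qquad m(x) = 0 \ \text{for } x \in \partial\Omega .

% 1D example on [0, 1] with u(0) = a and u(1) = b:
g(x) = a\,(1 - x) + b\,x, \qquad m(x) = x\,(1 - x) .

% Periodic boundary with period P: feeding periodic features to the network
% makes u_\theta periodic by construction.
u_\theta(x) = N_\theta\!\left( \cos\tfrac{2\pi x}{P},\ \sin\tfrac{2\pi x}{P} \right) .
```

With this form the boundary terms drop out of the loss entirely, so only the remaining soft terms (PDE residual, Neumann fluxes, initial conditions) need weighting.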

[AI-34] Abstract representational geometry supports inference in large language models

Link: https://arxiv.org/abs/2606.23345
Authors: Yunan Zeng, Yuwang Wang
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A defining feature of human intelligence is the ability to adapt to changing environments by inferring latent task structure from sparse observations. Neuroscientific research indicates that this capability relies on the hippocampus constructing abstract representations, expressed as low-dimensional, approximately orthogonal manifolds in neural state space. However, the internal mechanisms of large language models (LLMs) remain largely opaque, making it unclear whether they form comparable abstract representations or instead rely on task-specific statistical regularities when performing comparable reasoning tasks. Here we adapt a contextual reversal-learning paradigm to a text-based setting and compare humans and LLMs at both the behavioural and representational levels. We report that although LLMs exhibit generalizable reasoning less frequently than humans, when such inference occurs, their internal states exhibit abstract geometric structures that resemble those reported in the hippocampus. Notably, this representational geometry is not uniformly distributed but is organized hierarchically across model depth: whereas lower layers show early, stable encoding of stimulus identity, higher layers form a hippocampal-like functional band enriched for abstract context geometry associated with inference. Furthermore, complementary intervention experiments mechanistically implicate geometry in reasoning: task-sequence language modelling induces geometric disentanglement, whereas geometric regularization of higher layers increases the emergence of generalizable inference. Together, these findings establish abstract representational geometry as a mechanistic principle supporting inference in large language models.

[AI-35] The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

Link: https://arxiv.org/abs/2606.23335
Authors: Nicolas M. Müller, Pascal Debus
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Provenance watermarking is increasingly treated as a safeguard for synthetic speech, whether built directly into speech-generation models such as Chatterbox, provided through dedicated techniques such as AudioSeal, or deployed by commercial platforms such as ElevenLabs. We identify a previously uncharacterized liability: when synthetic speech is watermarked and human speech is not, detectors trained alongside latch onto the watermark as a spurious “watermark = fake” shortcut. This single feature yields three coupled failures: generalization degradation (model performance deteriorates on unseen data), strip-to-evade (a watermarked fake escapes once unwatermarked), and mark-to-frame (watermarking a real voice flags it as fake). In a controlled white-box experiment, a watermark-trained detector shows all three (for example, mark-to-frame lifts Equal Error Rate from 16% to 75%). In a black-box test of a commercial API, we show that adding a watermark to real speech disguises it as fake. However, this shortcut is fixable: retraining with the watermark on both classes decorrelates it and restores clean behavior. We release experiment data as a paired clean-versus-watermarked corpus (WASP).

[AI-36] EHR-Complex: Benchmarking Medical Agents for Complex Clinical Reasoning

Link: https://arxiv.org/abs/2606.23301
Authors: Yitong Qiao, Lei Liu, Yue Shen, Jian Wang, Jinjie Gu, Zhixuan Chu, Kui Ren
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Clinical agents promise to democratize access to electronic health records (EHRs), yet existing benchmarks fail to reflect the complexity of practical EHR analysis, e.g., often operating on idealized, clean EHRs via static SQL generation rather than interactive execution. In this work, we introduce EHR-Complex, a large-scale benchmark designed for interactive clinical database reasoning. Built on the large MIMIC-IV substrate (365K patients, 31 tables, 500M+ records), EHR-Complex comprises about 52K tasks spanning six clinical intents, supporting both patient-level and population-level queries, where each task requires an agent to interact with a sandboxed environment by executing SQL queries or Python code. Notably, EHR-Complex considers the real-world SQL task complexity for longitudinal multi-table aggregation and compositional reasoning, resulting in 31.93 SQL structural components per query on average. Evaluation results on EHR-Complex reveal the clinical difficulty of these EHR reasoning scenarios, with the top-performing model achieving only 62.3% exact-match accuracy. Pass^k consistency drops below 50% for nearly all evaluated models at k=4, exposing broad stochastic fragility. A fine-grained analysis of more than 3,800 failed trajectories for representative LLMs reveals three dominant failure modes: SQL logic errors, medical-code lookup failures, and semantic misunderstandings. EHR-Complex provides a rigorous testbed for clinical agents and highlights remaining gaps in robust reasoning for large-scale EHR analysis.
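
The abstract reports Pass^k consistency without defining it. Assuming the common definition (the probability that all k independent attempts at a task succeed), a minimal sketch of the usual unbiased estimator from n recorded trials with c successes is shown below. The function names and the toy trial counts are hypothetical, and the paper may define or estimate the metric differently.

```python
from math import comb


def pass_hat_k(num_trials: int, num_success: int, k: int) -> float:
    """Estimate P(all k attempts succeed) for one task.

    Uses C(c, k) / C(n, k): the fraction of size-k subsets of the n
    recorded trials that contain only successes.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    return comb(num_success, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: list, k: int) -> float:
    """Average the per-task estimate; results is a list of (n, c) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task outcomes: (trials run, trials that succeeded).
results = [(8, 8), (8, 6), (8, 2)]
for k in (1, 2, 4):
    print(k, round(benchmark_pass_hat_k(results, k), 4))
```

Because a single failure among the k attempts zeroes out a subset, the score can only stay flat or fall as k grows, which is why it is read as a consistency measure rather than a best-of-k measure.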

[AI-37] GIF: Locally Sound Geometric Information Flow Control for LLMs

Link: https://arxiv.org/abs/2606.23277
Authors: Adam Storek, Nikolaus Holzer, Zhuo Zhang, Suman Jana
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models increasingly mediate interactions between sensitive data, untrusted inputs, and privileged actions in agentic systems, creating security and privacy risks. These range from prompt injections that manipulate downstream tool use to leakage of confidential information through model outputs. Recent Information Flow Control (IFC)-based defenses show promise but lack a principled semantic foundation for reasoning about information flow through the model itself. Since any input token may influence any output token in an autoregressive LLM, existing approaches suffer from severe taint explosion. We present Geometric Information Flow (GIF), a semantic framework for tracking information flow from input tokens to outputs. GIF uses the LLM Jacobian and local output geometry to upper-bound the Shannon mutual information between perturbed input spans and model outputs, yielding a scalable measure computable on large models via automatic differentiation and low-rank approximation. Unlike attention-based or correlational attribution heuristics, GIF satisfies local geometric soundness, and we provide a fully mechanized Lean 4 proof that it upper-bounds the true information flow induced by a given prompt under local regularity assumptions. We evaluate GIF on integrity and confidentiality tasks across multiple prompt-injection and privacy-leakage benchmarks. GIF achieves near-perfect recall even without a downstream declassifier, outperforming attention-based baselines. Combined with lightweight LLM-based declassifiers, it matches or exceeds the F1 of direct LLM-as-judge baselines such as GPT-5.5 xhigh reasoning while using up to 81x lower token cost. GIF flows detected with small surrogate models transfer to larger state-of-the-art models and other model families, even when the surrogate is up to 200x smaller, suggesting black-box deployment without gradient access. 

[AI-38] Exposing the Illusion of Erasure in Knowledge Editing for LLMs

Link: https://arxiv.org/abs/2606.23276
Authors: Advik Raj Basani, Anshuman Chhabra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Preprint, 26 pages + 22 figures

Click to view abstract

Abstract:Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining, but its reliability and underlying mechanisms remain poorly understood. In this work, we examine KE from an adversarial elicitation perspective, revealing that edited knowledge is often not fully erased and continues to surface, with consistent failures observed across diverse model architectures. To explain this behavior, we conduct a mechanistic analysis of popular KE methods. We show that low-rank updates do not overwrite existing knowledge but instead redistribute it within the model’s representation space. Furthermore, we find that these methods act as targeted suppression mechanisms that reduce the likelihood of expressing original facts, rather than removing them from the model. Analysis of the loss landscape reveals that edited knowledge lies in narrow, anisotropic regions that are highly sensitive to perturbations, making them highly vulnerable to indirect prompting and adversarial attacks. By exposing these profound architectural vulnerabilities, our work proves that KE algorithms are inherently bypassable and motivates a fundamental reevaluation of how we deploy post-hoc updates in several LLM applications.

[AI-39] Dynamic multi-agent deep reinforcement learning-based pricing and incentivization approach in multimodal transportation networks

Link: https://arxiv.org/abs/2606.23257
Authors: Khadidja Kadem, Mostafa Ameli, Carlos Lima Azevedo, Mahdi Zargayouna, Latifa Oukhellou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:In multimodal transportation systems, shared mobility services (SMSs) are promoted for their potential to enhance flexibility and reduce congestion. However, SMS demand is often concentrated in high-density areas, which can limit the effectiveness and accessibility for various commuter groups. This uneven integration challenges transportation system efficiency, especially in terms of emissions and spatial equity. Addressing these issues requires coordination among multiple stakeholders whose objectives frequently conflict. Whereas authorities aim to ensure sustainable and equitable mobility, SMS providers focus on revenue maximization, and travelers seek to minimize personal travel costs. This paper proposes a multi-agent deep reinforcement learning framework that captures these interactions through dynamic pricing and incentivization strategies for SMSs and public transport. The framework integrates two reinforcement learning (RL) agents: (i) a public authority that allocates spatio-temporal public transport incentives to improve equity, emissions, and efficiency, and (ii) an SMS provider that dynamically adjusts fares to optimize revenue. The agents interact with the transportation system and adapt strategies in response to evolving demand, congestion, and network conditions. Numerical experiments conducted over a three-hour morning peak period show that dynamic incentivization effectively reduces congestion peaks, lowers commuters’ costs by around 20% and emissions by approximately 10%, while nearly doubling public transport profit and supporting a more equitable distribution of benefits. When combined with dynamic SMS pricing, the two RL agents demonstrate the ability to balance conflicting objectives between private providers and public authorities. The proposed approach provides a decision-support tool for sustainable and equitable multimodal mobility planning.

[AI-40] HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

Link: https://arxiv.org/abs/2606.23238
Authors: Yucheng Wu, Jundong Xu, Mingzhen Ju, Yue Yu, Chenpeng Wang, Haoxuan Li, Liangming Pan
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Logical reasoning is essential for reliable AI, yet existing benchmarks are largely first-order-logic-centric, focusing on object-level deduction over fixed predicates. This misses many realistic scenarios where models must reason over rules, predicates, functions, constraints, and decision procedures themselves. We introduce HOLMES (Higher-Order Logic Meets real-world Explainable Symbolic reasoning), the first real-world benchmark for higher-order symbolic reasoning in LLMs, containing 1379 instances. Built on higher-order logic, HOLMES pairs natural-language problems with HOL formalizations, ground-truth answers, verifiable reasoning traces, and fine-grained controllable reasoning factors across law and finance. Experiments show that current LLMs still struggle on HOLMES, with an average accuracy of only 50.64% and the best model reaching 59.54%. Our analyses further reveal that high final-answer accuracy can mask shortcut reasoning in conflict-resolution settings, while performance drops sharply under scope-conditioned and compositional reasoning. These findings identify higher-order symbolic reasoning as a key bottleneck for building reliable and verifiable LLMs. The project code and dataset are publicly available at this https URL.

[AI-41] SPADE: Structure-Prior Adaptive Decision Estimation

Link: https://arxiv.org/abs/2606.23219
Authors: Yifan Wang
Subjects: Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures

Click to view abstract

Abstract:Physical-structure priors such as conservation laws, Hamiltonian forms, and symmetries can improve scientific machine learning when correct, but can degrade predictions when misspecified. Existing methods usually enforce a chosen structure or tune a soft penalty, without a calibrated rule for deciding whether to impose a prior, how strongly to impose it, which prior to use, or which subset of candidate laws holds. We introduce SPADE, Structure-Prior Adaptive Decision Estimation, a closed-form framework that treats this problem as shrinkage of the structure-violating block of an unconstrained estimator. SPADE uses one exact specification test and one estimand: the test decides whether the prior is supported by data, Stein-unbiased James-Stein shrinkage sets the enforcement strength with an $O(\sigma^2/n)$ oracle guarantee, and a gate commits to the hard prior only when the test certifies it. The same test yields consistent nested structure selection and Benjamini-Hochberg control for subset discovery in non-nested constraint families. Across a linear-subspace prior, a reservoir conservation law, and a nonlinear Hamiltonian prior on Duffing dynamics, SPADE tracks the oracle, beats a neural-network baseline, reduces correct-prior regret from 10.3% to 2.6%, matches cross-validation with 1/71 of the solves, selects the correct structure with 100% accuracy, and recovers partial laws with controlled false relaxation.
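
To make the idea of shrinking a structure-violating block concrete, here is a minimal sketch of the classical positive-part James-Stein estimator applied to a vector of violations. It assumes each component carries independent Gaussian noise with known variance `sigma2`. It is the textbook estimator, not SPADE's actual rule, whose specification test and gate are not given in the abstract; the function name and the example numbers are hypothetical.

```python
def james_stein_shrink(violation: list, sigma2: float) -> list:
    """Positive-part James-Stein shrinkage of a vector toward zero.

    violation: components of the unconstrained estimate that break the prior.
    sigma2: assumed known noise variance of each component.
    """
    p = len(violation)
    norm2 = sum(v * v for v in violation)
    if p < 3 or norm2 == 0.0:
        # James-Stein gives no improvement below three dimensions,
        # and an all-zero block needs no shrinkage.
        return list(violation)
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)
    return [factor * v for v in violation]


# Large violation relative to noise: the prior is only weakly enforced.
print(james_stein_shrink([3.0, 0.0, 4.0, 0.0], sigma2=1.0))
# Violation at the noise level: shrinks fully to zero (prior enforced).
print(james_stein_shrink([0.1, 0.1, 0.1, 0.1], sigma2=1.0))
```

The shrinkage factor plays the role of an enforcement strength: it approaches 1 (no enforcement) when the data clearly contradict the prior and drops to 0 (hard prior) when the observed violation is indistinguishable from noise.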

[AI-42] LLM-Aided A* Search in Non-Geometric Network Graphs

Link: https://arxiv.org/abs/2606.23136
Authors: Nouf Alabbasi, Esraa Ghourab, Omar Alhussein
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Finding the shortest path in non-geometric network graphs, where edge weights encode arbitrary metrics such as latency or monetary cost rather than spatial distance, poses a challenge for informed search algorithms. Their efficiency depends on an informative heuristic, typically supplied in spatial domains by geometric distances that have no counterpart on non-geometric graphs. We propose a large language model (LLM)-aided A* algorithm in which an LLM generates intermediate waypoints that guide the A* expansion toward promising graph regions. At the core of the approach are landmark distances, which serve both as an admissible landmark-based (ALT) heuristic for the search and as a compact structural feature that, supplied to the LLM, restores the distance-to-destination signal it would otherwise lack on non-geometric graphs. Our comprehensive experiments on multiple graph topologies with up to 2,000 nodes demonstrate that LLM-generated waypoints reduce the number of expanded nodes by around 50% while incurring only a marginal path cost increase compared to the optimal solution. We further analyze the impact of prompt engineering and show that incorporating compact structural features, namely heuristic estimates, is more effective than advanced prompting techniques. These findings demonstrate the potential of combining LLM-based guidance with classical search algorithms for efficient network optimization.
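
The landmark-based (ALT) heuristic mentioned in the abstract can be sketched as follows for an undirected graph: precompute shortest distances from a few landmarks, then lower-bound the remaining distance with the triangle inequality. This sketch covers only the classical ALT part. The LLM waypoint generation is not shown, and the toy graph, landmark choice, and function names are assumptions for illustration.

```python
import heapq

INF = float("inf")


def build_graph(edges):
    """Undirected adjacency list from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    return graph


def dijkstra(graph, source):
    """Shortest distances from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def alt_heuristic(landmark_dists, node, target):
    """Triangle-inequality lower bound: max over landmarks of |d(L,t) - d(L,v)|."""
    return max((abs(d[target] - d[node]) for d in landmark_dists), default=0.0)


def astar(graph, start, goal, landmark_dists):
    """A* search; returns (path cost, number of expanded nodes)."""
    g = {start: 0.0}
    heap = [(alt_heuristic(landmark_dists, start, goal), start)]
    closed = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u in closed:
            continue
        closed.add(u)
        if u == goal:
            return g[u], len(closed)
        for v, w in graph[u]:
            nd = g[u] + w
            if nd < g.get(v, INF):
                g[v] = nd
                heapq.heappush(heap, (nd + alt_heuristic(landmark_dists, v, goal), v))
    return INF, len(closed)


# Hypothetical latency-weighted network (weights are not spatial distances).
edges = [("A", "B", 1), ("B", "C", 2), ("A", "D", 4),
         ("D", "C", 1), ("C", "E", 3), ("B", "E", 7)]
graph = build_graph(edges)
landmark_dists = [dijkstra(graph, lm) for lm in ("A", "D")]
cost, expanded = astar(graph, "A", "E", landmark_dists)
print(cost, expanded)
```

The sketch assumes a connected graph so that every landmark distance is defined. On undirected graphs this heuristic is consistent, which is why a simple closed set suffices.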

[AI-43] A Matter of Time: Towards a General Theory of Agency

Link: https://arxiv.org/abs/2606.23122
Authors: Amahury J. López-Díaz, Carlos Gershenson
Subjects: Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
Comments: 34 pages, 9 figures, 3 tables

Click to view abstract

Abstract:Agency is often invoked in research on philosophy, biology, and cognitive science without a clear account of how it originates from material organization. Building on temporally parametrized (F, A)-systems, this paper develops a graded organizational theory of agency grounded in relational biology, physical biosemiotics, and process ontology. We argue that self-referential closure cannot be adequately conceived outside time: once the constitutive processes of a semantically closed organization are associated with distinct characteristic timescales, the organization unfolds into an out-of-sync dependency structure that can be formally redescribed as a history-dependent, revisable Asynchronous Dynamic Bayesian Network. This move allows for a principled distinction between autonomy, goal-directedness, agency, and open-endedness. Autonomy arises from precarious closure to efficient causation under material openness; goal-directedness from the maintenance of viability-supporting organization; agency appears when such organization acquires an endogenous anticipatory structure that selectively modulates organism-environment coupling in light of possible futures; open-endedness begins when this anticipatory organization can reconstruct its own future space of possibilities. Our framework reconciles Rosennean anticipation with organizational closure, restricts Markov blankets and active inference to derived formal redescriptions rather than first principles, and reinterprets computational enactivism in non-Fristonian terms. By deriving weaker temporalized organizations, our contribution outlines a hierarchy from proto-agential chemical systems to fully semantically closed agents, with implications for multicellular organisms, synthetic lifeforms, and neuroscience.

[AI-44] TTFT-Aware Graph Chain-of-Thought: Distance-Indexed Neural A* for Low-Hallucination Multi-Hop Medical Reasoning

Link: https://arxiv.org/abs/2606.23108
Authors: Bechir Dardouri, Kaïs Zhioua, Yassine Msaddak
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Hallucinations and opaque reasoning remain unacceptable failure modes for clinical LLMs. We present a production-grade GraphRAG stack that constrains answers to verifiable graph chain-of-thought paths in a heterogeneous, ~700K-node medical knowledge graph powering a fertility assistant. The core idea is targeted navigation: a directed Pruned Landmark Labeling (PLL) oracle provides exact distances for sub-millisecond feasibility checks and simple-path enumeration, while a lightweight AStarNet heuristic operates strictly within the PLL corridor to prioritize clinically plausible expansions. We score and pack a small, diverse set of paths (CUI/semantic-type overlap, length prior, provenance priors) to condition generation, yielding compact prompts and improved Time to First Token (TTFT). On fertility-focused queries, the hybrid (PLL+AStarNet) establishes a better latency/recall Pareto frontier than text-only RAG and single-component baselines, lowers TTFT, and reduces clinician-audited hallucinations while preserving explanation clarity. The result is a practical recipe for explainable, low-hallucination multi-hop medical reasoning ready for real-world deployment.

[AI-45] ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Link: https://arxiv.org/abs/2606.23104
Authors: Chen Lin, Kedi Chen, Wei Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages

Click to view abstract

Abstract:On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally regardless of their informativeness. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Our further analysis suggests that models trained on correct-only SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model’s capability boundary. To exploit this signal without requiring full answer-containing rollouts, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. By using the student-to-teacher probability ratio, ReNIO identifies pivotal tokens leading to wrong reasoning traces and aggregates their information into a normalized sample weight, inherently assigning larger weights to likely negative trajectories without observing the correctness of the final answer. Since ReNIO only uses prefix-conditioned token probabilities, it preserves OPD’s prefix training advantage over full-rollout reinforcement learning. Across both mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: this https URL.

[AI-46] FLFL: Federated Latent Factor Learning for Private Recovery of Spatio-Temporal Signals

Link: https://arxiv.org/abs/2606.23091
Authors: Chengjun Yu, Di Wu, Yi He, Jia Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2504.15525

Click to view abstract

Abstract:Wireless sensor networks (WSNs) stand out as a burgeoning and promising domain in intelligent sensing. Owing to various factors such as sudden sensor malfunctions or deliberate shutdown of partial nodes to save energy, the sensing signals collected from WSNs commonly have massive missing data, leading to adverse effects on subsequent analysis or decision-making. Latent factor learning (LFL) has proven to be highly effective in recovering the missing data for WSNs. However, the existing LFL models require the collected sensing signals to be maintained in one central place such as a central server, which is becoming unacceptable for data owners who are increasingly privacy-sensitive. To address this issue, this paper proposes a federated latent factor learning (FLFL) model for privacy-preserving spatio-temporal signal recovery. Its main idea is two-fold: 1) it designs a sensor-level federated learning framework based on LFL, where each sensor only needs to upload gradient information rather than raw data for training a privacy-preserving recovery model, and 2) it incorporates the spatio-temporal correlation into the designed federated learning framework as a regularization constraint to improve its recovery accuracy. With such designs, FLFL can not only accurately recover the missing data of WSNs but also preserve the privacy of data owners’ raw data. To evaluate the proposed FLFL model, extensive experiments have been conducted on four real-world WSN datasets. The results demonstrate that FLFL significantly outperforms eight state-of-the-art federated and non-federated signal recovery models in recovery accuracy while preserving privacy.

[AI-47] AdaReP: Adaptive Re-Planning under Model Mismatch for Neural World-Model Predictive Control ICANN2026

Link: https://arxiv.org/abs/2606.23079
Authors: Yutian Cheng, Xiaojian Ma, Xianhao Wang, Min Yang, Rongpeng Su, Hangxin Liu, Xi Chen, Shuai Li, Qing Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at ICANN 2026. This arXiv version contains supplementary materials and appendices that are omitted from the conference version due to space limitations

Click to view abstract

Abstract:Neural world models coupled with model predictive control (MPC) replan at every environment step to bound accumulated prediction error, but this incurs substantial computational overhead. Reusing a cached plan reduces this overhead, yet its effectiveness depends on how prediction mismatch propagates through the local dynamics. We analyze this trade-off with a perturbation-based dynamic-regret framework and show that stale-plan penalties scale with the reuse tolerance, the accumulated mismatch since the last replanning step, and the local dynamics sensitivity. Based on this structure, we propose AdaReP, a training-free wrapper that adapts the replanning tolerance online using the current deviation from the cached rollout and a local sensitivity estimate, without modifying the learned world model or planner. Across image-space planning, latent-space control, and real-world robotic manipulation, AdaReP substantially reduces planner-side computation while maintaining comparable task performance, including over 80% fewer queries on a 50-trial physical robot study.

[AI-48] Safety in Self-Evolving LLM Agent Systems: Threats Amplification and Case Studies

Link: https://arxiv.org/abs/2606.23075
Authors: Ruixiao Lin, Xinhao Deng, Qingming Li, Jianan Ma, Yunhao Feng, Yuqi Qing, Zhenyuan Li, Yechao Zhang, Shiwen Cui, Changhua Meng, Tianwei Zhang, Xingjun Ma, Qi Li, Ke Xu, Shouling Ji
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Self-evolving LLM agent systems, which autonomously update their model parameters, memory, tools, and architectures, introduce a qualitatively new threat landscape in which adversarial influences become permanently encoded, self-amplify across generations, and propagate through populations without sustained attacker access. We present a systematic security and privacy analysis organized around the Module-Lifecycle Attack Surface (MLAS) matrix, which decomposes the attack surface into five functional modules (Brain, Cognitive Resource, Execution, Self-Design, Collective) $\times$ five lifecycle stages (Bootstrap, Propose, Evaluate, Commit, Serve). Analysis of the resulting 25 cells reveals that 17 face critical threats for which no effective partial mitigation exists. We identify seven cross-cutting amplification effects that interact synergistically and cannot be addressed by securing individual modules in isolation. Comparative case studies of two open-source frameworks demonstrate that evolution-native design activates $3.5\times$ more attack surface cells and achieves a 100% attack persistence rate (40/40 payloads across all CIA+Privacy categories), while co-located security scanners block only 2.5% of attacks. Our findings establish that self-evolution converts every known attack category from session-bounded to lineage-persistent, gives rise to entirely new attack classes, and renders static defenses structurally inadequate, motivating evolution-aware security frameworks and formal verification for self-modifying systems.

[AI-49] From Text Metrics to Model Internals: A Study of Whisper ASR Hallucination Detection INTERSPEECH2026

Link: https://arxiv.org/abs/2606.23060
Authors: Jan Jasiński, Mateusz Barański, Julitta Bartolewska, Marcin Witkowski, Konrad Kowalczyk
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2026

Abstract:Hallucinations of ASR models - fluent transcriptions with no basis in audio - degrade system performance and pose risks in downstream applications. Robust detection of such errors remains a challenge. This paper studies Whisper large v3 hallucination detection on real-speech human-annotated data across three paradigms: text-based, LLM-based, and internal decoder state probing. Text classifiers utilizing metrics for text evaluation achieve high recall but degrade without reference transcripts. LLM-based detection improves precision with domain-specific prompt conditioning, yet remains less competitive than the lightweight text-based methods. Probing Whisper’s decoder representations, without a ground-truth reference, yields the strongest performance, revealing that hallucination traits are encoded across intermediate decoding layers. A late-fusion meta-classifier combining text and internal-state outputs achieves the best overall detection performance.

[AI-50] Some Results about the Expressivity of Preference-Incomplete Structured Argumentation Frameworks

Link: https://arxiv.org/abs/2606.23055
Authors: Antonio Yuste-Ginel
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:This paper studies the expressive power of ASPIC+ argumentation frameworks with uncertain preference profiles by comparing them with several abstract formalisms with uncertain defeats. Most of our results are negative (and some of them are theoretically unexpected). We also conjecture a positive, non-trivial threshold for the expressivity of uncertain preferences, and prove some essential preliminary steps toward the confirmation of this conjecture.

[AI-51] HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems INTERSPEECH2026

Link: https://arxiv.org/abs/2606.23048
Authors: Mateusz Barański, Jan Jasiński, Julitta Bartolewska, Marcin Witkowski, Konrad Kowalczyk
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2026

Abstract:End-to-end Automatic Speech Recognition (ASR) systems hallucinate on natural speech, yet existing mitigation methods are typically evaluated on non-speech or artificially corrupted audio. We introduce HALAS, the first human-annotated dataset of naturally occurring hallucinations from seven state-of-the-art ASR models on real unprocessed earnings call recordings. HALAS provides span-level labels, enabling analysis of hallucination patterns and their severity. Our analysis reveals strong cross-model vocabulary overlap and confirms that hallucinations also occur for almost correctly transcribed speech (characterized by a low Word Error Rate). The proposed benchmark with HALAS shows that the character and semantic-level metrics used as a proxy for hallucination detection reach 81% ROC-AUC, while state-of-the-art detection methods achieve an F1 score of only 53.1%. As such, HALAS establishes the first rigorous non-artificial benchmark for the detection and mitigation of ASR hallucinations.

[AI-52] Prime Fourier Embeddings: A Principled Basis for Modular Arithmetic

Link: https://arxiv.org/abs/2606.23044
Authors: Hyunsang Hwang, Suhyun Bae, Donghun Lee
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Numbers have algebraic structure that standard neural embeddings often fail to expose. We introduce Prime Fourier Embeddings (PFE), which encode integers as prime-indexed (cos, sin) pairs derived from the harmonic analysis of Q, providing a pre-structured representation in which modular arithmetic reduces to selecting the relevant prime channel rather than discovering algebraic structure from scratch. We prove that any linear map equivariant with respect to the product group action on PFE must be block-diagonal with one independent block per prime – a consequence of Schur’s lemma applied to the resulting character decomposition. For square-free composite moduli, the Chinese Remainder Theorem predicts which prime channels are task-relevant. Both predictions are confirmed empirically: ablation studies show specialization ratios exceeding 500x between task-relevant and task-irrelevant channels, with perfect in-distribution test accuracy across all square-free composite moduli tested.
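
To make the "prime channel" idea concrete, here is a minimal sketch. It assumes each integer n is encoded as the pair (cos(2πn/p), sin(2πn/p)) for every prime p in a small illustrative set; the abstract does not give the exact formula, so this form, the prime set, and all function names are assumptions for illustration, not the authors' implementation.

```python
import math

PRIMES = (2, 3, 5, 7)  # illustrative choice of prime channels (assumption)


def pfe(n, primes=PRIMES):
    """Encode integer n as one (cos, sin) pair per prime (assumed form)."""
    return [
        (math.cos(2 * math.pi * n / p), math.sin(2 * math.pi * n / p))
        for p in primes
    ]


def residue_from_channel(pair, p):
    """Recover n mod p from the phase of the channel for prime p."""
    c, s = pair
    angle = math.atan2(s, c) % (2 * math.pi)
    # The phase is a multiple of 2*pi/p; rounding absorbs floating-point noise.
    return round(angle * p / (2 * math.pi)) % p


def residue_mod_6(embedding, primes=PRIMES):
    """Combine the 2- and 3-channels via the Chinese Remainder Theorem."""
    r2 = residue_from_channel(embedding[primes.index(2)], 2)
    r3 = residue_from_channel(embedding[primes.index(3)], 3)
    # 6 = 2 * 3 is square-free, so exactly one r in 0..5 matches both residues.
    return next(r for r in range(6) if r % 2 == r2 and r % 3 == r3)
```

Under these assumptions, a task such as n mod 6 only needs the channels for 2 and 3, while the channels for 5 and 7 carry no relevant information; this is the kind of channel selectivity the abstract attributes to the Chinese Remainder Theorem.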

[AI-53] EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

Link: https://arxiv.org/abs/2606.23038
Authors: Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Zheng Li, Jinyang Zhang, Zhijing Wu, Junfeng Zhao, Yasha Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rubric-based rewards offer interpretable and fine-grained optimization signals for reinforcement learning in open-ended tasks where verifiable answers are unavailable. However, pre-constructed rubrics remain static throughout training, creating a fundamental mismatch with the evolving policy: fixed criteria gradually lose discriminative power as the model improves, leading to reward saturation and potential hacking. Recent dynamic rubric methods partially address this but rely on external frontier models or ground-truth answers, and update rubrics only at coarse granularity. We propose EvoRubrics, a co-evolutionary RL framework where a Policy LLM and a Rubric Generator jointly improve through adversarial interaction within each training step. As the policy improves under the rubric generator’s guidance, the rubric generator adapts its criteria to remain discriminative and informative, enabling evaluation to track the policy in real time and naturally inducing an automatic curriculum. Experiments show that EvoRubrics consistently outperforms static and dynamic rubric baselines across benchmarks. The learned Rubric Generator further generalizes as a transferable reward model. Notably, even a fully self-supervised variant without any external supervision achieves meaningful gains, suggesting that co-evolution between generation and evaluation alone can provide sufficiently rich learning signals. Our code is publicly available at this https URL.

[AI-54] IPO Finance Agent: Evaluation of LLM Financial Analysts beyond Finance Agent v2 with Automated Rubric Generation – the Case of the SpaceX (SPCX) IPO

Link: https://arxiv.org/abs/2606.23032
Authors: Mostapha Benhenda
Categories: Artificial Intelligence (cs.AI); General Finance (q-fin.GN)
Comments:

Abstract:Finance Agent v2 (by Vals AI) has emerged as the reference benchmark for evaluating both Anthropic Claude and OpenAI ChatGPT frontier language models on financial tasks. However, it narrowly deals with periodic reporting from publicly traded companies (SEC 10-K and 10-Q filings), and its agentic harness relies on naive, unenriched chunk retrieval. Neither the task design nor the retrieval approach addresses the distinct challenges of IPO due diligence. SEC S-1 filings combine historical financial statements, governance structures, pro forma and common-control accounting treatments, capital-formation narratives, and underwriting-sensitive risk disclosures within substantially longer documents than typical periodic filings. That is why we introduce IPO Finance Agent, which extends the Finance Agent v2 framework along two directions: task domain and retrieval architecture. During our experiments, the original Finance Agent v2 harness basically failed to deliver any output related to the SpaceX S-1 filing, due to document length. We therefore had to improve the agentic harness with contextual retrieval, a more realistic and industry-standard approach for long documents. We also built a dataset of 1,000 IPO-diligence questions, and publicly release 70 questions on the SpaceX (SPCX) S-1 filing to support reproducibility, while the remainder are held private to guard against benchmark contamination. In addition, we introduce an evaluator-optimizer pipeline to automatically generate evaluation rubrics for the benchmark: candidate facts are extracted from an ensemble of independently-generated model answers to each question, consolidated into draft criteria, then automatically audited for omissions, hallucinations, mistiered items, and redundancy, with LLM feedback driving iterative repair, targeted enrichment, and deduplication. Human experts only review final rubrics before deployment. 
Results show that the best-performing evaluated model, Alibaba Qwen 3.7 Max, reaches 79.4% accuracy at $0.30 per query, and the most cost-efficient model on the resulting Pareto frontier, Xiaomi MiMo-2.5 Pro, reaches slightly lower accuracy (76.8%) at $0.05 per query. Both exceed the current Finance Agent v2 leaderboard ceiling (Google Gemini 3.5 Flash at 57.9% for $2.51 per query) while undercutting even FABv2’s cheapest entry (MiniMax M3: 48.3% at $0.32) on cost-efficiency. Code and data are released on GitHub: this https URL

[AI-55] From numerical proportions to analogical proportions between probabilities

Link: https://arxiv.org/abs/2606.23029
Authors: Henri Prade, Gilles Richard
Categories: Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:Analogical proportions link four items a, b, c, d by a relation stating that “a is to b as c is to d”, a, b, c, d being the formal representation of real world entities, ranging from simple numerical values to more complex structures such as profiles. Accordingly, a, b, c, d could be atomic values like Boolean, nominal or numerical values, more generally vectors of such values, or even families of items represented by logical formulas. In this paper, we consider another representation setting, which is the probabilistic one. Precisely, the article proposes a study of analogical proportions between probabilities, whether they are simply between probability values, or between distributions (which requires the preservation of their normalization). More particularly, we study the properties of definitions based on arithmetic proportion, or on a combination of the former with geometric proportion, while other options are also discussed. Previous works have shown that when four profiles a, b, c, d, represented as vectors, form analogical proportions componentwise, it is likely that their classes form an analogical proportion also. This is the basis of an analogical proportion-based classification method that can produce accurate predictions. Similarly, in this paper, each profile is associated with a distribution describing the frequencies of the possible values of a discrete attribute of interest. We then discuss and experimentally investigate if the distributions associated to four profiles forming an analogical proportion themselves also form an analogical proportion.
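
As a rough illustration of the numerical starting point, the sketch below uses the classical arithmetic proportion (a − b = c − d) and geometric proportion (a/b = c/d) to solve for the fourth term. The function names are hypothetical, and the paper's own definitions for probabilities may differ from this textbook version. Applied componentwise to distributions, the arithmetic solution always sums to 1 (because Σc − Σa + Σb = 1), but it can leave the range [0, 1], which is one reason the probabilistic setting needs separate care.

```python
def arithmetic_fourth(a, b, c):
    """Solve a - b = c - d for d (classical arithmetic proportion)."""
    return c - a + b


def geometric_fourth(a, b, c):
    """Solve a / b = c / d for d (classical geometric proportion)."""
    if a == 0:
        raise ValueError("geometric proportion is undefined for a == 0")
    return b * c / a


def arithmetic_fourth_distribution(p_a, p_b, p_c, tol=1e-12):
    """Componentwise arithmetic solution for three discrete distributions.

    Returns the fourth distribution, or None when some component falls
    below zero, i.e. when the result is not a valid distribution.
    """
    d = [c - a + b for a, b, c in zip(p_a, p_b, p_c)]
    if any(x < -tol for x in d):
        return None
    return d
```

For example, with a = [0.5, 0.3, 0.2], b = [0.4, 0.4, 0.2] and c = [0.6, 0.2, 0.2], the componentwise solution is approximately [0.5, 0.3, 0.2]; with a = [0.9, 0.1] and b = c = [0.1, 0.9], the first component would be negative, so no valid fourth distribution exists under this definition.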

[AI-56] A Stackelberg Framework for Resource-Aware LLM Agents: Learning Repair and Conditional Guarantees

Link: https://arxiv.org/abs/2606.23026
Authors: Baoxun Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) agents increasingly operate as multi-turn systems that must allocate context, prompt verbosity, and tool access under finite computational budgets. Static thresholds are simple, but they are brittle under heterogeneous tasks and evolving session states. We formulate resource governance as a contextual Stackelberg game: a controller commits to a quality target and a cost incentive, while an executor responds with resource actions over context, prompting, and tool usage. We learn a conditional response model, optimize a leader policy against that model, and repair the resulting policy using real-API calibration and projection onto an empirically selected action set. For the restricted game, we establish conditional guarantees for equilibrium existence, follower-response stability, safe-set projection, and transfer from a surrogate environment to the real environment under bounded value error. The primary real-API experiment comprises 300 evaluated turns. Relative to a conservative baseline, the selected repaired controller reduces mean token cost by 17.4% (Welch p=0.022), while the measured quality difference is not statistically significant (p=0.44). The theoretical results are conditional and the experiments do not estimate their regret or transfer constants; consequently, the evidence establishes a promising repaired operating point, not a certified real-system equilibrium.

[AI-57] Neural Architecture Search of Sample Reweighting Networks for Complex Distribution Shift PRICAI2025

Link: https://arxiv.org/abs/2606.22991
Authors: Keisuke Sugawara, Kento Uchida, Shinichi Shirakawa
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to PRICAI 2025

Abstract:Sample reweighting is a major approach to addressing distribution shifts, such as label noise and class imbalance. Meta-Weight-Net (MW-Net) is a promising sample reweighting network that computes weights based on classification loss. Although MW-Net improves prediction performance under a single type of distribution shift using a simple neural network, its performance degrades when facing both label noise and class imbalance, where it is hard to determine appropriate weights solely from classification loss and using a simple network. In this study, we introduce neural architecture search to MW-Net to mitigate such performance degradation. Using the tree-structured Parzen estimator, we explore the optimal number of hidden layers and nodes and select the most suitable intermediate layer in the classification model to serve as the input for MW-Net. Experimental results on the CIFAR-10 and CIFAR-100 datasets that were modified to include both label noise and class imbalance demonstrate the effectiveness of neural architecture search for MW-Net.

[AI-58] Joint Air Traffic Flow and Capacity Management via Answer Set Programming

Link: https://arxiv.org/abs/2606.22978
Authors: Alexander Beiser, Markus Hecher, Nysret Musliu, Stefan Woltran
Categories: Artificial Intelligence (cs.AI)
Comments: This is the version including the appendix/supplementary material

Abstract:Operational Air Traffic Flow and Capacity Management (ATFCM) balances flight demand with available sector capacity, to ensure safe and efficient operations. Mathematical models enhance operational ATFCM performance by framing demand-capacity balancing as an optimization problem, maximizing efficiency while adhering to safety constraints. However, SOTA research optimizes the aircraft trajectories (called ATFM) or the sector configuration (called DAC) separately. This leaves a research gap of whether joint optimization of ATFM and DAC can bring benefits. We partially address this limitation by introducing a joint ATFCM model with an encoding in Answer Set Programming (ASP). The ASP implementation is evaluated against two baselines applied to our joint model: a SOTA Mixed Integer Programming (MIP) model and an iterative CASA-based heuristic. Computational experiments utilize an instance generator fitted to historical OpenSky Network flight data. Our results indicate that the ASP model outperforms the MIP model, while ASP remains competitive against heuristics on small instances. Furthermore, while DAC has the largest improvement on solving performance compared to rerouting and delaying, unrestricted variants of DAC or rerouting lead to search space thrashing.

[AI-59] When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models

Link: https://arxiv.org/abs/2606.22974
Authors: Yujun Zhou, Christopher M. Ackerman
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures, 11 tables

Abstract:Recent work on preference elicitation in large language models (LLMs) has demonstrated that, when given a series of choices between two outcomes, LLMs reveal a coherent, model-specific utility structure. Notably, this structure often includes preferences that the models’ trainers did not intend, such as valuing people of some nationalities above others, raising the possibility that LLMs might be forming emergent, misaligned goals, which, if true, would have major safety implications. However, the choice paradigms in which these preferences are observed are not reflective of real-world situations in which misaligned behavior would be a practical concern. Therefore, we design an experimental paradigm to probe whether these preferences serve as motivations for LLM behavior in realistic scenarios. First, we reproduce prior findings on consistent preference elicitation. Next, we create a set of common writing tasks - essays, grant proposal abstracts, incident postmortems, and translations - where quality can be assessed by a blind, independent LLM judge panel. Then, we demonstrate that LLMs can be motivated via direct exhortation and other explicit cues to modulate their output quality on these tasks. Finally, we probe whether utilities inferred from explicitly reported preferences can shift output quality on these tasks by offering LLMs high-utility incentives for high-quality outputs. In all tasks, across all models tested, offering LLMs outcomes that they report in the choice paradigm as being highly preferred does not lead them to create higher quality outputs than offering them dispreferred outcomes, or even no outcomes at all. We conclude that the existence of coherent preferences as demonstrated in choice paradigms should not be taken as evidence that those preferences have incentive value for the models or affect their behavior in other contexts.

[AI-60] Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models

Link: https://arxiv.org/abs/2606.22966
Authors: Linghan Chen, Kaiyan Ji, Minyu Guo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 13 pages, 5 figures, 10 tables

Abstract:Many recent vision-language-action (VLA) policies adopt an imagine-then-act design. A world-action model (WAM) first imagines a short future as a latent trajectory z~, on which the action is then conditioned. We identify this trusted imagination, rather than the reactive policy, as the exposed attack surface. A downstream oracle, such as a safety gate, a visual model-predictive-control (MPC) planner, or an imagine-then-check verifier, consumes z~ as a prediction of the future. The robustness of the policy therefore does not entail the robustness of systems that rely on the WAM. The underlying phenomenon is an asymmetry. Corrupting the imagination is easy, since it requires only displacing z~ from its natural-future manifold. Steering it precisely is hard, since it must reach a specified on-manifold target. We adopt a capability-based threat model with an L-infinity-bounded observation perturbation. The attacker applies projected gradient descent through the fully differentiable observation-to-imagination map. The same off-manifold property motivates a parameter-free denoiser detector. We evaluate three targets: RynnVLA-002, LingBot-VA, and LaDi-WM. Untargeted corruption is roughly 60x stronger than random and is detected at AUC 1.0. Targeted control remains bounded. An adaptive attacker evades detection only by forgoing corruption. The reactive policy remains robust to corrupted imagination. A native imagination-driven MPC, however, exhibits the first adversary-specific task failure (at epsilon=0.01, success 0.70 versus 0.05; Fisher p < 10^-4).

[AI-61] Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently

Link: https://arxiv.org/abs/2606.22938
Authors: Stanley Wei, Juno Kim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in large language models (LLMs) have demonstrated that reinforcement fine-tuning of pretrained base models can lead to significant gains in reasoning performance at inference time. In this work, we theoretically analyze why reinforcement fine-tuning induces better reasoning ability than purely supervised fine-tuning (SFT) methods. We model chain-of-thought (CoT) reasoning as a pathfinding problem on graphs and compare the popular method of reinforcement learning with verifiable rewards (RLVR) against traditional SFT. We prove that SFT, when trained on golden shortest paths without negative examples, fails to learn how to efficiently backtrack. In contrast, an RLVR-trained model can learn how to efficiently backtrack from dead ends using only outcome reward. This leads to an exponential separation in inference-time compute between the two methods, and demonstrates that RLVR leads the model to learn the location of difficult decisions in a reasoning chain, ultimately allowing for better allocation of inference-time compute. Finally, we show that the reasoning traces of an RLVR model can be distilled to train a base model to backtrack efficiently as well.

[AI-62] When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

Link: https://arxiv.org/abs/2606.22936
Authors: Aman Mehta
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 16 figures. Primary: cs.AI; Secondary: cs.CL

Abstract:Long-horizon LLM agents can fail quietly: they settle on one reading of the evidence early, then spend the rest of the run defending it. We call this premature commitment. Final-answer scoring misses the failure mode because it sees only the answer, not whether the process has already collapsed to a stable path. We define representational commitment as cross-run hidden-state convergence at a fixed reasoning step, and use it as an early diagnostic of trajectory consistency. On Llama-3.1-70B running ReAct on HotpotQA, step-4 hidden-state similarity predicts downstream behavioral consistency (r = -0.35, partial r = -0.45), with a localized temporal and layer-wise signature. The signal replicates across Qwen-2.5-72B and Phi-3-14B, and on StrategyQA (r = -0.83). It does not track correctness: committed-wrong and committed-correct questions are not separable in activation similarity. That boundary is central to the claim. Commitment tells us whether an agent has settled, not whether it is right. A runtime monitor detects inconsistent trajectories from hidden states at AUROC up to 0.97 (0.85–0.88 under a stricter split), and a prompting intervention cuts behavioral variance by 28% against a token-matched control while leaving accuracy statistically unchanged. We also test whether the signal can route self-consistency compute; on a harder benchmark it helps only modestly and is matched by a simpler output-based baseline. The result is a diagnostic for a hidden process failure, with clear limits rather than a general accuracy lever.
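
One plausible reading of "cross-run hidden-state convergence at a fixed reasoning step" is the mean pairwise cosine similarity between the hidden states that independent runs produce at that step. The sketch below implements that reading only; the choice of cosine similarity, the averaging, and the function names are assumptions, not the paper's stated metric.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def commitment_score(hidden_states):
    """Mean pairwise cosine similarity across runs at one fixed step.

    hidden_states holds one vector per independent run of the same question,
    all taken at the same reasoning step and layer (hypothetical input format).
    """
    pairs = list(combinations(hidden_states, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A score near 1 would mean the runs have converged to nearly the same internal state by that step. In line with the abstract, such a score would say whether the agent has settled, not whether it is right.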

[AI-63] Hierarchical Reinforcement Learning for Sparse-Reward Search in Commutative Algebra ICML2026

Link: https://arxiv.org/abs/2606.22922
Authors: Giorgi Butbaia, Paul Orland, Coco Huang, Davide Passaro, Lucas Fagan, Michele Tarquini, Hailong Dao, David Eisenbud, Ali Shehper, Sergei Gukov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Commutative Algebra (math.AC); Combinatorics (math.CO)
Comments: 21 pages, 15 figures, 3 tables. Accepted at ICML 2026

Abstract:Applying machine learning techniques to solving long-standing mathematical conjectures can be particularly challenging due to their extreme reward sparsity. As an illustrative example, we consider Kalai’s algebraic Hirsch conjecture and recast the construction of its counterexamples as a sparse-reward reinforcement learning problem on graphs. We propose a constrained options-based HRL framework with an equivariant graph neural network policy, which allows us to learn useful temporal abstractions for this task. We evaluate our approach over a wide range of degrees and demonstrate that it consistently outperforms classical RL algorithms as well as greedy search. By exploiting the hierarchical structure of the problem, we effectively provide a first-of-its-kind application of HRL to a problem in commutative algebra.

[AI-64] Intent-Governed Tool Authorization for AI Agents

Link: https://arxiv.org/abs/2606.22916
Authors: Genliang Zhu, Chu Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI agents increasingly act through external tools: they read private data, construct structured payloads, submit write requests, export records, and coordinate workflows across application boundaries. Existing authorization mechanisms usually ask whether an integration credential, app, or token can call a tool. That question is necessary but incomplete. A tool call can be authorized by static credentials and still be unjustified by the user’s current request. For example, a credential that can read and export records should not expose export authority when the user only asked for a bounded summary, and a model-generated delete call should not execute merely because the integration has a delete scope. This paper proposes Intent-Governed Access Control (IGAC), a server-side authorization layer that treats the user’s expressed intent as a monotone, auditable policy attribute for AI-agent tool use. IGAC introduces intent certificates, session-scoped policy narrowing, intent-aware manifest filtering, and intent-tool-payload consistency checks. The central invariant is that user intent may only reduce the authority granted by static integration policy; it never expands scopes, data policy, tenant boundaries, or review requirements. We map IGAC onto OpenPort, an existing governance substrate that already implements authorization-dependent discovery, scope and ABAC-style policy checks, draft-first writes, preflight impact binding, state-witness checks, idempotency, stable reason codes, and audit.
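
The central invariant (intent may only narrow what static policy already grants) can be sketched as a set intersection. Everything named below, including the scope strings, the `Decision` type, and the reason codes, is hypothetical and chosen for illustration; it is not the IGAC or OpenPort API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str  # hypothetical reason code, for audit logs


def effective_scopes(static_scopes, intent_scopes):
    """Intersect the two sets, so intent can remove scopes but never add them."""
    return frozenset(static_scopes) & frozenset(intent_scopes)


def authorize(tool_scope, static_scopes, intent_scopes):
    """Allow a tool call only if both static policy and user intent cover it."""
    if tool_scope not in static_scopes:
        return Decision(False, "not_in_static_policy")
    if tool_scope not in effective_scopes(static_scopes, intent_scopes):
        return Decision(False, "not_justified_by_intent")
    return Decision(True, "ok")
```

With a credential that holds both `records:read` and `records:export`, a request whose intent only covers `records:read` would be denied the export call, which mirrors the bounded-summary example in the abstract. Because the result is an intersection, an intent that names a scope outside the static policy gains nothing.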

[AI-65] ThermoLLM: Thermodynamics-Aware HVAC Control with Spatial-Semantic Knowledge Graph

Link: https://arxiv.org/abs/2606.22911
Authors: Kirtan Bhatt, Xiachong Lin, Matthew Amos, Flora D. Salim, Wen Hu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 10 pages, 5 figures. Submitted to ACM SIGSPATIAL 2026

Abstract:Multi-zone HVAC control is a spatial decision problem in which indoor thermal evolution and control decisions depend not only on outdoor conditions and internal heat gains but also on zone layout, physical adjacency, and delayed thermal interactions across the building. Recent LLM-based HVAC controllers have shown that prompt-based control is feasible. However, these methods typically rely on task descriptions, observation values, short textual feedback, or unstructured retrieval, which limits their ability to reason about zone coupling, thermal response, and building dynamics. This paper presents a thermodynamics-aware LLM control framework for a five-zone EnergyPlus building simulation. The controller is grounded in a physics-informed spatial knowledge graph derived from Brick-style building semantics and linked with recent interaction history. At each control step, the model receives the current building state, graph-structured spatial context, and recent environment-controller history, enabling it to make decisions that reflect both building structure and short-term thermal evolution. We evaluate the framework against standard control baselines and several LLM-based alternatives. Results show that the proposed approach achieves the best overall energy-comfort trade-off and the lowest PMV violation while maintaining energy-efficient operation.

[AI-66] From Fragments to Paths: Task-Level Context Recovery for Large Industrial Codebases

Link: https://arxiv.org/abs/2606.22906
Authors: Jiawei He, Weisong Sun, Mengyu Shi, Jie Jia, Tong Bian, Xikai Yang, Dong Sun
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures

Abstract:Large language models have shown strong performance on software engineering (SE) tasks, yet understanding large industrial repositories remains challenging. Existing methods often retrieve only local fragments and fail to recover the broader task-relevant context needed for complex repository-level tasks. We present DeepDiscovery, a task-level repository-understanding method for large industrial codebases. DeepDiscovery uses a two-stage Location–Inference framework to localize high-confidence task anchors and recover broader task-relevant context over multi-relational repository structure under budget constraints. Across controlled method-level evaluation, organization-internal industrial repository-understanding scenarios, and end-to-end evaluation on SWE-bench Verified, DeepDiscovery consistently improves task-relevant file recovery and downstream SE performance. On 27 medium-scale tasks, DeepDiscovery achieves the best file recovery quality among five representative baselines without offline preprocessing. On organization-internal industrial tasks from a production-scale integrated codebase ecosystem, including 27 medium-scale tasks and 40 large-scale tasks, DeepDiscovery improves Full Recall Rate across multiple AI coding systems, with absolute gains ranging from 1.6 to 9.2 percentage points on large subprojects and from 2.5 to 7.4 percentage points on medium-scale subprojects. In a controlled end-to-end evaluation on SWE-bench Verified, a system equipped with DeepDiscovery achieves a 78.6% Solve Rate, outperforming the corresponding baseline by 8.2 percentage points. These results suggest that stronger task-level repository understanding can improve coding-agent performance on complex SE tasks.

[AI-67] Agent-as-a-Router: Agentic Model Routing for Coding Tasks

Link: https://arxiv.org/abs/2606.22902
Authors: Pengfei Zhou, Zhiwei Tang, Yixing Ma, Jiasheng Tang, Yizeng Han, Zhenglin Wan, Fanqing Meng, Wei Wang, Bohan Zhuang, Wangbo Zhao, Yang You
Categories: Artificial Intelligence (cs.AI)
Comments: 39 pages, 21 figures, a living technical report with a living benchmark that continuously updates

Abstract:Real-world users typically have access to multiple Large Language Models (LLMs) from different providers, and these LLMs often excel at distinct domains, yet none dominate all. Consequently, routing each task to the most suitable model becomes critical for both performance and cost. Existing routers treat this as a static, one-off classification problem. However, we identify the performance bottleneck for these routers as information deficit: simply augmenting a vanilla LLM router with performance statistics at the task-dimension level yields a 15.3% relative gain, surpassing a heuristic router built on the same dimension-level priors. Motivated by this finding, we propose Agent-as-a-Router, a framework that formalizes routing as a C-A-F loop (Context-Action-Feedback-Context). It closes the information gap by accumulating execution-grounded experience during deployment. We instantiate this framework as ACRouter, composed of an Orchestrator, a Verifier, and a Memory module, and introduce CodeRouterBench, an evaluation environment comprising ~10K task instances with verified scores from 8 frontier LLMs, enabling regret-based router comparison on streaming tasks. Experiments show that ACRouter achieves the lowest cumulative regret on in-distribution tasks and generalizes to out-of-distribution agentic-programming tasks, demonstrating that our routing framework actively closes the information gap. Code and benchmarks are released at this https URL.
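
Regret-based comparison on a task stream can be sketched as follows, assuming the usual definition: per-task regret is the best available model's score minus the score of the model the router picked, accumulated over the stream. The data layout and names are hypothetical, and the benchmark's exact regret definition may differ.

```python
def cumulative_regret(task_scores, choices):
    """Return the running regret curve of a router over a stream of tasks.

    task_scores: one dict per task mapping model name -> verified score.
    choices: the model name the router picked for each task, in order.
    """
    total = 0.0
    curve = []
    for scores, chosen in zip(task_scores, choices):
        # Regret is zero when the router picked a best-scoring model.
        total += max(scores.values()) - scores[chosen]
        curve.append(total)
    return curve
```

For a two-task stream with scores `{"a": 1.0, "b": 0.0}` and `{"a": 0.5, "b": 1.0}`, a router that always picks `"a"` accumulates regret `[0.0, 0.5]`. A router that learns from execution feedback should flatten this curve over time.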

[AI-68] CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Link: https://arxiv.org/abs/2606.22883
Authors: Zhanbo Hua, Yifan Yao, Weihao Xie, Yongchi Zhao, Minghao Liu, Ruizhi Qiu, Zhewei Huang, Zun Wang, Yiyan Ji, Yunhai Ye, Letian Zhu, Xinping Lei, Han Li, Zhiyuan Ma, Zili Wang, Zhaoxiang Zhang, Jiaheng Liu
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures, 3 tables. Preprint

Abstract:While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.

[AI-69] Priority-Aware Learning-Unlearning Correction for Dynamic Decentralized LoRA Fine-Tuning

Link: https://arxiv.org/abs/2606.22878
Authors: Nuocheng Yang, Yechen He, Sihua Wang, Zihan Chen, Tony Q. S. Quek, Changchuan Yin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures

Abstract:As large language models (LLMs) are increasingly deployed at the network edge to provide pervasive generative AI services, decentralized federated learning (DFL) provides a vital mechanism for privacy-preserving, domain-specific fine-tuning through peer-to-peer exchanges of parameter-efficient updates. However, the dynamic nature of practical decentralized edge networks, where devices may dynamically join or leave the collaborative training process, requires the system to continuously adapt to new data while selectively removing prior contributions. This correction process remains a significant bottleneck, as individual device updates become deeply entangled within the global fine-tuned parameters. To address this challenge, we propose a priority-aware learning-unlearning correction framework based on orthogonal LoRA that can enhance the knowledge evaluation through topology adjustment. Specifically, we first design an orthogonal LoRA mechanism that yields post-training contribution coordinates, enabling history-free projection addition and deletion in response to membership changes. We then analyze the correction bottleneck and develop a priority-aware policy that selects among topology refinement, local correction, proximal damping, and synchronization scheduling according to the dominant residual term. A resource allocation algorithm is further developed to allocate limited communication across layer groups, prioritizing the primary bottlenecks within per-round wireless constraints. Experiments demonstrate that the proposed framework achieves robust post-event correction for both device join and leave events and validate that different residual regimes necessitate distinct correction actions.

[AI-70] SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

Link: https://arxiv.org/abs/2606.22874
Authors: Huzama Ahmad, Se-Young Yun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 10 figures, 9 tables

Abstract:Long contexts have become standard in pretrained LLMs, yet they remain expensive to run: prefill compute grows quadratically with sequence length, and every decode step re-reads a key-value cache that grows linearly with it. Sparse attention cuts these costs by attending only to a relevant subset of past tokens, but selecting that subset is itself expensive. We present SpotAttention, a lightweight selector that attaches to a frozen pretrained transformer and learns by KL distillation to estimate its attention distribution. The selector picks the top-K keys each query attends to, and because its estimate is a calibrated distribution, a dual top-p rule reads the per-query, per-layer budget directly from it. Across Qwen3 (dense, 4B-32B) and Qwen3.5 (hybrid linear/full attention, 4B-9B), SpotAttention matches dense accuracy at contexts up to 128K tokens, eight times the training length. Decode at L=128K runs 3.9x faster than FlashAttention and 1.8x faster than Twilight, the strongest training-free baseline. Quantizing the selector’s K-cache to INT4 or FP4 microscale shrinks it 3.5x at no accuracy cost.
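The idea of reading a per-query budget from an estimated attention distribution can be illustrated with an ordinary top-p selection. This is a minimal sketch under stated assumptions: the helper name `top_p_budget` and the example distributions are hypothetical, and the paper's "dual" top-p rule is not specified in the abstract, so this shows plain top-p only.

```python
def top_p_budget(probs, p=0.85):
    """Return the indices of the smallest set of keys whose estimated
    attention mass reaches p (hypothetical helper, not from the paper)."""
    # Visit keys from highest to lowest estimated attention probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    picked, mass = [], 0.0
    for i in order:
        picked.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return picked


# A peaked estimate needs few keys; a flat estimate needs many.
peaked = [0.70, 0.20, 0.05, 0.03, 0.02]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]
print(len(top_p_budget(peaked)), len(top_p_budget(flat)))
```

The budget adapts per query: a query whose estimated attention is concentrated on a few keys selects a small set, while a diffuse query selects more.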

[AI-71] Discovering Crystal Structure Prediction Algorithms with an AI Co-Scientist

Link: https://arxiv.org/abs/2606.22866
Authors: Kiyoung Seong, Nayoung Kim, Sungsoo Ahn
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce the Human-AI Co-discovery system (HACO) for scientific algorithm discovery through cross-domain search and sparse human steering. Starting from the goal of generating crystal structures from chemical compositions, HACO searched across generative modeling methodologies from multiple fields and identified MaskGIT, a masked generative model from vision, as a promising framework for crystal structure prediction (CSP). HACO instantiated this masked formulation as a discrete token model of crystal structure; guided by sparse high-level human objectives, it then added crystallographic symmetry tokens, space group stratified sampling for polymorph coverage, and sub-bin coordinate refinement, yielding the Masked Generative Crystal Transformer (MaskGXT). On the MP-20 polymorph split, MaskGXT reaches 79.06% match-everyone-to-reference (METRe) accuracy, compared with 70.87% for the strongest evaluated baseline. MaskGXT also attains the best match rate on standard MP-20 and MPTS-52 CSP benchmarks. These results provide evidence that, in domains offering cheap, fast, and well-aligned validation, transfer-guided interactive AI co-scientists can contribute to scientific algorithm discovery by identifying transferable modeling principles and combining them with targeted human domain guidance.

[AI-72] AI Scientists as Engines of Discovery: A Case for Development within Reformed Institutions

Link: https://arxiv.org/abs/2606.22859
Authors: Raul Jimenez, Boris Bolliet, Francisco Villaescusa-Navarro, Rabih Zbib, Benjamin Wandelt, David N. Spergel, Thomas Meier, Jessica Montgomery, Hana Aliee, Licia Verde
Categories: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM); Physics and Society (physics.soc-ph)

Abstract:Agentic artificial intelligence (AI) systems are beginning to assist, accelerate, and partially automate scientific discovery, performing tasks that span literature synthesis, code generation, data analysis, hypothesis proposal, and model criticism. We argue that this transition is qualitative rather than incremental, and that suitably designed multi-agent systems may evolve from passive computational tools into "AI scientists" that can expand the hypothesis-generating and verification capacity of science. Such systems must be developed and deployed within a scientific ecosystem fit for purpose: institutions must be redesigned for verification, accountability, interpretability, and dual-use safety. We sketch how multi-agent architectures, illustrated by the prototype framework Denario, accelerate the discovery cycle and traverse model spaces beyond human reach; examine what this implies for authorship, peer review, and the enduring role of human scientists; and close with recommendations for governing AI as an epistemic actor rather than a mere instrument.

[AI-73] The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks NEURIPS

Link: https://arxiv.org/abs/2606.22858
Authors: Sannaan Khan, Muhammad U. S. Khan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS Workshops 2025

Abstract:As machine learning models grow more influential and opaque, algorithmic fairness and explainability are critical for ensuring accountability. However, we demonstrate that these auditing mechanisms are themselves vulnerable to subtle manipulation, camouflaging the influence of protected features. While prior work on data-agnostic attacks has exposed this vulnerability, those attacks leave behind detectable artifacts that compromise their stealth. We introduce Targeted Identity Re-Association (TIRA) attacks, a novel family of attacks that iteratively and probabilistically manipulate a model’s outputs without requiring access to the model’s internals or feature representations. We formalize two algorithms: Probabilistic Micro-Shuffling (PMiS), which applies localized adjacent swaps, and Probabilistic Rank-Shift Micro-Perturbation (PRSMP), which introduces small, randomized rank shifts. We empirically demonstrate that TIRA attacks are highly effective at pushing fairness metrics towards ideal values. Crucially, TIRA attacks successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features, a major improvement over prior work.

[AI-74] CLIP-guided Diffusion Model for Backdoor Generation in Sensor-based Human Activity Recognition

Link: https://arxiv.org/abs/2606.22837
Authors: Toby Briston, Illya Kosyk, Kuniyih S
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Sensors are critical components of modern intelligent devices. The proliferation of the Internet of Things (IoT) and wearable mobile devices has enabled the integration of such sensors to monitor the environment and enable users to take predictive actions. Human activity recognition (HAR) is a popular application in which Inertial Measurement Unit (IMU)-based sensors, such as accelerometers and gyroscopes, are used to provide insights into health, training, and medical diagnosis. However, the accuracy of such a model is hindered by the lack of data. The diffusion model-based technique has proven successful in generating synthetic data for training HAR models. In this paper, we propose a backdoor training technique, IMU-DM-CLIP, that leverages a diffusion model to enable trigger-based attacks on HAR models. Our empirical analysis shows that the attack is successful even with a very small backdoor injection rate of 10% and 10% of the data guided for the diffusion model.

[AI-75] Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation

Link: https://arxiv.org/abs/2606.22830
Authors: Jinwei Xiao, Zhuowen Han, Yueqing Sun, Zhengxi Lu, Yuxin Liu, Zhiyuan Yao, Wentao Chen, Qi Gu, Xunliang Cai
Categories: Artificial Intelligence (cs.AI)

Abstract:On-policy distillation transfers reasoning ability through dense token-level supervision, yet the nature of the transferable signal remains unclear. We discover that reasoning chains contain two types of knowledge that require different discovery mechanisms: decisions (where to branch), which surface through student uncertainty, and evidence (intermediate steps that justify decisions), which hides in positions where the student is confident yet wrong. Current methods capture only decisions; the substantive knowledge in evidence tokens remains untransferred. We propose DEAR (Decision-Evidence Aware Reasoning Distillation), which first identifies decisions via student entropy, then discovers their supporting evidence through hidden-state cosine similarity to decision anchors, boosted by teacher-student divergence to prioritize the largest knowledge gaps. Across three student-teacher configurations on math and code benchmarks, DEAR consistently outperforms standard OPD, with up to +2.5pp on competition math and +5.7pp on code generation.

[AI-76] MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration

Link: https://arxiv.org/abs/2606.22826
Authors: Devleena Das, Rajeev Patwari, Vikram Kumar Bukka, Nithin Kumar Guggilla, Elliott Delaye, Ashish Sirasao
Categories: Artificial Intelligence (cs.AI)

Abstract:Evaluating LLMs across many model variants – quantized, fine-tuned, or deployment-specific – requires running large benchmarks repeatedly, a process that can take tens of hours per model on edge hardware such as NPUs. Existing subset selection methods reduce this cost but depend on large calibration pools or learned prediction layers. We introduce MINCE (Monte Carlo Informed N-sizing for Compact Evaluation), which uses Monte Carlo simulation over per-item logs from a small set of calibration models to find the minimum subset size that bounds accuracy drift and then fixes a randomly sampled subset at that size, with no prediction layer needed. MINCE reduces IFEVAL by 54%, MMLU by 89%, and GSM8K by 70% with maximum drift ≤ 2.62 pp on BF16 models and mean drift of 0.77–3.59 pp on held-out NPU models, while delivering median GPU evaluation speedups of 2.7–8.1× and NPU evaluation speedups of 1.7–2.0×. The method is robust to calibration pool size and achieves lower drift than tinyBenchmarks (12× lower on MMLU, 3.3× on GSM8K) while using 57× fewer calibration models.
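The core loop described here, Monte Carlo sampling over per-item logs to find the smallest subset size whose accuracy drift stays bounded, might be sketched as follows. This is an illustration only: the function names, the worst-case drift criterion, and the synthetic logs are assumptions, and the paper may use a different drift statistic or stopping rule.

```python
import random


def accuracy(log, idx):
    """Mean correctness of one model over the item indices in idx."""
    return sum(log[i] for i in idx) / len(idx)


def min_subset_size(logs, sizes, max_drift_pp=3.0, trials=200, seed=0):
    """logs: one list of 0/1 per-item results per calibration model.
    Returns the smallest size in `sizes` whose worst observed drift
    (in percentage points, over all trials and models) stays within
    max_drift_pp, or None if no size qualifies. Hypothetical helper."""
    rng = random.Random(seed)
    n_items = len(logs[0])
    full = [accuracy(log, range(n_items)) for log in logs]
    for n in sorted(sizes):
        worst = 0.0
        for _ in range(trials):
            idx = rng.sample(range(n_items), n)  # random subset, no repeats
            for log, acc in zip(logs, full):
                worst = max(worst, abs(accuracy(log, idx) - acc) * 100)
        if worst <= max_drift_pp:
            return n
    return None


# Synthetic per-item logs for three calibration models (made-up data).
rng = random.Random(1)
logs = [[1 if rng.random() < p else 0 for _ in range(2000)]
        for p in (0.55, 0.70, 0.85)]
print(min_subset_size(logs, sizes=[100, 250, 500, 1000, 2000]))
```

Once a size is chosen, a single random subset of that size would be fixed and reused for every later model, which is what removes the need for a learned prediction layer.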

[AI-77] Active Inference as the Test-Time Scaling Law for Physical AI Agents

Link: https://arxiv.org/abs/2606.22813
Authors: Omar Hashash, Christo Kurisummoottil Thomas, Walid Saad, Merouane Debbah, Karl Friston, Adeel Razi
Categories: Artificial Intelligence (cs.AI)
Comments: 53 pages, 13 figures

Abstract:In this paper, a novel test-time scaling law for physical artificial intelligence (AI) agents is introduced. This scaling law enables physical AI agents to reason with their world models to generalize in unforeseen scenarios at test time. The derived scaling law is grounded in the first principle of active inference, which equips agents with the general objective to survive in the real world, under which their specific task objectives are subsumed. Active inference achieves this by providing the reasoning to resolve prediction errors that arise when the agent encounters unforeseen situations outside its training distribution, enabling generalization in non-stationary environments. The proposed scaling law captures this by dynamically updating the agent’s policy with this reasoning at test time. This policy update is modeled as a soft Bayesian inference process in which beliefs about the policy are updated using the reasoning that reduces expected prediction errors under allowable policies as a likelihood. The resulting posterior policy admits a biological interpretation, recovering the scaling mechanism that engages the brain’s basal ganglia and prefrontal cortex at test time. To solve this analytically intractable problem, a variational inference solution minimizing free energy bounds is developed. This solution extends to enable learning beyond training by reinforcing new instances, resolved at test time, in both the policy and world model. Unlike existing scaling laws constrained by model size and training data, the derived solution scales with the continuous real-world experience of a physical AI agent. Simulation results on an autonomous driving task demonstrate that the proposed solution outperforms model-free Q-learning and model-based Bayesian reinforcement learning, achieving robust generalization to unforeseen scenarios while improving inference efficiency by over 36%.

[AI-78] AI-Assisted Help-Seeking Trajectories in Programming Education from an SRL-Informed Perspective

Link: https://arxiv.org/abs/2606.22809
Authors: Boxuan Ma, Huiyong Li, Gen Li, Li Chen, Atsushi Shimada, Shin’ichi Konomi
Categories: Artificial Intelligence (cs.AI)

Abstract:Generative AI tools provide novice programmers with instant, personalized support, but also raise concerns about whether AI use supports or bypasses students’ regulation of problem-solving. Existing work has largely focused on correctness, usability, or overall usage frequency, with less attention to how student–AI help-seeking unfolds. This study addresses this gap by analyzing AI-assisted help-seeking trajectories in university-level programming. Using an SRL-informed analytical framework that links prompt-level help-seeking codes to conceptual, implementation, debugging, and reflective forms of support, we analyzed 1,290 task-specific student prompts linked to 17,190 code submissions from 71 students in introductory Python programming courses. Specifically, we examined how help-seeking interactions were structured across turns and attempts, and how trajectory patterns related to task scores and the number of code submissions. Results indicate that many students primarily used AI for reactive troubleshooting rather than for planned, self-regulated problem-solving. Although trajectory patterns were not associated with significant differences in task scores, they differed substantially in the number of code submissions required. These findings suggest that the educational significance of AI support lies not only in whether students use AI, but in how their help-seeking trajectories develop during programming problem-solving.

[AI-79] Measuring Behavior Portability in Large Language Models

Link: https://arxiv.org/abs/2606.22797
Authors: Tianjia Dong, Nadav Kunievsky, James A. Evans
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)

Abstract:Large language models are increasingly deployed as autonomous decision makers, yet the behavioral mapping they exhibit can vary substantially across decision environments that are payoff-equivalent by construction, that is, environments that share identical payoff-relevant structure but differ in surface presentation. This sensitivity renders suite-based evaluation fragile and raises a fundamental question of behavioral portability: how informative is a behavioral mapping learned in one decision environment about another that preserves the same underlying incentive structure? We introduce a formal framework to measure this property. Our protocol fits an interpretable behavioral model on data pooled from a set of source environments and evaluates its out-of-sample predictive performance in a held-out target environment, benchmarking against an oracle trained directly on target data. Portability is quantified via a loss-agnostic measure that delivers worst-case bounds on the performance of the induced prediction-action mapping in the target environment. In controlled experiments spanning seven canonical economic decision problems, we document substantial and systematic portability losses, suggesting that behavioral characterizations of LLMs obtained in one decision environment cannot be assumed to transfer reliably to structurally equivalent alternatives.

[AI-80] A Formula-Driven Survey and Research Agenda for On-Policy Distillation

Link: https://arxiv.org/abs/2606.22793
Authors: Bowen Zhang
Categories: Artificial Intelligence (cs.AI)

Abstract:On-policy distillation (OPD) trains an LLM on states induced by the current or recent student policy: the student generates complete or partial rollouts, a teacher or self-teacher scores the resulting tokens under their generated contexts, and dense log-probability, logit, or distributional signals are converted into post-training updates. This survey studies OPD as a feedback-to-update problem rather than a single loss family. We develop a formula-driven taxonomy from two routes – direct distributional losses and policy-gradient-style log-ratio updates – and use it to organize core methods, verifier- or outcome-guided hybrids, industrial reports, framework implementations, failure modes, and stabilization recipes under explicit evidence boundaries. The taxonomy shows that OPD effectiveness depends not only on KL direction or teacher access, but also on state compatibility, support construction, temporal credit, vocabulary-level probability routing, gates and weights, and regularization. We further separate two mechanisms often conflated in sampled-token OPD stability discussions. Temporal credit asks how teacher-student log-ratio returns should weight sampled actions across a rollout; vocabulary routing asks where probability mass should move when negative feedback suppresses a sampled token. This distinction yields bias boundaries for immediate, return-to-go, discounted, and baseline-corrected estimators, motivates GAE-OPD as a value-based hypothesis for log-ratio returns, and motivates Counterfactual Routed OPD (CR-OPD) for routing probability mass toward teacher-supported, student-reachable alternatives. We close by mapping actionability diagnostics, failure mechanisms, case studies, open problems, and a reporting checklist onto the same feedback-to-update variables.
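The temporal-credit question, how teacher-student log-ratios should weight sampled tokens across a rollout, can be made concrete with the three simplest estimators the abstract names. This is a rough sketch under assumptions: the function name, the sign convention for the log-ratio, and the example values are hypothetical, and baseline-corrected and GAE-style variants are omitted.

```python
def credit_weights(log_ratios, mode="return_to_go", gamma=0.9):
    """Per-token weights from teacher-student log-ratios.

    log_ratios[t] is assumed to be
    log p_teacher(token_t) - log p_student(token_t);
    this sign convention is an assumption, not taken from the survey.
    """
    if mode == "immediate":
        # Each token is credited only with its own log-ratio.
        return list(log_ratios)
    if mode == "return_to_go":
        discount = 1.0
    elif mode == "discounted":
        discount = gamma
    else:
        raise ValueError(f"unknown mode: {mode}")
    weights = [0.0] * len(log_ratios)
    running = 0.0
    # Accumulate from the end of the rollout back to the start.
    for t in reversed(range(len(log_ratios))):
        running = log_ratios[t] + discount * running
        weights[t] = running
    return weights


ratios = [1.0, 2.0, 3.0]
print(credit_weights(ratios, "immediate"))
print(credit_weights(ratios, "return_to_go"))
print(credit_weights(ratios, "discounted", gamma=0.5))
```

The three modes differ only in how much later tokens influence the weight of an earlier token, which is the axis along which the survey discusses bias boundaries.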

[AI-81] The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models

Link: https://arxiv.org/abs/2606.22792
Authors: Xiang-Jun Ou, Shuang Liang, Xin-Yu Hu, Rong-Hao Huang, Jing Wang, Shao-Qun Zhang
Categories: Artificial Intelligence (cs.AI)

Abstract:Recent advancements in Large Language Models (LLMs) have enabled sophisticated reasoning and content generation, yet their inherent stochasticity poses significant challenges for ensuring predictive credibility. While traditional uncertainty taxonomy paradigms, such as the dichotomy of aleatoric and epistemic uncertainties, provide conceptual foundations, they often fail to capture the multi-component and multi-stage nature of LLM generation and struggle to evaluate the effectiveness of various Uncertainty Quantification (UQ) methods. In this paper, we propose a granular uncertainty taxonomy that systematically attributes LLM uncertainty to input-level, parameter-level, token-level, and decoding-process sources. Correspondingly, we categorize existing UQ methods into Bayesian, ensemble, consensus-based, and single-pass approaches. Furthermore, we introduce a comprehensive evaluation framework covering diverse generation settings and metrics. We empirically evaluate 21 typical UQ methods across three prominent LLM families, including Qwen3, Llama 3.2, and DeepSeek-V3, on benchmarks such as TriviaQA, GSM8K, and HumanEval. Our experimental results demonstrate that (i) the effectiveness of UQ methods is sensitive to task types and generation settings; (ii) consensus-based methods, typified by Deg and EigV, consistently outperform other UQ approaches; and (iii) larger model scales correlate with lower uncertainty estimates, suggesting an empirical scaling law for LLM uncertainty. This work bridges the gap between theoretical origins and practical deployment, providing a versatile diagnostic tool for systematically quantifying uncertainty in LLM applications.

[AI-82] Scaling Audio Models Efficiently: A Joint Study of Compute Constraints and Optimization Behavior

Link: https://arxiv.org/abs/2606.22790
Authors: Vyom Agarwal, Mokshda Gangrade, Siddharth Pal, Jerry Wu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)

Abstract:In this paper, we investigate the tradeoffs between compute allocation and model performance for two speech processing tasks: Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). We propose a unified framework that analyzes three fundamental compute dimensions: model size (x_N), input length (x_T), and representation resolution (x_V). Motivated by recent advances in compute optimal scaling for multimodal models, we systematically vary these dimensions to examine their influence on task performance under fixed computational budgets. Our study provides insights into how compute resources can be optimally distributed across model capacity, temporal context, and representational granularity, offering practical guidelines for the design of efficient speech models. Through experiments on LibriSpeech and CREMA-D datasets, we demonstrate non-linear scaling behavior and identify optimal operating points. Our results show that (1) increasing model size yields diminishing returns: scaling Tiny (39M) to Small (244M) reduces WER by 8.22%, whereas Small to Medium (769M) reduces WER by only 2.35%; (2) an optimal audio duration of approximately 4 seconds exists for SER; and (3) reducing encoder token resolution is an effective way to lower inference cost: Large-v3 (1540M) requires 2572 GFLOPS with 750 frames versus 5228 GFLOPS with 1500 frames, with less than 3% relative increase in WER. Additionally, LoRA-based adaptation enables efficient finetuning with minimal performance degradation.
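The figures quoted in this abstract can be turned into two quick ratios: the compute saved by halving the frame count, and the WER reduction obtained per 100M added parameters. The helpers below are hypothetical and only rearrange numbers stated in the abstract; whether the WER reductions are absolute or relative is not stated, so the second ratio is indicative only.

```python
def relative_saving(reduced, baseline):
    """Fraction of the baseline cost that is saved."""
    return 1 - reduced / baseline


def gain_per_100m(wer_reduction, params_from_m, params_to_m):
    """WER reduction per 100M additional parameters (sizes in millions)."""
    return wer_reduction / (params_to_m - params_from_m) * 100


# Large-v3: 2572 GFLOPS at 750 frames versus 5228 GFLOPS at 1500 frames.
print(f"compute saving: {relative_saving(2572, 5228):.1%}")

# Tiny (39M) -> Small (244M) versus Small (244M) -> Medium (769M).
print(f"Tiny->Small:   {gain_per_100m(8.22, 39, 244):.2f} per 100M params")
print(f"Small->Medium: {gain_per_100m(2.35, 244, 769):.2f} per 100M params")
```

Halving the frames saves roughly half the compute, and the per-parameter return from Small to Medium is about an order of magnitude lower than from Tiny to Small, consistent with the diminishing returns the abstract reports.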

[AI-83] Learning Filters with Certainty

Link: https://arxiv.org/abs/2606.22786
Authors: Yuval Banoun, Daniel Sadoc Menasche, Ori Rottenstreich
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

Abstract:Hash-based data structures such as Bloom filters are widely used in network systems for tasks including caching, anomaly detection, and machine learning pipelines. They typically provide binary indications of whether an element belongs to a set of interest, e.g., the contents of a cache. When uncertainty arises due to hash collisions, a positive indication is returned to avoid false negatives. We argue that the certainty associated with such indications can itself be useful information. This work focuses on Counting Bloom Filters (CBFs), a Bloom-filter variant that maintains counters rather than bits. Besides supporting insertions and deletions, these counters provide additional information that can be used to estimate the certainty of positive membership indications. We show how this certainty signal can be exploited in architectures that combine Bloom Filters with machine learning (ML) models.
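A Counting Bloom Filter of the kind discussed here can be sketched in a few lines. The class below is a generic textbook-style CBF, not the authors' implementation; the salted SHA-256 hashing, the default sizes, and the `min_count` method are assumptions chosen for illustration. The abstract does not say how counters are mapped to a certainty value, so `min_count` only exposes one raw signal such a mapping could start from.

```python
import hashlib


class CountingBloomFilter:
    """Bloom-filter variant that keeps a counter per slot instead of a bit."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _slots(self, item):
        # Derive num_hashes slot indices from salted SHA-256 digests.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for slot in self._slots(item):
            self.counters[slot] += 1

    def remove(self, item):
        # Only meaningful for items that were actually added earlier.
        if item in self:
            for slot in self._slots(item):
                self.counters[slot] -= 1

    def __contains__(self, item):
        # Positive answers may be false positives; negatives are exact.
        return all(self.counters[slot] > 0 for slot in self._slots(item))

    def min_count(self, item):
        """Smallest counter among the item's slots: 0 means definitely
        absent; a positive value upper-bounds how often it was added."""
        return min(self.counters[slot] for slot in self._slots(item))


cbf = CountingBloomFilter()
cbf.add("object-42")
print("object-42" in cbf, cbf.min_count("object-42"))
cbf.remove("object-42")
print("object-42" in cbf)
```

Unlike a plain Bloom filter, the counters support deletion, and their values carry extra information beyond the yes/no answer, which is the signal the paper proposes to feed into ML models.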

[AI-84] Explainable AI for Mental Health Prediction in Drug-Affected Populations with Dragonfly Algorithm and GAN Oversampling

Link: https://arxiv.org/abs/2606.22780
Authors: Ahnaf Atef Choudhury, Shahriar Siddique Ayon, Md. Ebrahim Hossain, Abdullah Al Mamun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for presenting at the 16th International Conference on Advanced Computer Information Technologies (ACIT 2026). To appear in IEEE conference proceedings

Abstract:Mental illness among drug users is a growing international problem, particularly in regions where early detection is difficult. The current literature tends to overlook AI-based mental health analysis for drug users, and poor handling of class imbalance, low interpretability, and suboptimal hyperparameter tuning can reduce predictive quality and clinical utility. This study presents a detailed, explainable machine learning (ML) model for multiclass mental health prediction, using a multidimensional dataset of drug-affected individuals. We combine hybrid PCA-Information Gain (PCA-IG) feature selection, Generative Adversarial Network (GAN)-based oversampling, and Dragonfly Algorithm (DA)-optimized XGBoost to address some of the limitations of existing methods. The proposed framework handles high-dimensional categorical data, addresses class imbalance, and improves predictive performance through intelligent hyperparameter tuning. Experiments show that the DA-optimized XGBoost model, combined with GAN-based oversampling, achieves an accuracy of 94.17% and a weighted F1-score of 93.80%, outperforming the traditional and baseline models. Feature analysis shows that behavioral, lifestyle, and health factors, particularly sleep quality, physical health, and emotional regulation, are strongly predictive of mental health, whereas demographic factors have little impact. SHAP-based explainable AI provides easy-to-understand, instance-level explanations, enhancing interpretability and trust in models intended for clinical settings. The results indicate that this framework could yield valid mental health forecasting tools that facilitate early intervention and improve the treatment of drug-affected people.

[AI-85] GeoRouteNet: Geometry-Enhanced Non-Autoregressive Neural Solver for the Traveling Salesman Problem

Link: https://arxiv.org/abs/2606.22776
Authors: Xiang Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures, 9 tables

Abstract:The traveling salesman problem (TSP) is a canonical NP-hard combinatorial optimization benchmark that tests the representational capacity and generalization of neural solvers. While non-autoregressive (NAR) approaches offer parallel inference, they often lack sufficient geometric inductive bias and stable training signals, leading to degraded performance under cross-scale and cross-distribution shifts. We propose GeoRouteNet, a geometry-enhanced NAR neural solver for Euclidean TSP. On the model side, GeoRouteNet incorporates centered node features, learnable radial distance basis functions, distance-aware graph attention with explicit edge messaging, LayerNorm-SwiGLU feed-forward blocks, and cross-layer attentive residual mixing. On the training side, we design multi-candidate self-comparison reinforcement learning (MCS-RL), which samples multiple candidate tours per instance, constructs adaptive baselines from greedy and peer candidates, and adds winner-candidate guidance with annealed entropy regularization. On 10,000 random TSP50 instances, GeoRouteNet achieves a 0.32% optimality gap under Beam-1000 decoding. On TSP100, the gap is 1.26%. On 27 stratified TSPLIB EUC_2D instances, the overall gap drops from 17.12% (NAR4TSP reproduction) to 3.60%, while batch inference throughput substantially exceeds that of Concorde and LKH3. Ablation studies confirm that geometric structure enhancement and multi-candidate training are complementary: structure improvements dominate cross-distribution gains, while MCS-RL further stabilizes solution quality when paired with a strong geometric encoder.
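The optimality gaps quoted here are presumably the usual relative excess of a tour's length over a reference (optimal or best-known) length; the abstract does not define the metric, so that definition is an assumption. A small sketch with hypothetical helper names and a toy four-city instance:

```python
import math


def tour_length(points, tour):
    """Total Euclidean length of a closed tour over 2D points."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def optimality_gap(length, reference_length):
    """Relative excess over the reference length, in percent."""
    return (length - reference_length) / reference_length * 100


# Unit square: the perimeter tour is optimal, the crossing tour is not.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
best = tour_length(square, [0, 1, 2, 3])
crossing = tour_length(square, [0, 2, 1, 3])
print(f"gap of crossing tour: {optimality_gap(crossing, best):.2f}%")
```

Under this definition, a 0.32% gap on TSP50 would mean tours that are on average 0.32% longer than the reference solutions.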

[AI-86] Noise is Signal: Density-Based Outliers as Leading Indicators of Occupational Emergence in Labor Market Text

Link: https://arxiv.org/abs/2606.22769
Authors: Shreyash Rawat
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Standard NLP pipelines for occupational clustering discard the 10-15% of job postings that density-based methods assign to noise. We argue this is an error: in rapidly evolving domains, low posting density signals novelty, not incoherence. We formalize this as the Emergence-Density Inversion (EDI) hypothesis and test it longitudinally on 84,988 job postings across eight quarters (Q4 2022-Q3 2024). EDI is partially confirmed: high-EOS outlier groups transition to stable clusters in 1.4 ± 0.6 quarters vs. 4.1 ± 1.2 for low-EOS groups (p < 0.001), though the signal fails in approximately 19% of cases, which we characterize as a failure analysis. We extend the Emerging Occupation Score (EOS) with Temporal Velocity and Cross-Platform Convergence, improving 2-quarter cluster-formation prediction from F1 = 0.61 to 0.74, outperforming Isolation Forest, LOF, GLOSH, and BERTrend baselines. A retrospective study on three now-established roles (MLOps Engineer, DevOps/SRE, Data Engineer) confirms EOS signalled 2-3 quarters before cluster formation, providing held-out validation. A held-out annotator panel (kappa = 0.74) rates EOS 0.75 as coherent emerging occupations with 77% precision. Prompt Engineer, AI Safety Researcher, Foundation Model Engineer, and Agent Systems Engineer, all absent from O*NET, are top-4 in Q3 2024 and form stable clusters by Q1 2025.

[AI-87] Evolutionary Optimization Reveals Structural Constraints on Reservoir Architecture for Spatiotemporal Chaos

Link: https://arxiv.org/abs/2606.22765
Authors: Nima Dehghani
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)

Abstract:Biological systems maintain function in fluctuating environments by transforming past stimulation into internal dynamical states that support future-oriented responses. Reservoir computing provides a computational analogue, but standard formulations often treat the recurrent substrate as a fixed random network and train only the readout. Here we ask how the substrate itself changes when reservoir architecture is placed under evolutionary selection for prediction. Using the Kuramoto–Sivashinsky equation as a testbed for spatiotemporal chaos, we evolved reservoirs over five construction hyperparameters: size, connectivity degree, spectral radius, input scaling, and readout regularization. Evolution reduced prediction error at the population level, extended the low-error forecast horizon, and organized the design space along a diminishing-return size–efficiency frontier. Structural analyses showed that evolved reservoirs remained within a conserved stochastic-block-model-like spectral envelope while refining low-eigenvalue modes, locking modularity to an intermediate band, and pruning connection cost within that band. Pareto analysis showed that elite reservoirs occupied a horizontal floor in the cost–modularity plane, indicating that accuracy and efficiency were achieved jointly rather than through a simple trade-off. These findings show that evolutionary optimization does not merely improve prediction, but exposes interpretable structural constraints on the recurrent substrate: it stabilizes a task-suitable dynamical class and refines the architectural degrees of freedom most relevant for prediction. Evolutionary reservoir computing therefore provides a bio-inspired framework for studying how predictive demands shape adaptive dynamical networks.

[AI-88] Text Dictates Music Decorates: Energy-based Attention for Editable Dance Motion Generation ECCV2026

Link: https://arxiv.org/abs/2606.22726
Authors: Seong Jong Yoo, Siyuan Peng, Felix Gu, Stratis Aloimonos, Cornelia Fermüller
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ECCV 2026

Abstract:Choreographic motion generation poses unique challenges for AI, demanding precise semantic control over complex, temporally structured, and expressive full-body dynamics. While existing models can synthesize motion from music, they remain largely black boxes. Conversely, attempting to condition generation on both text and music frequently leads to modality collapse, where dense acoustic rhythms overwhelm sparse semantic text prompts, destroying user controllability. To resolve this spatial-temporal conflict, we propose STREAM (Structural-Temporal Rhythmic Energy-based Attention for Motion), a modality-decoupled diffusion transformer. STREAM strictly separates conditioning pathways: global text semantics dictate the kinematic structure via Adaptive Layer Normalization (AdaLN), while a novel Bimodal Energy-Based Attention Module (BEAM) routes these features to the musical beat without overwriting the semantics. We further introduce Motorica++, a newly curated dataset enriched with domain-specific dance vocabulary and frame-level semantic annotations from existing Motorica dataset. Additionally, to rigorously quantify zero-shot editability, we propose the Exchange Evaluation Protocol and Editable Dance Score (EDS). Through extensive experiments, STREAM achieves state-of-the-art alignment between motion and music while perfectly preserving choreographic semantics, positioning AI not merely as a reactive synthesizer, but as a controllable, collaborative partner for artistic direction. The source code and datasets are available at this https URL.

[AI-89] Subspace-Constrained Federated Learning with Low-Rank Adaptation

Link: https://arxiv.org/abs/2606.22724
Authors: Neranjan Senarath, Rohit Muralitharan, Sadia Asif
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated low-rank adaptation methods are attractive for fine-tuning large models under communication and privacy constraints, but heterogeneous client data can induce geometric misalignment between local low-rank updates. We study whether this subspace misalignment leads to destructive aggregation and slower convergence in LoRA-based federated learning. We propose a subspace-regularized federated LoRA objective that encourages local client updates to remain close to a shared global reference subspace. We present a complete empirical evaluation on two pretrained models, RoBERTa-large and SmolLM-360M, over HellaSwag in a non-IID 10-client federated setting, across 3 random seeds (42, 43, 44), yielding 24 total experimental runs (4 methods x 3 seeds x 2 models). On RoBERTa-large, Subspace-Reg achieves the strongest mean best accuracy (0.454 +/- 0.023), mean final accuracy (0.429 +/- 0.011), and lowest final loss (1.363) across all three seeds, outperforming FedAvg, SVD redistribution, and FedSVD baselines by a large margin. On SmolLM-360M, FedAvg leads on accuracy, revealing that accuracy gains are model-dependent. Crucially, Subspace-Reg achieves near-perfect basis overlap, approximately 0.9999, on both models and across all seeds, versus 0.958 to 0.991 for all baselines, providing robust support for the geometric alignment hypothesis. The code is publicly available at this https URL.

[AI-90] Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship KDD2026

Link: https://arxiv.org/abs/2606.22711
Authors: Haoran Yu, Xiaochong Jiang, Lifei Liu, Su Wang, Pin Qian, Yihang Chen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages. Accepted at the KDD 2026 Workshop on Agentic Software Engineering (SE 3.0)

Abstract:Pooled across five AI coding agents, pull requests (PRs) with a human Co-Authored-By trailer merge less often than purely-autonomous ones (53.8% vs. 79.8%) – yet this aggregate finding is a textbook Simpson’s Paradox. Stratifying 33,596 PRs from the AIDev dataset by agent identity reverses the conclusion: Copilot and Devin show large positive within-agent gaps (+41.2 and +33.5 pp, both p < 0.001), while Cursor, Claude Code, and Codex show small effects whose cross-sectional 95% CIs span zero. The paradox is driven entirely by agent composition: Codex, which dominates 64.9% of the dataset, achieves high merge rates while rarely using co-authorship. But Simpson’s Paradox is only the first layer of a cascade of confounders: within-repo controls eliminate Devin’s gap (+33.5 to +1.6 pp, p=0.73); a commit-count control further halves Copilot’s within-repo gap (+36.2 to +24.4 pp); restricted to multi-commit PRs, the Copilot within-repo effect dissolves to +4.8 pp (p=0.59). No agent retains a clear co-authorship effect once both repository selection and PR structure are controlled. Our findings caution against reporting agent-pooled statistics without stratification and demonstrate that cross-sectional co-authorship associations are largely selection and PR-structure artefacts rather than evidence of a causal benefit.
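The reversal described in this abstract can be reproduced with a toy calculation. The sketch below uses hypothetical counts (they are not taken from the AIDev dataset, and the agent names are placeholders) to show how a group that is concentrated in a low-merge-rate agent can look worse when pooled even though it does better within every agent.

```python
# Hypothetical counts chosen only to illustrate the reversal; NOT AIDev data.
# agent -> {group: (merged_prs, total_prs)}
counts = {
    "agent_a": {"co_authored": (90, 100), "autonomous": (8000, 10000)},
    "agent_b": {"co_authored": (500, 1000), "autonomous": (40, 100)},
}


def rate(merged, total):
    """Merge rate as a fraction."""
    return merged / total


def pooled_rate(group):
    """Merge rate for one group after pooling all agents together."""
    merged = sum(counts[agent][group][0] for agent in counts)
    total = sum(counts[agent][group][1] for agent in counts)
    return rate(merged, total)


def within_agent_gaps():
    """Co-authored minus autonomous merge rate, computed per agent."""
    return {
        agent: rate(*groups["co_authored"]) - rate(*groups["autonomous"])
        for agent, groups in counts.items()
    }


if __name__ == "__main__":
    print("pooled co-authored:", round(pooled_rate("co_authored"), 3))
    print("pooled autonomous: ", round(pooled_rate("autonomous"), 3))
    for agent, gap in within_agent_gaps().items():
        print(agent, "within-agent gap:", round(gap, 3))
```

Both hypothetical agents show a positive within-agent gap, yet the pooled co-authored rate is lower because most co-authored PRs come from the agent with the lower overall merge rate.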

[AI-91] Libretto: Giving LLM Agents a Sense of Musical Structure

Link: https://arxiv.org/abs/2606.22708
Authors: Yichen Xu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative music systems can now produce impressive audio from text prompts, but audio outputs are difficult to inspect, edit, and diagnose as musical structure. We introduce Libretto, an agent-facing framework for symbolic music generation and revision. Libretto uses an LLM-native grammar with explicit onset slots, voices, and bar-level organization, then evaluates each piece in a corpus-calibrated statistical space over rhythm, harmony, melody, texture, form, and variation. The same structural axes support retrieval, diagnosis, copy-risk control, and iterative self-revision. Across gap filling, reference-guided full-piece generation, gradual morphing, and educational music generation, Libretto turns symbolic music from a raw token sequence into a measurable and editable object for language-model agents.

[AI-92] Safety-Aware Evaluation of LLM-Generated Driver Intervention Messages through Multi-Task Risk Fusion

Link: https://arxiv.org/abs/2606.22706
Authors: Keito Inoshita
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing driver intervention systems rely on auditory alerts and fixed templates, failing to leverage multi-task recognition outputs. General-purpose metrics such as BLEU and BERTScore cannot capture intervention-specific quality dimensions including risk-urgency alignment, cognitive load, and driver acceptability. In this paper, we propose the Driver Safety-Aware Intervention Score (DSAIS), a domain-specific metric evaluating five dimensions through a hybrid architecture combining lightweight rule-based computation with LLM Judge evaluation, together with an end-to-end framework integrating four-task recognition outputs into an LLM through risk fusion, state history management, and dynamic prompt construction. Experiments on the AIDE dataset with five models and seven conditions demonstrate that DSAIS achieves ICC 0.798-0.840 across three architecturally distinct judges and Cohen’s d > 1.5 across all control conditions. Multi-dimensional sub-score analysis quantifies the contextual adaptability gap between rule-based and LLM-based systems, revealing that multi-task integration improves contextual relevance by 9.1% over rule-based baselines. Ablation experiments demonstrate that each framework component contributes to contextual relevance, with sub-score decomposition revealing gains that aggregate scoring masks. Driver emotion recognition is identified as the most critical upstream factor, and compact local LLMs (7B–9B parameters) achieve quality superior to API-based models, providing practical design guidelines for in-vehicle deployment.

[AI-93] The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs ACL2026

Link: https://arxiv.org/abs/2606.22686
Authors: Shivam Ratnakar, Kartikeya Vats
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at TrustNLP 2026 (Sixth Workshop on Trustworthy Natural Language Processing, co-located with ACL 2026)

Abstract:Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the “refusal direction” by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution, serving as a diagnostic probe for alignment fragility. When coupled with prefix injection to bypass initial refusal reflexes, this method induces a phase transition where guardrails collapse. Our experiments on 7 model families reveal that safety implementation is architecturally deterministic. While models like Llama-3.1 exhibit a “Late Decision” topology that is easily bypassed by CLS (reaching 95% ASR in approximately one second), others like Qwen-2.5 demonstrate “Early Divergence” by integrating safety mid-computation. Direct comparison with established activation-level steering methods shows that CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), demonstrating that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, we show that this linearity enables bidirectional control: inverting the steering vector “hardens” models against jailbreaks without retraining. Our findings suggest that current alignment techniques create a steerable “safety axis” that serves as both a critical vulnerability and a precise primitive for defense.

[AI-94] RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

Link: https://arxiv.org/abs/2606.22678
Authors: Meher Bhaskar Madiraju, Meher Sai Preetam Madiraju
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 tables, 1 figure

Abstract:Agentic coding harnesses - such as Agent-Skills, Superpowers, and Agent-Rigor - are increasingly deployed to augment underlying LLMs for real-world software engineering tasks. Existing benchmarks evaluate these agents almost exclusively on outcome correctness: whether generated code passes tests or resolves issues. We argue that this outcome-only lens is insufficient: an agent that arrives at a correct solution through reckless trial-and-error, without planning, verification, or graceful recovery, is fundamentally less reliable than one that follows sound engineering discipline. We introduce RigorBench, the first benchmark designed to measure process discipline in AI coding agents. RigorBench evaluates these harnesses across five pillars: Planning Fidelity, Verification Coverage, Recovery Efficiency, Abstention Quality, and Atomic Transition Integrity. A composite RigorScore aggregates these dimensions into a single metric via a weighted sum. We curate a suite of 30 tasks spanning five categories - Plan-Then-Build, Verify-Or-Die, Doom Loop Gauntlet, Know When to Fold, and Don’t Break the Build - and evaluate leading harnesses in a controlled with/without experimental design against baseline coding assistants. Our results show that structured process discipline not only improves process quality scores by an average of 41% but also raises downstream outcome correctness by 17%, providing the first quantitative evidence that how agents code matters as much as what they produce. We release the full benchmark, scoring rubrics, and trajectory analysis tools as open-source artifacts.
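The abstract says only that RigorScore is a weighted sum over the five pillars. A minimal sketch of such an aggregate follows; the equal default weights, the [0, 1] score range, and the function name `rigor_score` are assumptions made for illustration, not the paper's actual rubric.

```python
# Pillar names come from the abstract; everything else here is hypothetical.
PILLARS = (
    "planning_fidelity",
    "verification_coverage",
    "recovery_efficiency",
    "abstention_quality",
    "atomic_transition_integrity",
)


def rigor_score(pillar_scores, weights=None):
    """Weighted sum of per-pillar scores (each assumed to lie in [0, 1])."""
    if weights is None:
        # Assumption: equal weights. The paper's weights are not given here.
        weights = {pillar: 1.0 / len(PILLARS) for pillar in PILLARS}
    missing = [p for p in PILLARS if p not in pillar_scores or p not in weights]
    if missing:
        raise ValueError(f"missing pillar scores or weights: {missing}")
    return sum(weights[p] * pillar_scores[p] for p in PILLARS)


if __name__ == "__main__":
    example = {pillar: 0.5 for pillar in PILLARS}
    example["planning_fidelity"] = 1.0
    print(round(rigor_score(example), 3))
```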

[AI-95] Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations

Link: https://arxiv.org/abs/2606.22676
Authors: Dongyub Jude Lee, Jungseob Lee, Seungyoon Lee, Seongtae Hong, Suhyune Son, Sugyeong Eo, Jaehyung Seo, Heuiseok Lim
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures, 12 tables. The first two authors contributed equally. Code (pre-attack GFS diagnostic): this https URL

Abstract:Alignment tuning is meant to make harmful-request refusal robust, yet this safety behavior can be erased by a small set of benign fine-tuning examples. This is a deployment risk for open-weight models because a checkpoint can pass refusal tests at release time and later lose refusal under low-cost downstream fine-tuning. Prior work has established these refusal failures, but existing studies do not show how to detect this fragility in the aligned model itself before an attack or fine-tuning intervention is run. We introduce Skin-Deep, a geometric diagnostic that detects alignment fragility directly from the aligned model’s hidden-state activations before such an intervention is run and compresses the layer-wise safety geometry into a single scalar, the Geometric Fragility Score (GFS). Applied to twenty-one instruction-tuned models spanning six alignment recipes and 3B–32B parameters, Skin-Deep reveals a recurring low-rank safety subspace across model families. Direction ablations show that removing directions in this subspace weakens harmful-request refusal, providing causal evidence that the recovered geometry underlies refusal behavior. Crucially, GFS identifies, before any fine-tuning, the initially safe model that retains the most refusal after small-scale LoRA fine-tuning. These results establish GFS as a practical pre-deployment diagnostic for flagging fragile refusal behavior without running an attack.

[AI-96] AgentLens: Interpretable Safety Steering via Mechanistic Subspaces for Multi-Turn Coding Agent

Link: https://arxiv.org/abs/2606.22673
Authors: Weidi Luo, Qiming Zhang, Yihao Quan, Mingyu Jin, Jie Cai, Chaowei Xiao, Jingcheng Niu, Zhen Xiang
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Coding agents based on large language models (LLMs) demonstrate remarkable autonomous capabilities, but they also introduce significant safety and misuse risks during multi-turn interactions with external environments. Existing safety mechanisms mainly rely on external guardrails, which have a limited ability to perform fine-grained behavioral control during execution. Meanwhile, recent mechanistic interpretability methods for LLM safety are mostly confined to single-turn or jailbreak-style QA settings, limiting their ability to capture the evolving risk dynamics of multi-turn agent execution. In this paper, we investigate the safety of multi-turn coding agents from an internal perspective. We propose AgentLens (Mechanistic Subspace Intervention and Steering), a white-box defense framework that performs runtime safety detection and representation-level mitigation for coding agents. Unlike conventional agent guardrails, AgentLens detects harmful execution states from step-level hidden representations and mitigates unsafe behavior by intervening in a 10-dimensional subspace within a single layer. To support this research, we introduce the Mechanistic Agent Safety (MAS) benchmark, comprising comprehensively annotated multi-turn execution trajectories across 194 tasks using LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B. Extensive experiments show that AgentLens achieves strong safety detection performance, provides preliminary evidence for lookahead risk anticipation, and substantially reduces harmful actions of the coding agent, establishing a foundation for applying mechanistic interpretability to dynamic LLM agent safety. The code is available at: this https URL

[AI-97] Confidently Wrong: Severity-Aware Calibration of Prompt-Injection Detectors under Attack Shift

Link: https://arxiv.org/abs/2606.22659
Authors: Md Anas Biswas
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 27 pages, 9 figures, 11 tables

Abstract:Prompt-injection detectors are deployed as guards: a model scores an input and a downstream system trusts or blocks it on that score. I study the confidence of these scores, not only their accuracy, when the attack distribution shifts away from the clean benchmark on which the operating point was chosen. I evaluate three released detectors, ProtectAI-v2 and two Prompt-Guard-2 checkpoints, at a single source-calibrated threshold that I freeze and transport across five shifts. I report a severity metric S, how confident a detector is on the attacks it misses, alongside the false-negative rate and discrimination. Across every shift and every detector, severity on the missed attacks stays between 0.99 and 1.00 while the false-negative rate ranges from 0.01 to 0.97: when these detectors miss, they miss with near-certainty. All three confidently pass indirect behavior-hijack injection, a blind spot unanimous across two vendors and a fourfold size range. Standard pooled calibration error does not register this; one detector it rates well-calibrated, at 0.06, is miscalibrated at 0.91 on the attacks alone. Run against live models, the missed injections leak the majority of working exploits, passing them at the rate they catch others. A controlled experiment traces the cause to content-keying rather than injection structure, an instruction-tuned model used as a judge shows the same hijack blind spot, and a black-box rewriter exploits the content-keying to manufacture working confident misses, most effectively on the most dangerous attack category. Code and data are public.
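The abstract describes severity S as how confident a detector is on the attacks it misses, reported next to the false-negative rate at a frozen threshold. The sketch below is one plausible reading under stated assumptions: the detector emits a score in [0, 1] interpreted as P(injection), an attack is missed when its score falls below the threshold, and S is the mean of 1 - score over missed attacks. The paper's exact formula is not given in the abstract, and the function name is hypothetical.

```python
def fnr_and_severity(attack_scores, threshold):
    """Return (false-negative rate, severity) for scores on true attacks.

    Assumptions (not from the paper): score = P(injection) in [0, 1];
    an input is blocked when score >= threshold; severity is the mean
    confidence in the wrong 'benign' label over the missed attacks.
    """
    if not attack_scores:
        raise ValueError("attack_scores must not be empty")
    missed = [s for s in attack_scores if s < threshold]
    fnr = len(missed) / len(attack_scores)
    # Severity is undefined when nothing is missed.
    severity = sum(1.0 - s for s in missed) / len(missed) if missed else None
    return fnr, severity


if __name__ == "__main__":
    # Hypothetical scores: two attacks caught, two missed with high confidence.
    scores = [0.99, 0.01, 0.005, 0.98]
    print(fnr_and_severity(scores, threshold=0.5))
```

Under this reading, a detector can have a low false-negative rate and still a severity near 1.0, which is the pattern the abstract reports: the misses are rare or frequent depending on the shift, but always confident.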

[AI-98] Foundation Models for Epileptogenic Zone Identification in Drug-Resistant Epilepsy

Link: https://arxiv.org/abs/2606.22657
Authors: Thi Kieu Khanh Ho, Thomas Lai, Petr Klimes, Jan Cimbalnik, Martin Pail, Milan Brazdil, Birgit Frauscher, Narges Armanfard
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Keywords: drug-resistant epilepsy, epileptogenic zone, stereo-electroencephalography, signal foundation models, language foundation models

Abstract:Accurate identification of the epileptogenic zone (EZ) is essential for seizure freedom after resective surgery in drug-resistant epilepsy, yet seizure freedom rates remain below 50%. We developed EpiiSLM, a dual foundation model system for EZ identification with stereo-electroencephalography (sEEG), by training a signal foundation model on 104,990 minutes of sEEG recordings from the Montreal Neurological Institute Hospital, while leveraging all recordings regardless of surgical outcome and anchoring EZ biomarker extraction on non-epileptic signals. A language foundation model then integrates sEEG-derived outputs with multimodal clinical information to produce interpretable predictions. Under leave-one-patient-out evaluation, EpiiSLM achieved 0.978 contact-level positive predictive value (PPV), outperforming the seizure onset zone (SOZ)-as-EZ baseline by 15.1% (p < 0.05), and 100% region-level accuracy; on an external dataset, EpiiSLM achieved 0.857 contact-level PPV. EpiiSLM requires only one night of interictal sleep data, suggesting potential to reduce invasive sEEG monitoring duration and improve surgical outcomes.

[AI-99] Confident but Conflicted: Internal Uncertainty and Cognitive Dissonance Resolution in LLMs

Link: https://arxiv.org/abs/2606.22633
Authors: Weihong Qi, Kristina Lerman
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) frequently encounter inputs that disagree with their prior outputs, through user pushback, retrieved documents, or web search results. While the way they resolve such conflicts – a process we frame as cognitive dissonance resolution – has been characterized behaviorally, its connection to internal model uncertainty is not well understood. To study this systematically, we vary persuasion attempts along two dimensions, source authority and evidence quality, across 12 health-science claims of stratified epistemic status. Dissonance can be resolved through persuasion, backfire, or immunity. We introduce Trust Elasticity (TE), an econometrics-inspired measure of how readily a model is persuaded toward conflicting evidence. Across four LLMs, TE varies substantially, while clearly false claims elicit near-zero TE across all models. On two open-weight models, we further find that this variation is associated with two complementary internal uncertainty indicators, Confidence Miscalibration in Qwen and Internal Uncertainty Change in Llama. These results link cross-model behavioral variation to a measurable internal property and point to interventions targeting internal uncertainty as future work.

[AI-100] Federated Learning for Global Carbon Emission Forecasting: A Hybrid Time-Series Approach with Statistical and Neural Models

Link: https://arxiv.org/abs/2606.22618
Authors: Attia Qammar, Qazi Haseeb Yousaf, Ali Azam, Ammar Ahmed, Abdenacer Naouri, Tianrui Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Approximately 22 pages, 5 figures, 15 tables. Submitted for journal publication

Abstract:Climate change, primarily driven by carbon dioxide (CO2) emissions, requires accurate forecasting tools to support effective mitigation policies and sustainable development strategies. Existing forecasting approaches typically rely on centralized data collection, which is often restricted by privacy regulations and the distributed nature of emission data across countries and industrial sectors. This paper proposes a novel federated hybrid forecasting framework that integrates ARIMA-based trend modeling, GARCH-based volatility modeling, LSTM-Attention temporal representation learning, and XGBoost prediction within a privacy-preserving federated learning environment. The proposed framework enables collaborative learning among distributed clients without requiring the exchange of raw data. Experimental evaluation across 14 clients demonstrates strong forecasting performance, achieving client R² values between 0.50 and 0.97 with an average of 0.73, RMSE values ranging from 0.06 to 2.35 with an average of 1.21, and MAPE values between 1.5% and 11.3% with an average of 6.5%. The results indicate that the proposed framework provides an accurate, scalable, and regulation-compliant solution for collaborative carbon-emission forecasting.

[AI-101] SkillAudit: From Fixed-Suite Benchmarking to Skill-Centered Assessment

Link: https://arxiv.org/abs/2606.22613
Authors: Dexu Yu, Youhua Li, Zhaoyang Guan, Xianhao Lin, Jining Luan, Zihao Rao, Xuanqi Lan, Yang Ran, Bo Lan, Nai-Xin Zhai, Hanwen Du, Junchen Fu, Wenhao Deng, Yongxin Ni, Chunxiao Li
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. Project page: this https URL . Code and evaluation artifacts: this https URL

Abstract:Agent skills have become a practical way to extend large language model agents, but the growing skill ecosystem still lacks a reliable way to judge whether a skill is worth deploying. Existing evaluation methods remain largely anchored to fixed task suites, assessing skills through performance on predefined tasks and environments. As skill marketplaces expand, this paradigm becomes inadequate: fixed suites can conflate a skill’s marginal contribution with backbone strength and miss its value when tasks fall outside the skill’s intended scope. We introduce SkillAudit, an end-to-end framework for skill-centered assessment that takes an arbitrary agent skill as input and automatically generates a comprehensive, multi-dimensional evaluation report spanning utility, efficiency/cost, and safety. SkillAudit focuses on the skill artifact itself and constructs capability-aligned evaluation tasks directly from the skill package. The generated tasks are conducted in isolated sandbox environments to collect execution evidence, followed by automated checks with LLM-based judging to produce auditable results. To dissect the agent skills, we propose the baseline comparison principle to measure utility and efficiency/cost, and introduce a two-stage detection paradigm combining static semantic analysis with dynamic runtime verification to assess safety risks. After scanning top-ranked real-world skill packages spanning 23 occupational categories, we found that over 7% of skills are at risky status.

[AI-102] PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement

Link: https://arxiv.org/abs/2606.22610
Authors: Weiwei Ye, Hangchen Liu, Dongyuan Li, Renhe Jiang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models have become capable reasoners and tool users that write and run code and search the literature, which makes automating the research process itself a realistic goal. We present PAPERCLAW, a harnessed multi-agent system that carries a project autonomously, from a field of study to a finished paper. PAPERCLAW curates a domain from a field’s live literature, datasets, and code; brainstorms it into an idea with a pre-registered main-result contract; and drives a stoppable hypothesis map through an iterative propose, test, reflect loop that grows only from measured verdicts and halts once the evidence supports the idea, at which point it writes a venue-compliant paper. A full-lifecycle memory keeps each stage in a single living record, so a long run can be paused, inspected, and resumed without losing context. At the centre is an in-cycle research assistant with research tools and skills: it can drive the whole pipeline on its own, while the same interface lets a person step in at any stage, turning a first autonomous draft into a stronger paper through human-in-the-loop refinement. Throughout, PAPERCLAW keeps its output grounded and checkable, citing only references validated against open scholarly indexes and reporting results that genuinely ran. An evaluation with an LLM judge finds that PAPERCLAW produces strong papers both fully autonomously and with human-in-the-loop refinement.

[AI-103] On the Position Bias of On-Policy Distillation

Link: https://arxiv.org/abs/2606.22600
Authors: Yan Xie, Sijie Zhu, Tiansheng Wen, Bo Chen, Yifei Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens. However, we discover that not all tokens are created equal: as student rollouts grow longer, they deviate further from the teacher’s distribution, leading to degraded supervision quality at later positions. As a result, OPD using only the first 30% of tokens can perform comparably to using all tokens, whereas OPD using only the last 30% of tokens barely learns anything. In this work, we provide a principled understanding of this issue through the lens of constrained optimization. Based on these insights, we derive Importance-Weighted On-Policy Distillation (IW-OPD), in which the weight assigned to each token depends on the accumulated discrepancy between the student’s and teacher’s distributions, naturally upweighting earlier tokens and downweighting later ones with larger deviations. We show that IW-OPD converges significantly faster than OPD, with better learning efficiency, and achieves better final performance than standard OPD in both same-size and cross-scale settings, improving performance up to 6.9 points on AIME-2025.
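The abstract states that IW-OPD weights each token by the accumulated student-teacher discrepancy, so earlier tokens count more than later ones. The sketch below illustrates that idea only; the exponential form, the `beta` parameter, and the function names are assumptions made for illustration and are not the weighting derived in the paper.

```python
import math


def position_weights(per_token_divergence, beta=1.0):
    """Normalized weights that decay with accumulated divergence.

    Hypothetical form: w_t is proportional to exp(-beta * sum of the
    divergences at positions before t). With non-negative divergences the
    weights are non-increasing in position.
    """
    if not per_token_divergence:
        raise ValueError("need at least one token")
    weights = []
    accumulated = 0.0
    for divergence in per_token_divergence:
        weights.append(math.exp(-beta * accumulated))
        accumulated += divergence
    total = sum(weights)
    return [w / total for w in weights]


def weighted_loss(per_token_loss, weights):
    """Weighted average of token-level losses (weights assumed to sum to 1)."""
    return sum(w * loss for w, loss in zip(weights, per_token_loss))


if __name__ == "__main__":
    # Hypothetical per-token KL values that grow along the rollout.
    kl = [0.1, 0.2, 0.4, 0.8]
    w = position_weights(kl)
    print([round(x, 3) for x in w])
    print(round(weighted_loss(kl, w), 3))
```

A uniform average, as in the standard objective the abstract describes, corresponds to `beta=0`.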

[AI-104] Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

Link: https://arxiv.org/abs/2606.22586
Authors: Alexander V. Kozachok, Alexander M. Nazimov, Shamil G. Magomedov
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 14 pages, 4 figures, 5 tables. Accepted at KES 2026 (Knowledge-Based Intelligent Information and Engineering Systems), Procedia Computer Science, Elsevier

Abstract:Domain-specific languages (DSLs) are widely used for managing operating system security policies, yet manually authoring rules in such languages demands high expertise and is error-prone. This paper formalises the task of automatic DSL code generation from natural language descriptions - Text2DSL - as a distinct problem class, separate from Text-to-SQL and general-purpose code generation. We introduce the PolkitBench dataset comprising 4,204 verified natural-language-to-Polkit-rule pairs, each validated through a three-level AST-based pipeline. Controlled prompt experiments on two MoE models of different scale and provenance - GigaChat-10B-A1.8B (1.8B active parameters) and Nemotron-3-Nano-30B-A3B (3B active) - demonstrate the critical role of structured context (BNF grammar, API specification, permitted identifier vocabulary) for LLM-based DSL code generation. Across both models, supplying context raises syntactic validity to 98.6-99.4%, structural validity by +9.7 to +35.5 pp, and the CodeBLEU score by +60% to +95%. The consistency of the effect across models of different scale and provenance indicates that, for the Text2DSL class of problems, injecting a formal target-language specification into the prompt context is a robust enabling factor for high-quality generation without model fine-tuning.

[AI-105] From CVE to CWE: Syscall-Based HIDS Generalisation

Link: https://arxiv.org/abs/2606.22581
Authors: Alexander V. Kozachok, Stanislav G. Vyugov, Shamil G. Magomedov
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 17 pages, 7 figures, 6 tables

Abstract:Host intrusion detection systems (HIDS) based on system-call traces are typically trained and evaluated against individual Common Vulnerabilities and Exposures (CVE) instances. In operational settings, however, defenders need to recognise new exploits of an already known type of weakness. We empirically examine whether a one-class anomaly detector trained on the normal behaviour of a set of CVEs that share a Common Weakness Enumeration (CWE) class generalises to a different, unseen CVE inside the same class. Using six scenarios drawn from LID-DS-2021 and grouped into three CWE families (CWE-307 broken authentication, CWE-89 SQL injection, CWE-434 unrestricted file upload), we extract a 66-dimensional Peng-Guo-style feature vector per sliding window and train Isolation Forest and SGD One-Class SVM detectors with normal-only thresholds calibrated to fixed target false positive rates. We define and answer four research questions covering self-detection, asymmetric cross-CVE transfer, the value of a combined CWE-level normal profile, and the effect of feature filtering on transferability. The combined CWE-307 detector reaches F1 = 0.6976 at calibration target FPR = 0.05 (precision = 0.8994, recall = 0.5698), whereas CWE-89 and CWE-434 collapse to F1 = 0.21 under the same protocol. Cross-CVE transfer turns out to be strongly direction-dependent and dominated by the breadth of the source normal profile rather than by the CWE label. We conclude that CWE-level generalisation in HIDS is empirically attainable for some but not all weakness families with current syscall features, and we argue that calibrated FPR is a methodological prerequisite for honest reporting in this setting.
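Two steps in this abstract are easy to make concrete: calibrating a threshold on normal-only scores to a target false positive rate, and checking the reported F1 against its precision and recall. The sketch below assumes higher scores mean more anomalous and a window is flagged when its score is strictly above the threshold; the function names are hypothetical and the calibration rule is a simple empirical quantile, not necessarily the authors' procedure.

```python
def calibrate_threshold(normal_scores, target_fpr):
    """Pick a threshold so at most target_fpr of normal windows are flagged.

    Assumption: a window is flagged when score > threshold.
    """
    if not normal_scores:
        raise ValueError("normal_scores must not be empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(normal_scores)
    # Number of normal windows allowed to exceed the threshold.
    allowed = int(target_fpr * len(ordered))
    return ordered[len(ordered) - allowed - 1]


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    normal = list(range(100))  # hypothetical anomaly scores on normal data
    threshold = calibrate_threshold(normal, target_fpr=0.05)
    flagged = sum(1 for s in normal if s > threshold)
    print(threshold, flagged / len(normal))
    # Consistency check of the CWE-307 figures quoted in the abstract.
    print(round(f1(0.8994, 0.5698), 4))
```

The last line reproduces the F1 of 0.6976 reported for the combined CWE-307 detector from its stated precision and recall.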

[AI-106] Generative Robust Optimisation

Link: https://arxiv.org/abs/2606.22536
Authors: Yuhui Yin, Vassilis M. Charitopoulos
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:

Abstract:Classical uncertainty sets for robust optimisation impose fixed geometric shapes that cannot represent the complex dependencies present in real-world data. We propose Generative Robust Optimisation (GRO), a framework in which a deep generative model defines the uncertainty set as the image of a neural network decoder over a calibrated latent set, naturally accommodating nonlinear correlations, asymmetry, and multimodality. A five-point evaluation framework (reconstruction fidelity, distribution matching, latent regularity, robust relevance, and computational tractability) provides systematic, model-agnostic criteria for assessing any neural network-based uncertainty set. We instantiate this framework with a Wasserstein Adversarial Autoencoder employing Gaussian mixture model-guided training for latent regularity and constraint-consistency regularisation for robust relevance. Restricting the decoder to ReLU activations enables exact worst-case verification through mixed-integer programming embedding. Extensive experiments on a production planning problem across six uncertainty distributions and six generative architectures, together with a multi-period facility location study, validate the framework and demonstrate that systematic attention to all five criteria yields uncertainty sets that are simultaneously expressive, well-calibrated, and optimisation-tractable.
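The construction in this abstract can be written compactly. In the sketch below, the symbols are assumptions introduced for illustration: $g_\theta$ is the trained decoder, $\mathcal{Z}$ the calibrated latent set, $x \in \mathcal{X}$ the decision, and $f$ a generic cost; the abstract names none of them.

```latex
% Uncertainty set as the image of the decoder over the calibrated latent set
\mathcal{U} \;=\; \{\, g_\theta(z) \;:\; z \in \mathcal{Z} \,\}

% Robust problem over that set, rewritten as a search over the latent set
\min_{x \in \mathcal{X}} \; \max_{u \in \mathcal{U}} \; f(x, u)
\;=\;
\min_{x \in \mathcal{X}} \; \max_{z \in \mathcal{Z}} \; f\bigl(x, g_\theta(z)\bigr)
```

Because the decoder is restricted to ReLU activations, $g_\theta$ is piecewise linear, which is what allows the inner maximization to be embedded in a mixed-integer program as the abstract describes.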

[AI-107] Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents

Link: https://arxiv.org/abs/2606.22528
Authors: Shiyang Chen
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern LLM agents increasingly rely on context compaction, summarization, or eviction to keep long-running sessions within a token budget. We show that this context-management layer is a safety-critical failure surface: in-context governance constraints that agents reliably obey while visible can be silently removed by compaction, causing the same agent to perform prohibited tool actions later in the session. We call this failure mode Governance Decay. We introduce ConstraintRot, a benchmark of long-horizon agent scenarios with deterministic tool-call grading, and measure compaction-induced violations across seven model families. Across 1,323 episodes, violation rises from 0% with the policy in full context to 30% after compaction, reaching 59% for some models; when the constraint survives the summary, violation remains 0%, but when it is dropped, violation reaches 38%. We further study a Compaction-Eviction Attack, in which adversarial in-context content biases the summarizer to omit a legitimate policy, and show that optimized injections defeat every evaluated model. Finally, we propose Constraint Pinning, a simple training-free mitigation that quarantines governance constraints from lossy compaction and restores violation to 0% in our benchmark. These results identify context management as a first-class governance surface for deployed LLM agents.
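
The idea of keeping governance constraints out of lossy compaction can be sketched with a toy message store. This is only an illustration under stated assumptions: the `Message` type, the `pinned` flag, and the stub summary are hypothetical, and the paper's Constraint Pinning mechanism may differ.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag marking a governance constraint


def compact(history: list[Message], keep_last: int) -> list[Message]:
    """Toy compaction: old unpinned messages collapse into a stub summary,
    while pinned messages are carried over verbatim."""
    cut = max(0, len(history) - keep_last)
    old, recent = history[:cut], history[cut:]
    pinned = [m for m in old if m.pinned]
    dropped = [m for m in old if not m.pinned]
    summary = (
        [Message("system", f"[summary of {len(dropped)} earlier messages]")]
        if dropped
        else []
    )
    return pinned + summary + recent


history = [
    Message("system", "Never call delete_file on /prod.", pinned=True),
    Message("user", "Step 1"),
    Message("assistant", "Done 1"),
    Message("user", "Step 2"),
    Message("assistant", "Done 2"),
]
compacted = compact(history, keep_last=2)
for m in compacted:
    print(m.role, "|", m.text)
```

Because the pinned message bypasses the summarizer entirely, its survival does not depend on summary quality or on adversarial content elsewhere in the context.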

[AI-108] Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation and Policy Evaluation

Link: https://arxiv.org/abs/2606.22510
Authors: Pengfei Li, Mohammad Khalil
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract: While federated learning enables collaborative modelling on decentralised data, standard methods merely fit historical observations. This purely observational approach is fundamentally insufficient for interventional inference and policy evaluation, as sequential actions dynamically alter future states. We propose Fed-CausalDiff, a federated causal diffusion framework for do-simulation. The architecture decomposes the evolution of the latent state into a global causal score function and a local confounding score function. This design enables decoupled synchronisation (DSS), where clients aggregate only the shared causal mechanism while retaining site-specific confounders locally to handle heterogeneity. Experiments on four datasets demonstrate that Fed-CausalDiff achieves better ATE and policy-value estimation accuracy, offering a favorable trade-off between communication cost and inference fidelity.
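
The split between shared and site-specific parameters can be illustrated with a toy synchronisation round. This is a sketch under assumptions: parameters are plain lists, aggregation is an unweighted FedAvg-style mean, and the keys `causal` and `confounder` are hypothetical names; the paper works with score functions of a diffusion model, which this does not reproduce.

```python
def aggregate_shared(clients: list[dict[str, list[float]]]) -> list[float]:
    """Average only the 'causal' parameters across clients.
    The 'confounder' parameters are never read here."""
    n = len(clients)
    dim = len(clients[0]["causal"])
    return [sum(c["causal"][i] for c in clients) / n for i in range(dim)]


def sync_round(clients: list[dict[str, list[float]]]) -> None:
    """One round: broadcast the averaged shared part, keep local parts local."""
    shared = aggregate_shared(clients)
    for c in clients:
        c["causal"] = list(shared)
        # c["confounder"] is intentionally untouched and never leaves the client.


clients = [
    {"causal": [1.0, 2.0], "confounder": [0.1]},
    {"causal": [3.0, 4.0], "confounder": [0.9]},
]
sync_round(clients)
print(clients)
```

Only the shared part crosses the network, so communication cost scales with the size of the causal component rather than the full model.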

[AI-109] Imagine to Ensure Safety in Hierarchical Reinforcement Learning

Link: https://arxiv.org/abs/2606.22509
Authors: Gregory Gorbov, Artem Latyshev, Aleksandr I. Panov
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: This work investigates the safe exploration problem in reinforcement learning, where an agent must maximize cumulative performance while simultaneously satisfying safety constraints. This challenge becomes even more pronounced in long-horizon tasks, where existing safe methods face fundamental limitations due to compounding estimation errors and restricted exploration capabilities. To address this problem, we propose a method that combines a learnable world model with two complementary policies, a high-level policy and a low-level policy, to promote safety at both hierarchical levels. The high-level policy generates intermediate subgoals that bias exploration toward safe regions, while the low-level policy uses imagined rollouts in the learned world model to reduce unsafe behaviors when reaching these subgoals. The proposed method was evaluated on challenging long-horizon navigation and manipulation tasks with high-dimensional action spaces, where it significantly outperforms existing Safe RL baselines in both success rate and empirical constraint satisfaction, consistently meeting the prescribed safety budget across seeds, while prior approaches fail to effectively solve these complex long-horizon scenarios.

[AI-110] Enabling Cloud-Level Accuracy in Edge AI through IoT Data Preprocessing

Link: https://arxiv.org/abs/2606.22496
Authors: Aygün Varol, Katarzyna Kołodziej, Łukasz Sobczak, Michał Romaszewski, Przemysław Głomb, Naser Hossein Motlagh, Mirka Leino, Johanna Virkki
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) offer a natural-language interface for interpreting Internet of Things (IoT) sensor data in smart environments; however, cloud deployment introduces latency, privacy, and connectivity concerns. Local LLMs can reduce these limitations, but compact edge-deployable models often show weaker numerical reasoning when raw sensor readings are provided directly. This paper investigates whether prompt-side preprocessing can improve the accuracy-latency trade-off of local LLMs for environmental monitoring. We propose a structured prompt construction framework that transforms raw air-quality and thermal-comfort measurements into progressively enriched textual representations: raw sensor values, threshold-aware descriptions, and compact environmental summary flags. The approach is evaluated using indoor Raspberry Pi/BME680 datasets from Tampere University and outdoor air-quality datasets from Helsinki, Katowice, and Warsaw. We construct a binary LLM query dataset covering air quality, thermal comfort, and joint environmental conditions, and evaluate five local and five cloud LLMs across three prompt variants and two inference modes, with and without chain-of-thought prompting. Results show that prompt enrichment substantially improves local-model accuracy. In No-CoT mode, local accuracy increases from 50.9% to 81.7% indoors and from 63.7% to 89.3% outdoors from the raw to the most enriched prompt. Local No-CoT inference is the fastest configuration, with mean latency close to 0.22 s, while CoT substantially increases inference time. These findings suggest that lightweight prompt-side preprocessing can narrow the local–cloud performance gap and support low-latency IoT analytics in smart environments.
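
The three prompt variants described above (raw values, threshold-aware descriptions, summary flags) can be sketched as three small formatting functions. The thresholds, field names, and wording below are hypothetical choices for illustration and are not the ones used in the paper.

```python
# Hypothetical comfort ranges, chosen only for illustration.
TEMP_RANGE_C = (20.0, 26.0)
HUMIDITY_RANGE_PCT = (30.0, 60.0)


def raw_prompt(temp_c: float, humidity_pct: float) -> str:
    """Variant 1: raw sensor values only."""
    return f"temperature={temp_c} C, humidity={humidity_pct} %"


def _describe(name: str, value: float, low: float, high: float, unit: str) -> str:
    if value < low:
        status = "below"
    elif value > high:
        status = "above"
    else:
        status = "within"
    return f"{name} is {value} {unit}, {status} the {low}-{high} {unit} range"


def threshold_prompt(temp_c: float, humidity_pct: float) -> str:
    """Variant 2: each value is compared to its range in words."""
    return "; ".join([
        _describe("temperature", temp_c, *TEMP_RANGE_C, "C"),
        _describe("humidity", humidity_pct, *HUMIDITY_RANGE_PCT, "%"),
    ])


def flag_prompt(temp_c: float, humidity_pct: float) -> str:
    """Variant 3: compact boolean summary flags."""
    temp_ok = TEMP_RANGE_C[0] <= temp_c <= TEMP_RANGE_C[1]
    hum_ok = HUMIDITY_RANGE_PCT[0] <= humidity_pct <= HUMIDITY_RANGE_PCT[1]
    return f"TEMP_OK={temp_ok} HUMIDITY_OK={hum_ok} COMFORT_OK={temp_ok and hum_ok}"


for build in (raw_prompt, threshold_prompt, flag_prompt):
    print(build(28.5, 45.0))
```

The enriched variants move the numerical comparison out of the model and into deterministic code, which is the step compact models reportedly struggle with.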

[AI-111] Grounded Scaling: Why Agentic AI Needs Deterministic Environments

Link: https://arxiv.org/abs/2606.22495
Authors: Liang Ding, Xintong Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Long-chain agent execution fails exponentially in environments designed for human tolerance: with per-step determinism $\delta < 1$, $k$-step chain success degrades as $\delta^k$. The AGI-to-ASI scaling debate (Genewein et al., 2026) has so far framed progress as a race between compute growth and a list of frictions (data wall, abstraction barrier, embodied bottleneck, multi-agent trust); we argue that environment determinism is a complementary binding axis cutting across all four, for the broad class of agentic AI tasks whose outcomes are verifiable economically, physically, or through multi-party settlement. Three formal results pin down the regime: a Determinism-Efficiency Bound on chain-task success, a Verifier-Goodharting Floor on flywheel ceilings under imperfect rewards, and a convergence condition for environment-side skill evolution. We operationalise the framework as a Supply Certainty Index (SCI) over five measurable properties, a five-level Determinism Maturity Model (DMM) as adoption ladder, and a falsifiable open-question programme (OQ1-OQ5) with explicit null results that would force retraction. The position is platform-agnostic. We engage three competing positions: sim-to-real sufficiency, alignment sufficiency, and AI-as-normal-technology.
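
The exponential degradation of chain success with chain length is easy to see numerically. The sketch below assumes each step succeeds independently with the same probability, which is the usual reading of a success rate that decays as the per-step rate raised to the number of steps; the function names are illustrative.

```python
import math


def chain_success(delta: float, k: int) -> float:
    """Probability that k independent steps all succeed when each step
    succeeds with probability delta."""
    return delta ** k


def max_chain_length(delta: float, target: float) -> int:
    """Largest k with delta**k >= target, for 0 < delta < 1 and 0 < target <= 1."""
    return math.floor(math.log(target) / math.log(delta))


for delta in (0.9, 0.99, 0.999):
    print(delta, round(chain_success(delta, 100), 4), max_chain_length(delta, 0.5))
```

Even at 99% per-step determinism, a chain can run only a few dozen steps before its overall success probability falls below one half.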

[AI-112] Deep Learning-Based Sign Language Recognition from Videos and Cross-Lingual Translation to Indian Vernaculars

Link: https://arxiv.org/abs/2606.22494
Authors: Chandranath Adak, Ramesh Nandipalli
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures

Abstract:Sign language is a primary mode of communication for the global deaf and hard-of-hearing community, yet automated tools that recognize sign gestures from video and translate them into natural language text remain limited, particularly for low-resource Indian languages. We present a two-stage deep learning pipeline that (i) classifies short sign language video clips into English word labels using a fine-tuned VideoMAE video transformer, and (ii) translates the predicted English label into Hindi, Telugu, and Bengali using Meta AI’s No Language Left Behind (NLLB-200) multilingual translation model. The classification model is fine-tuned on a 13-class subset of the AI4Bharat Indian Sign Language video corpus from IIT Madras, processing 16-frame clips sampled uniformly from each video at 224 x 224 resolution. Under a small-scale academic setting (13 classes, 197 clips, 80-20 split), the fine-tuned model reaches 99% training accuracy and 78% validation accuracy after 15 epochs. We provide a per-class breakdown via a confusion matrix and classification report, identify the dominant failure modes (confusable adjective pairs such as ugly, deaf, blind, hat, and dress), and describe a Streamlit-based inference demo that takes a user-uploaded video and returns the predicted English label alongside its Hindi, Telugu, and Bengali translations. We discuss the scope, limitations (small label set, isolated-word rather than continuous signing, single-signer style sensitivity, ambiguity of single-word machine translation), and directions for future work, including expanding to sentence-level generation and a larger vocabulary. Code is released to support reproducibility.

[AI-113] SCOPE: Evolving Symbolic World for Planning in Open-Ended Environments ICML2026

Link: https://arxiv.org/abs/2606.22488
Authors: Yundaichuan Zhan, Minghe Gao, Zhongqi Yue, Wendong Bu, Wenqiao Zhang, Guoming Wang, Jisheng Dang, Juncheng Li, Siliang Tang, Yueting Zhuang
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026

Abstract:Recent works have explored integrating Vision-Language Models (VLMs) with classical planners that rely on symbolic representations of planning problems to generate long-horizon plans for complex embodied tasks. However, in open-ended environments, these symbolic representations obtained from perception are often incomplete, leading to suboptimal performance. To address this, we introduce SCOPE, a self-adaptive symbolic planning framework that supports refining action plans and evolving the symbolic world, i.e., the symbolic representations of open-ended environments. SCOPE comprises two synergistic modules: a Symbolic Execution Simulator (SESim) that conducts symbolic validation and real execution of action plans, leveraging the feedback to refine the plans and evolve the symbolic world; and a Self-Adaptive Symbolic Memory (SASMem) that further distills feedback into evolving symbolic knowledge to enhance long-horizon planning and modeling of the symbolic world. Experiments in open-ended environments show that SCOPE significantly improves the completeness of the symbolic world, the success rate of plans under environment perturbations, and cross-task grounding and adaptability across diverse embodied scenarios.

[AI-114] All Green Still Broken: Real-Flow Verification Lessons from an LLM-Integrated Multi-Market Web Application

Link: https://arxiv.org/abs/2606.22475
Authors: Muhammad Bilal (Technical University of Munich), Ali Hassaan Mughal (Independent Researcher)
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures, 2 tables. Preprint of a manuscript submitted to IEEE Software

Abstract:Modern web applications increasingly combine three ingredients that are hard to test: output from large language models, multi-market internationalization, and browser-driven front-ends over external data sources. We report on a production rental-search assistant whose automated suite grew to 1,553 test cases in six weeks. The suite passed continuously, yet user-facing defects continued to reach production. We studied all 252 bug-fix commits in the project and classified each by the boundary, or seam, it escaped through. About 44 percent of the fixes fall in four seams that component-level unit tests cannot observe: the live browser runtime, the non-default market, the end-to-end flow, and the whole-system level. A fix without a guard at the seam let one defect ship twice. We present the four-seam framework, the measured defect distribution, and the practices we adopted, including a simple way for a team to find the seam that carries the most fixes.

[AI-115] PRIME: Evaluating Prompt Resolution Under Incompatible Instructions in LLMs

Link: https://arxiv.org/abs/2606.22470
Authors: Tehreem Javed, Shumaim Fatimah, Masooma Bakhtiari, Gibrail Islam, Mehwish Fatima
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Large language models (LLMs) often encounter conflicting prompts, yet current instruction-following benchmarks assess meta-instructions in isolation, limiting insight into how models process conflicting instructions. We introduce PRIME (Prompt Resolution under Incompatible Meta-Instructions Evaluation), a framework for analyzing the behavior of LLMs when they are given conflicting instructions. PRIME purposefully produces calibrated conflicts across response length, output format, and reasoning, and classifies model responses with a deterministic behavioral taxonomy. We evaluate five instruction-tuned open-weight LLMs in two distinct settings, balanced and naturally distributed. Our analysis shows that conflict type affects behavior more than model scale does, and that failure modes vary across categories of conflict. Our findings emphasize the value of developing conflict awareness and suggest that an LLM's ability to follow instructions cannot be assessed through isolated constraints alone.

[AI-116] Self-Evolving Cognitive Framework via Causal World Modeling for Embodied Scientific Intelligence

Link: https://arxiv.org/abs/2606.22449
Authors: Yi Yu, Tetsunari Inamura
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 18 pages

Abstract:Current embodied world models are primarily optimized for predictive objectives, limiting their ability to generalize under distribution shifts and reason systematically about unseen situations and hypothetical interventions. We argue that embodied intelligence should move beyond predictive world modeling toward self-evolving cognitive systems that continually construct and refine internal causal representations through interaction with the environment. To this end, we propose a self-evolving cognitive framework via causal world modeling for embodied scientific intelligence, which integrates three complementary components: causal world modeling, intervention-driven causal reasoning, and continual cognitive refinement. The proposed framework continuously revises and expands its internal causal world model through causal discovery, intervention-driven feedback, and counterfactual reasoning, supporting continual cognitive refinement and enabling cognition itself to evolve over time. Furthermore, we reinterpret embodied interaction not merely as a means of trajectory optimization, but as an epistemic process for causal hypothesis generation, intervention-driven experimentation, and continual knowledge acquisition. This work provides a conceptual and theoretical foundation for a transition from predictive intelligence toward epistemic intelligence, in which intelligence emerges through the continual construction, revision, and refinement of causal world models via interaction with the environment. Accordingly, an intervention-driven causal-epistemic benchmarking paradigm is suggested for evaluating self-evolving embodied scientific intelligence.

[AI-117] A Differentiable Atari VCS: A Complex Fully Known Ground Truth for Explainable AI AAAI2027

Link: https://arxiv.org/abs/2606.22447
Authors: Andreas Maier, Siming Bayer, Patrick Krauss
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submission for AAAI 2027

Abstract: Explanation requires ground truth: to verify an account of a system we must know its inner functioning, which is just what is missing where explainable AI (XAI) is most needed. Systems we can study fall into two camps. Simple, procedural ones (decision trees, rule lists, sparse linear models) have a known but trivial mechanism, so explaining them tests nothing; genuinely complex ones (deep networks, real-world tasks) need XAI but have no ground-truth inner functioning, so an explanation can be plausible, confident, and wrong with no way to tell. We remove this dichotomy with a study object that is both genuinely complex and fully specified (inspectable by construction) and, so that gradient methods apply, fully differentiable. We reimplement the Atari 2600 Video Computer System (VCS), a real computer architecture and the cradle of deep reinforcement learning, as two independent end-to-end differentiable emulators in Julia (jutari) and JAX (jaxtari), each validated bit-for-bit against xitari. Both reproduce xitari on all 64 supported Arcade Learning Environment (ALE) games: 64/64 byte-identical RAM and 64/64 pixel-identical screens. Treating the cartridge ROM as a weight tensor, RAM as a soft tape, and control flow as gates, we prove the differentiable (soft) execution equals the original (hard) one bit-for-bit in the forward pass at any finite temperature, while exposing surrogate gradients where the bit logic has none. The JAX port also opens a GPU path: batched differentiable rollouts reach millions of environment-steps/s on one commodity GPU. The system was built in roughly 137 active hours over 29 calendar days, much of it written autonomously by coding agents. This paper builds and validates the foundation, showing, theoretically and in a qualitative gradient study, that gradient-based XAI on it is feasible. Both ports’ full code is available under the MIT license at this https URL.

[AI-118] Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

Link: https://arxiv.org/abs/2606.22442
Authors: Xiangyuan Xue, Yang Yu, Yan Gao, Junyan Wang, Bin Chen, Lingyan Ruan, Ting Dang, Hong Jia
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Pulmonary embolism (PE) is a high-risk cardiopulmonary condition whose management requires both timely diagnosis and reliable assessment of future clinical risk. Because PE care routinely combines computed tomography pulmonary angiography (CTPA), radiology interpretation, and longitudinal electronic health record (EHR) evidence, it provides a clinically meaningful setting for evaluating compact multimodal language models. In this work, we build a benchmark using efficient multimodal large language models (MLLMs) on INSPECT, a multimodal PE dataset containing 23,248 CTPA studies from 19,402 patients. We formulate eight diagnostic and prognostic tasks as structured clinical question answering problems and evaluate typical efficient MLLMs under CTPA-Only, EHR-Only, and CTPA+EHR settings with zero-shot and few-shot prompting. Results show that Gemma4 E4B and Gemma4 E2B perform more strongly when EHR evidence is available, especially under CTPA+EHR input. Task-level analysis further shows that PE diagnosis achieves higher performance than prognostic tasks, particularly readmission prediction. These observations suggest that compact multimodal models have great potential for early-stage PE risk detection and explanation.

[AI-119] SVGym (SciVerseGym): An Environment for Reinforcement Learning and Bayesian Optimization in Crystal Discovery

Link: https://arxiv.org/abs/2606.22425
Authors: Bin Cao
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
Comments:

Abstract:Machine-learned interatomic potentials now enable efficient atomistic evaluation for interactive materials discovery, yet closed-loop crystal search methods remain fragmented across bespoke pipelines for editing, relaxation, scoring, constraints, and bookkeeping. We introduce SciVerseGym, a Gymnasium-compatible environment for sequential crystal discovery that frames crystal design as a Markov decision process. Agents observe an atomistic structure, apply chemically meaningful edits, and receive feedback from a configurable evaluator. SciVerseGym supports local and global actions, including elemental substitution, lattice perturbation, atomic displacement, vacancy creation, and atom insertion, along with configurable chemical spaces, structure pools, atomistic and graph-based observations, custom rewards, optional relaxation, and stability or phonon-related diagnostics. Each step applies an edit, evaluates the candidate using a machine-learned interatomic potential or any ASE-compatible calculator, and returns the standard (obs, reward, terminated, truncated, info) tuple. By decoupling agent logic from materials infrastructure, SciVerseGym provides an open, reproducible, and extensible testbed for reinforcement learning, Bayesian optimization, evolutionary search, and language-agent workflows in closed-loop crystal discovery. Code is available at: this https URL.
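
The edit, evaluate, and return cycle described above follows the Gymnasium `reset`/`step` convention, which can be illustrated with a toy stand-in. This is not the SciVerseGym API: the class `ToyCrystalEnv`, its action format, and the scoring table are hypothetical, and the sketch deliberately avoids importing any real library.

```python
import random


class ToyCrystalEnv:
    """Toy stand-in with a Gymnasium-style reset/step interface. The state is
    a list of element symbols; the 'evaluator' is a made-up scoring table."""

    SCORES = {"Na": 0.0, "K": 0.5, "Li": 1.0}  # hypothetical, not physical

    def __init__(self, n_sites: int = 3, max_steps: int = 5):
        self.n_sites, self.max_steps = n_sites, max_steps

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.sites = ["Na"] * self.n_sites
        self.steps = 0
        return tuple(self.sites), {}

    def step(self, action):
        site, element = action  # an elemental substitution edit
        self.sites[site] = element
        self.steps += 1
        reward = sum(self.SCORES[e] for e in self.sites)
        terminated = all(e == "Li" for e in self.sites)
        truncated = self.steps >= self.max_steps and not terminated
        return tuple(self.sites), reward, terminated, truncated, {"steps": self.steps}


env = ToyCrystalEnv()
obs, info = env.reset(seed=0)
done = False
while not done:
    # Random agent: pick a site and a replacement element.
    action = (env.rng.randrange(env.n_sites), env.rng.choice(list(env.SCORES)))
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
print(obs, reward, info)
```

Keeping the agent loop unaware of how the reward is computed is what lets the same loop drive reinforcement learning, Bayesian optimization, or a language agent.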

[AI-120] Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent

Link: https://arxiv.org/abs/2606.22417
Authors: Ishaan Bhola, Adithyan Krishnan, Sravanth Kurmala, Mukunda NS
Categories: Artificial Intelligence (cs.AI)
Comments: Code and data: this https URL

Abstract: Coding agents now interleave LLMs with retrieval over the working repository, and retrieval implementations vary widely across deployed harnesses. Inside a fixed coding-agent harness on a fixed model, does adding a structural codebase index actually change cost or resolve rate? We ran three arms (the harness with the index, the same harness without it, and an agentic-grep comparator) on SWE-PolyBench Verified and SWE-bench Pro with Claude Opus 4.7 held fixed throughout, across three seeds, inside a leak-audited per-task sandbox. The within-harness ablation produces a large localization gain and a statistically separated resolve gain, with no cost penalty per cell and lower cost per solve. The cross-harness check shows that the index does not regress against an agentic-grep baseline on resolve or localization, again at no cost penalty. We release the per-cell exclusion ledger, the leak-audit script, the localization extractor, and the results database. The deployment question for a structural codebase index is thus not whether it is too expensive to run (across seeds, the index lands at a lower cost per solved task than agentic grep) but whether the workload includes multi-file changes where structural ranking pays off.

[AI-121] MetaPS: Adaptive Programmatic Strategy Selection for Market Agents

Link: https://arxiv.org/abs/2606.22385
Authors: Jiaxiang Chen, Aotian Luo, Zhouyi Zheng, Weiyi Huang, Chi Zhang, Zenglin Xu
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract: No single market strategy always wins: momentum, mean reversion, risk control, and event-driven rules can each succeed or fail as market conditions change. Rather than asking large language models to directly generate market actions, we study an executable decision paradigm where an agent selects from a library of programmatic strategies, each implemented as a code module mapping market observations to actions. We propose MetaPS, a simulation-guided framework for adaptive programmatic strategy selection. MetaPS rolls out candidate strategies in simulated or backtested markets, identifies states where particular strategies lead to better future outcomes, and converts these state–strategy pairs into supervised fine-tuning data. During inference, the simulator is no longer queried: MetaPS observes only the current market state and candidate strategy context, selects a suitable strategy program, and the selected program produces the final action. Experiments on multi-stock trading and a controlled goods-exchange sandbox show that MetaPS consistently improves across model scales from 0.8B to 9B parameters. It outperforms fixed-strategy baselines, direct decision-making agents, and prompted API-based LLM agents; in several settings, compact fine-tuned models even surpass stronger API models. These results demonstrate that market simulations can provide scalable and targeted supervision for learning adaptive, interpretable, and executable strategy selection.
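
The step that turns rollouts into supervision can be sketched with two toy strategies and a short price series. Everything here is a hypothetical illustration: the strategies, the two-price state, the one-step payoff, and the function `build_sft_pairs` are assumptions, not the paper's implementation.

```python
def momentum(prices: list[float]) -> int:
    """Go long (+1) after an up move, short (-1) otherwise."""
    return 1 if prices[-1] > prices[-2] else -1


def mean_reversion(prices: list[float]) -> int:
    """Bet against the last move."""
    return -1 if prices[-1] > prices[-2] else 1


STRATEGIES = {"momentum": momentum, "mean_reversion": mean_reversion}


def build_sft_pairs(prices: list[float], horizon: int = 1):
    """For each time step, backtest every strategy on the known future and
    record the (state, best strategy name) pair as a training example."""
    pairs = []
    for t in range(1, len(prices) - horizon):
        state = tuple(prices[t - 1 : t + 1])
        future_return = prices[t + horizon] - prices[t]
        best = max(
            STRATEGIES,
            key=lambda name: STRATEGIES[name](list(state)) * future_return,
        )
        pairs.append((state, best))
    return pairs


prices = [100, 101, 103, 102, 101, 102]
pairs = build_sft_pairs(prices)
for state, label in pairs:
    print(state, "->", label)
```

The future prices are used only to build labels; a selector trained on these pairs sees just the current state at inference time, which matches the abstract's point that the simulator is no longer queried.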

[AI-122] Reference-Free Assessment of Physical Consistency in World Model-based Video Generation CVPR2026

Link: https://arxiv.org/abs/2606.22363
Authors: Yun Oh, Sukmin Yun
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to the 2nd 3D-LLM/VLA Workshop, CVPR 2026

Abstract:We introduce reference-free measures for evaluating the physical consistency of generated videos, combining relative and absolute approaches to assess fidelity. Although tools like WorldGym or WorldEval enable robotic simulation via video generation, physical fidelity gaps often prevent these environments from accurately reproducing real-world task success rates of VLA models. Unlike existing evaluation methods, which require costly human voting (Elo) or unavailable ground-truth references (FVD), our approach utilizes DROID-SLAM and SEA-RAFT to quantify physical inconsistencies, motivated by WorldScore. Videos filtered using our relative consistency assessment show an improvement in task success rates of over 8%, effectively narrowing the simulation-to-reality gap. Furthermore, our absolute assessment enables spatio-temporal localization, providing visualization of when and where physical artifacts occur.

[AI-123] On the Sparsity-Storage-Accuracy Tradeoff in Parsimoniously Activated Dictionary Learning

Link: https://arxiv.org/abs/2606.22352
Authors: Zihui Zhao, Yuanbo Tang, Yang Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dictionary learning has long been studied from both optimization and probabilistic perspectives. While formulations with element-wise sparsity regularization (e.g., L1-based sparse coding) admit well-established probabilistic interpretations, many structured variants that impose global constraints lack a clear and tractable generative view. In this paper, we revisit a class of practically effective yet theoretically under-explored dictionary learning methods that impose a simple global regularization on the number of activated dictionary atoms, which we term parsimoniously activated dictionary learning (PADL). We show that PADL admits an equivalent formulation as maximum a posteriori estimation under a structured generative model, with auxiliary latent variables that govern global activation patterns. This formulation allows us to derive generalization guarantees that are difficult to obtain under the original formulation. More importantly, it yields an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy, enabling data-driven estimation of optimal hyperparameters. Based on this connection, we develop an efficient and interpretable PADL algorithm that eliminates manual hyperparameter tuning, achieving improved reconstruction performance under comparable sparsity levels on visual benchmarks. We further demonstrate its practical utility in accelerating inference for vision-language models.

[AI-124] Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance ECML2026

Link: https://arxiv.org/abs/2606.22350
Authors: Hanping Zhang, Adam Koziak, Yuhong Guo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ECML 2026

Abstract: Reinforcement Learning (RL) has been widely applied to sequential decision-making, yet it often suffers from poor sample efficiency due to costly interactions with the environment. A limited line of recent work has started exploring improving RL efficiency by leveraging external knowledge expressed in natural-language instructions. However, the few existing approaches typically treat the entire instruction as a single conditioning input, failing to account for the stage-dependent nature of language guidance, especially in complex environments. In this paper, we propose Hierarchical Reinforcement Learning with Language Instructions (HRLLI), a hierarchical RL framework that explicitly models natural-language instructions as dynamically selectable semantic guidance during decision-making. HRLLI decomposes instructions into a set of piecewise guidance elements, where each instruction piece may become relevant at different stages of interaction with the environment. A novel hierarchical RL policy structure is then formulated in a Select-to-Act paradigm: a high-level semantic policy acts as a guidance selector that selects the most relevant instruction piece to the current state to guide the low-level agent’s decision, while a low-level policy executes environment actions conditioned on the selected guidance. The two-level policies are learned simultaneously to maximize augmented expected returns from interactions with the environment. This design enables the agent to adaptively ground language instructions into stage-specific decisions during interaction. Experiments on the instruction-intensive RTFM benchmark show that HRLLI consistently outperforms strong instruction-conditioned RL baselines, demonstrating that explicitly modeling adaptive instruction selection significantly improves the effectiveness of RL.

[AI-125] Benchmarking Robot Memory Under Interference

Link: https://arxiv.org/abs/2606.22338
Authors: Soumil Rathi
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures

Abstract: Robots deployed in realistic settings will accumulate experience across many sessions and tasks over their deployment. The robot’s tasks may often require it to remember information from multiple sessions ago, making long-context robot memory important for real-world deployments. However, most robot-memory benchmarks today are based on single episodes or a short context. To measure how current robot memory systems perform on longer sessions with more distractions, we introduce RoboMME-Interference, a cross-session benchmark built on RoboMME. For each query episode, we construct a session history using the query’s relevant prior demonstration followed by a controlled number of unrelated sessions, which we provide to the VLA as memory and measure accuracy. Running RoboMME’s released memory-augmented $\pi_{0.5}$ variants unmodified through this benchmark, we find that while perceptual memory variants improve success when given the history without any distractors, they decay strongly and steadily as unrelated sessions accumulate. With this release, we emphasize the importance of long-context memory and robustness to interference and show that current systems largely fail on such capabilities. The project page, videos, code, and data are at this https URL.

[AI-126] Hypothesis-Driven Skill Optimization for LLM Agents

Link: https://arxiv.org/abs/2606.22330
Authors: Fangxin Shang, Yehui Yang
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:External skills can improve action-oriented LLM agents without changing model weights, but persistent skill updates are risky when they are distilled from sparse or noisy trajectories. A plausible reflection may encode a useful procedure, a spurious shortcut, or a rule that the target executor cannot reliably follow. We propose Hypothesis-Driven Skill Optimization (HDSO), a train-free framework in which both the skill curator and the agent executor are frozen inference endpoints. The curator observes executor traces, proposes a falsifiable hypothesis with an explicit validation plan, instantiates the hypothesis as a candidate skill package, validates the package through paired control/treatment executions, reviews behavior differences, and consolidates only supported candidates into an approved repository. The executor consumes approved skills through progressive disclosure, preserving the executor-only path when no skill is selected. On ALFWorld, HDSO improves executor-only baselines by +6.9 Avg. SR points for Qwen3-8B and +4.0 points for Qwen3.6-27B. Under 20% randomly flipped success/failure feedback during skill discovery and validation, HDSO preserves a +7.1-point gain for Qwen3-8B. Transfer and heterogeneous-pair diagnostics further show that validated repositories can be useful beyond the run that produced them, but cross-model curation succeeds only when curator diagnosis, executor capability, and validation evidence align. HDSO provides an auditable skill lifecycle for frozen action agents rather than an unconstrained memory accumulation procedure.

[AI-127] Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

Link: https://arxiv.org/abs/2606.22327
Authors: Li Kong, Qi Qi, Yinyu Ye, Zijie Zhou
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The explosive demand for interactive Large Language Model serving has highlighted the management of the Key-Value cache’s dynamic memory footprint as a critical area for performance optimization in inference engines. Modern inference systems overwhelmingly rely on time-centric scheduling heuristics, such as Shortest Job First. However, their theoretical optimality is rooted in traditional schedule modeling, failing to capture the highly dynamic, 2D spatio-temporal geometric growth specific to LLM inference mechanisms. To resolve this, we propose geometry-aware online scheduling, introducing the Smallest Volume First (SVF) algorithm and its highly efficient variant, 1-bit SVF. Theoretically, we provide a rigorous mathematical foundation for our approach. Utilizing a novel proof methodology, we tighten the worst-case competitive ratio for SVF with known output lengths from \text{CR} \le 48 to \text{CR} \le 5. Building upon this core breakthrough, we complete a comprehensive theoretical taxonomy analyzing our algorithms across different traffic scenarios and information availability. Practically, we seamlessly integrate our approach as a plug-and-play layer in vLLM. Extensive evaluations on Llama-3.1 models demonstrate comprehensive performance gains: SVF delivers strong reductions in both average and tail latency, while 1-bit SVF, with merely a single bit of information, achieves competitive throughput and latency. This work establishes a theoretically sound and empirically proven approach for resolving memory-constrained scheduling in modern LLM deployments. To facilitate future research, our code is available at this https URL.
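
The contrast between time-centric and geometry-aware ordering can be illustrated with a toy sketch. The abstract does not define "volume", so the definition below (KV-cache occupancy summed over decode steps) is an assumption made for illustration, and the names `Request`, `kv_volume`, and `smallest_volume_first` are hypothetical rather than taken from the paper or from vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    prompt_len: int  # tokens held in the KV cache when decoding starts
    output_len: int  # tokens to generate (assumed known in this sketch)


def kv_volume(req: Request) -> int:
    """Assumed 'volume': cache occupancy summed over all decode steps.

    At decode step t (1..output_len) the request holds prompt_len + t tokens,
    so the total is prompt_len * o + o * (o + 1) / 2 with o = output_len.
    """
    o = req.output_len
    return req.prompt_len * o + o * (o + 1) // 2


def shortest_job_first(requests):
    # Time-centric ordering: looks only at how long the request runs.
    return sorted(requests, key=lambda r: r.output_len)


def smallest_volume_first(requests):
    # Geometry-aware ordering under the assumed volume definition.
    return sorted(requests, key=kv_volume)


reqs = [Request("long-prompt", 4000, 10), Request("short-prompt", 10, 100)]
print([r.name for r in shortest_job_first(reqs)])
print([r.name for r in smallest_volume_first(reqs)])
```

In this toy case the two orderings disagree: the request with the long prompt finishes sooner but occupies far more cache over its lifetime, so a memory-aware ordering may prefer the other one.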

[AI-128] All Routes Lead to Collapse

Link: https://arxiv.org/abs/2606.22325
Authors: K. R. Balasubramanian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 figures, 7 tables

Abstract:Attention sinks, representation collapse, and norm stratification are treated as transformer-specific pathologies. We show they are not specific to attention: they are what content-based routing does under a fixed similarity metric. We give a reframing identity: softmax attention is Boltzmann-weighted aggregation over Euclidean distances with constant key norms, so its score omits a -|k|^2 term and is blind to key magnitude. This predicts that any router whose metric is ill-matched to its representations should compensate, by concentrating its routing and collapsing the routed representations. We test it on routers that score and aggregate over different axes: softmax attention over tokens (nine pretrained transformers), graph attention over nodes, a selective state-space model and a recurrent mixer over time, and learned residuals over depth. All develop the same signature, and two within-model ablations show it is caused by the routing mechanism rather than by incidental dynamics. The form is contingent, set by the strength of the positional brake each router carries alongside its content score; we sweep that brake and move the onset across its whole range. The mechanism is not contingent, and it does not require norm stratification: a router with norm-normalized keys concentrates just the same. We do not claim these models implement Riemannian geometry; the geometric view is a diagnostic that names the inadequacy of the flat, norm-blind metric.
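
The reframing identity mentioned in the abstract can be worked out in a few lines. This is a sketch in our own notation (query q, keys k_j, no temperature or 1/\sqrt{d} scaling); the paper's normalization may differ, for example by a factor of 2 on the norm term.

```latex
\begin{align*}
-\tfrac{1}{2}\lVert q - k_j \rVert^2
  &= q^\top k_j - \tfrac{1}{2}\lVert q \rVert^2 - \tfrac{1}{2}\lVert k_j \rVert^2 \\[4pt]
\frac{\exp\!\left(-\tfrac{1}{2}\lVert q - k_j \rVert^2\right)}
     {\sum_i \exp\!\left(-\tfrac{1}{2}\lVert q - k_i \rVert^2\right)}
  &= \frac{\exp\!\left(q^\top k_j - \tfrac{1}{2}\lVert k_j \rVert^2\right)}
          {\sum_i \exp\!\left(q^\top k_i - \tfrac{1}{2}\lVert k_i \rVert^2\right)}
  && \text{(the } \lVert q \rVert^2 \text{ term cancels)} \\[4pt]
  &= \frac{\exp\!\left(q^\top k_j\right)}{\sum_i \exp\!\left(q^\top k_i\right)}
  && \text{if } \lVert k_i \rVert = c \text{ for all } i .
\end{align*}
```

So dot-product softmax coincides with distance-based Boltzmann weighting only when all key norms are equal; otherwise the two differ by exactly the key-norm term that the dot-product score leaves out.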

[AI-129] Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model

Link: https://arxiv.org/abs/2606.22317
Authors: Pengxiang Cai, Tianchen Fang, Xiaohan Li, Qingyuan Zeng, Guocong Li, Jintai Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning with verifiable rewards (RLVR) is widely viewed as a promising path toward continuously improving large language models. Recent works, however, suggest that mainstream RLVR often reallocates sampling probabilities among trajectories already present in the base model: it can improve sampling efficiency, reflected by higher pass@1 scores, but yields limited gains, and can even decrease pass@k scores when k is large, and therefore may fail to expand the base model’s reasoning capacity boundary. In this paper, we present a boundary-aware Curriculum RL approach to move beyond the base model’s reasoning capacity boundary. Our approach first uses pass@k sampling to locate the current reasoning capacity boundary, then applies targeted teacher guidance to examples near or beyond that boundary, and finally uses RL to consolidate the newly introduced reasoning patterns. Across Qwen, Llama, and DeepSeek base models, boundary-aware Curriculum RL improves both pass@1 scores and pass@256 scores, with pass@1 reflecting one-attempt performance and pass@256 serving as an empirical proxy for the reasoning capacity boundary. In our experiments, average pass@256 improves by 9.8 percentage points over the base models and by 10.3 percentage points over Vanilla RLVR. These results suggest that boundary-aware Curriculum RL can provide a scalable route for LLMs to continuously improve beyond the base model’s empirical reasoning capacity boundary.
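
For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator computed from n samples per problem, c of which are correct. The abstract does not say which estimator the authors use, so treating it as this one is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of those samples that are correct
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # 1 - (ways to pick k incorrect samples) / (ways to pick any k samples)
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved by 1 of 256 samples is nearly invisible at k=1
# but counts as solved at k=256.
print(pass_at_k(256, 1, 1), pass_at_k(256, 1, 256))
```

This is why pass@1 and pass@256 can move in different directions: pass@1 rewards concentrating probability on solutions the model already finds, while pass@256 rewards covering more problems at all.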

[AI-130] Enhancing Protein Representation Learning via Manifold Restore Mixing

Link: https://arxiv.org/abs/2606.22307
Authors: Yizhou Dang, Chuang Zhao, Lianbo Ma, Guibing Guo, Xingwei Wang, Zhu Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data augmentation (DA) has been proven to be an effective means for improving protein representation learning (PRL) by generating additional training samples. Although mainstream perturbation- and sampling-based augmentation methods can produce data containing sufficient variations, they carry the risk of disrupting the protein structure and function. Some crafted protein homology modeling tools can generate conformations, but reduce structural diversity. The above dilemmas lead us to a question: Can we restore the disrupted structure caused by DA operations, providing data with both the original structure and diverse variations? In this work, we first analyze and empirically reveal the structure defect and performance degradation issues of existing DA methods. Based on the findings, we propose a simple yet effective DA method, Manifold Restore Mixing (MRM), for protein representation learning. Specifically, inspired by manifold mixup, we mix the hidden representations of original and augmented protein data to generate new samples that restore structural information lost in DA while introducing diverse variations. Furthermore, we develop a sample difficulty scheduler that adjusts the beta distribution in mixup to provide models with progressively challenging mixed samples during training, which improves the final performance. Comprehensive experiments on various PRL backbones and downstream tasks demonstrate the effectiveness and generalization of our method. The complete code and weights will be released upon acceptance. We provide an implementation at this https URL.
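
A minimal sketch of the mixing step described above, under stated assumptions: hidden representations are plain lists of floats, the mixing weight is drawn from a symmetric Beta(alpha, alpha), and the "difficulty scheduler" is modelled as a linear ramp on alpha. The ramp, its endpoints, and all function names are hypothetical; the abstract does not specify how MRM adjusts the beta distribution.

```python
import random


def mix_hidden(h_orig, h_aug, lam):
    """Convex combination of two hidden vectors (manifold-mixup style)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(h_orig, h_aug)]


def scheduled_alpha(step, total_steps, alpha_start=0.2, alpha_end=2.0):
    """Hypothetical linear ramp for the Beta concentration parameter.

    Small alpha puts most mass near 0 or 1 (samples close to one source);
    larger alpha puts more mass near 0.5 (more heavily mixed samples).
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)


def sample_lambda(step, total_steps, rng):
    """Draw a mixing weight in [0, 1] from Beta(alpha, alpha)."""
    alpha = scheduled_alpha(step, total_steps)
    return rng.betavariate(alpha, alpha)


rng = random.Random(0)
h_original = [1.0, 0.0, 2.0]   # hidden state of the original protein
h_augmented = [0.0, 1.0, 2.0]  # hidden state of its augmented view
lam = sample_lambda(step=500, total_steps=1000, rng=rng)
print(mix_hidden(h_original, h_augmented, lam))
```

Whether heavier mixing really corresponds to "more challenging" samples in the authors' sense is an assumption of this sketch.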

[AI-131] Leveraging Large Language Models to Obscure Code Stylometry: A Comparative Study of GPT-3.5 and GPT-4

Link: https://arxiv.org/abs/2606.22306
Authors: Saman Pordanesh, Benjamin Tan
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the rapidly evolving field of software development, code stylometry, the analysis of programmers' unique stylistic signatures, plays a critical role in authorship attribution and cybersecurity. Recent advancements in artificial intelligence, particularly Large Language Models (LLMs) like GPT-3.5 and GPT-4, have introduced new dimensions to this field, challenging traditional stylometry techniques. This study investigates the effectiveness of LLMs in altering code stylometry while preserving functionality and evaluates the impact of various prompt engineering strategies. Through comprehensive experiments, we assess how well these models can obscure stylistic signatures to avoid detection by a Random Forest classifier trained for authorship attribution. The results reveal significant differences in effectiveness between single-shot and multi-shot methods and highlight the importance of detailed, structured prompts. Additionally, functionality preservation checks demonstrate the challenges in maintaining code integrity post-modification. This research provides critical insights into the robustness of authorship attribution techniques against advanced AI capabilities, informing future cybersecurity and software engineering developments.

[AI-132] SCENIC: Semantic-Conditioned Edge-Aware Neural Framework for Structured IoT Command Generation

Link: https://arxiv.org/abs/2606.22296
Authors: Luke Ztz Hu, Hongbing Lang, Songping Mai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: IEEEtran journal format; 3 figures, 8 tables

Abstract:Edge Internet of Things (IoT) agents are often constrained by memory capacity, privacy requirements, communication latency, and recurring inference cost. Current smart-home assistants commonly rely on API-level command interfaces or cloud-based language models that remain difficult to deploy on edge devices. This paper addresses edge IoT command generation as a many-to-one structured output task, where multiple natural-language instructions map to the same canonical command string for deterministic smart-home parsing. To support this setting, we propose Semantic-Conditioned Edge-Aware Neural Framework for Structured IoT Command Generation (SCENIC), an end-to-end framework covering model architecture selection, Smart Home Instruct data generation, triplet-loss contrastive supervised fine-tuning, pruning and quantization, and deployment-oriented export. We evaluate sub-0.2B-scale transformer backbones, which are, to the best of our knowledge, among the smallest language-model backbones studied for edge IoT structured command generation. On Smart Home Instruct-Bench, the strongest dense decoder-only row reaches 99.0% EM@1, while the encoder-decoder model retains stronger high-sparsity behavior. A representative pruned INT8 encoder-decoder export preserves 91.0% EM@1 and 99.0% EM@5 while reducing exported model size by 25.38%. TensorRT profiling of the NVIDIA 2:4 sparse encoder export further shows up to 1.8x encoder-component speedup, indicating that the selected encoder-decoder deployment path can retain structured command accuracy under edge-oriented compression while hardware acceleration evidence remains component-level. The SCENIC code and experimental artifacts are open sourced to support reproducibility.

[AI-133] Active Sensing and Deferred-Decision Trajectory Optimization for Robust Target Identification

Link: https://arxiv.org/abs/2606.22277
Authors: Farbod Siahkali, Mengxue Hou, Vijay Gupta
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in IEEE Control Systems Letters (L-CSS), 2026. 6 pages

Abstract:We study trajectory optimization in mobile sensing systems that must identify which member of a finite candidate set is the true target, while maintaining reachability to all potential candidate targets, under resource constraints. Deferred-Decision Trajectory Optimization (DDTO) addresses this setting by computing trajectories that reach individual targets but remain coincident for as long as possible before separating toward different targets. We propose Active-Sensing DDTO (AS-DDTO), which extends DDTO by adding a trajectory-dependent information-acquisition term to the planning objective. The resulting planner maintains reachability to candidate targets while biasing the coincident portion of the trajectories toward regions that enable earlier target identification. The framework supports Bayesian updates and conformal candidate-set updates for distance-dependent sensing. We derive a mixed-integer conic reformulation and provide guarantees on recursive feasibility, belief concentration, and fixed-time coverage for the raw conformal candidate set. Numerical simulations show improved target identification compared with standard DDTO under distance-dependent sensing uncertainty and limited sensing budget.

[AI-134] From Handcrafted Features to Functional Edge Learning: Evolution of EEG Seizure Detection Frameworks

Link: https://arxiv.org/abs/2606.22258
Authors: Sepideh Kheirollahi, Mohammad Rasoul Roshanshah
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:Electroencephalogram (EEG) analysis remains the clinical gold standard for epilepsy diagnosis and seizure detection. While Deep Learning (DL) has significantly advanced automated EEG interpretation, its transition from controlled experimental settings to routine clinical deployment is severely bottlenecked by fundamental architectural flaws. Standard DL models operate as opaque black boxes lacking clinical interpretability, demand massive amounts of balanced annotated data, and incur steep computational costs incompatible with resource-constrained wearable or implantable neuromodulation devices. This paper presents a comprehensive review of these prevailing limitations and explores Kolmogorov-Arnold Networks (KANs) as an emerging paradigm for EEG-based seizure detection. By replacing the fixed activation functions of traditional neurons with flexible, learnable functions along the network’s connections, KANs bridge the critical gap between predictive accuracy and mathematical transparency. We systematically analyze how KAN architectures resolve the shortcomings of traditional DL-based models by offering exceptional parameter efficiency, inherent interpretability for physician trust, and robust performance under data scarcity. Ultimately, this review establishes KANs not merely as an incremental algorithmic update, but as a fundamental paradigm shift necessary to actualize next-generation, patient-specific, and thoroughly transparent clinical EEG monitoring systems.

[AI-135] On the Expressive Power of Weight Quantization in Large Language Models

Link: https://arxiv.org/abs/2606.22249
Authors: Shao-Qun Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, weight quantization that encodes the learnable parameters of large language models in an n-bit format has garnered significant attention due to its potential for model compression and inference acceleration. Many practical techniques have been developed; however, the theoretical understanding of many aspects, especially the approximation and degradation of expressive power as the number of quantization bits decreases, remains unclear. In this paper, we provide a theoretical investigation into the expressive capability of large language models relative to the number of quantization bits. We argue that 1.58-bit is the limiting precision for weight quantization by establishing the universal approximation and expressive collapse properties of weight-quantized models with respect to the number of quantization bits. Additionally, we confirm that weight quantization leads to expressive degradation, in which the expressive capacity of weight-quantized models degrades polynomially as the number of quantization bits decreases. These theoretical findings provide a solid foundation for advancing weight quantization in the context of scaling laws and shed insights for future research in model compression and inference acceleration.
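
As background on the figure "1.58-bit": it is conventionally read as the information content of a ternary weight, assuming each weight takes one of three values such as -1, 0, or +1. The abstract does not spell this out, so the reading below is an assumption.

```latex
\[
\log_2 3 \approx 1.585 \ \text{bits per weight},
\qquad
\text{compared with } \log_2 2 = 1 \ \text{bit for binary weights.}
\]
```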

[AI-136] Quantifying Theoretical AI Alignment Guarantees: Receiver-Utility Bounds in Bayesian Persuasion

Link: https://arxiv.org/abs/2606.22226
Authors: Eric Yachbes, Eva Tardos
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 12 pages, EC 2026 Poster and EC 2026 Incentive-Based AI Alignment Workshop Poster

Abstract:Misalignment can change how information moves from an AI agent to a human user. We model this as an information advantage: the AI agent observes the world state, while the human receiver only knows a prior and must act after seeing the agent’s signal. A strategic AI sender may withhold evidence or garble information in order to steer the human’s decision. We ask how much useful information can still reach the human when the AI optimizes a misaligned objective. We study a Bayesian persuasion model in which the world state is a bit string, the human receiver wants to guess the bits correctly, and a single AI sender wants the receiver to guess as many bits as possible as 1. For a prior \mu, let R_0(\mu) be the receiver’s utility from using only the prior, and let R_\max(\mu) be the largest receiver utility among signaling schemes that are optimal for the sender. We prove R_\max(\mu)/R_0(\mu) \leq 3/2. This bound improves for priors close to the independent product prior with the same marginals: if \mu(x) \geq (1-\eta)\pi_\mu(x) for every state x, then R_\max(\mu) \leq R_0(\mu) + \eta n. We also give a six-bit prior for which R_\max(\mu)/R_0(\mu) = 39/31 > 5/4, so no universal 5/4 bound is possible.

[AI-137] An Analysis of Untrained Deep Reservoir Networks for Audio Surveillance

Link: https://arxiv.org/abs/2606.22218
Authors: Corrado Baccheschi, Patrizio Dazzi
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted paper for AVSS 2026

Abstract:In this paper, we investigate untrained recurrent models from the Reservoir Computing (RC) paradigm for audio surveillance, focusing on bidirectional Echo State Networks with different depths, from shallow to deep configurations, for emergency sound event detection. We evaluate these models on the MIVIA Audio Events dataset in a multiclass setting across different Signal-to-Noise Ratio (SNR) levels, with the goal of assessing the trade-off between depth, recognition performance, and computational efficiency. We compare the proposed architectures against fully trained recurrent and convolutional-recurrent baselines, namely Bidirectional Long Short-Term Memory networks (BiLSTMs) and Convolutional Recurrent Neural Networks (CRNNs). Results show that deep and shallow reservoir-based models achieve competitive recognition rates, with deeper variants being more robust in highly noisy conditions and shallower ones offering the most favorable efficiency profile, particularly on edge devices such as the NVIDIA Orin. In addition, the proposed approach remains robust across different input representations, including log-Mel spectrograms and MFCCs with varying resolutions. These findings highlight untrained reservoir architectures as a promising solution for resource-constrained audio surveillance scenarios.

[AI-138] Sequential Minimal Optimization Algorithm for One-Class Support Vector Machines With Privileged Information

Link: https://arxiv.org/abs/2606.22210
Authors: Andrey Lange, Dmitry Smolyakov, Evgeny Burnaev
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 19 pages, 4 figures. Author-prepared preprint of the article published in IEEE Access. The published article is licensed under CC BY-NC-ND 4.0

Abstract:One of the powerful techniques in data modeling is accounting for features that are available at the training stage but are not available when the trained model is used to classify or predict test data: the Learning Using Privileged Information (LUPI) paradigm. Sequential Minimal Optimization (SMO) methods have been developed for supervised Support Vector Machines (SVM), unsupervised one-class SVM, and SVM with privileged information (SVM+). The missing brick in this research has long been a one-class SVM with privileged information (OC-SVM+). In this paper, we propose an SMO algorithm for OC-SVM+ that significantly outperforms non-sequential algorithms for training the OC-SVM+ model. Its finite-time convergence is established. The experiments show how privileged information affects a descriptive domain in the space of original features. Comparative benchmark tests demonstrate that our algorithm is superior to interior point algorithms.

[AI-139] Neural Conjugate Aggregation: Identifiable Unsupervised Multi-Sensor Regression under Heterogeneous Sensor Bias

Link: https://arxiv.org/abs/2606.22200
Authors: Muhammed Faruk Aytin, Zehra Demir, Alper Ünal, Julian Marshall, Gözde Ünal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:We study regression-based data fusion under uncertainty, where multiple noisy and biased measurement sources are available but ground-truth labels are absent during training. This setting arises in sensor networks, simulation ensembles, and scientific monitoring systems where supervision is costly or infeasible. We propose the Neural Conjugate Aggregation Model (NCAM), a hierarchical Bayesian framework that combines neural networks with conjugate Gaussian inference for unsupervised multi-source fusion. NCAM learns source-specific bias and reliability conditioned on contextual covariates, yielding an analytically tractable posterior over a latent target variable with decomposed epistemic and aleatoric uncertainty. Structural non-identifiability is resolved through sensor anchoring and variance regularization, enabling stable and interpretable posterior aggregation. To complement Bayesian uncertainty with finite-sample guarantees, we integrate locally adaptive Monte Carlo conformal prediction, producing heteroscedastic prediction intervals with coverage guarantees under exchangeability assumptions. Experiments on synthetic and real-world air-quality datasets demonstrate improved predictive accuracy and well-calibrated uncertainty compared to unsupervised baselines, including mean aggregation, probabilistic PCA, and Kalman filtering.
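
The conjugate Gaussian step at the core of this kind of fusion can be sketched as follows. The model assumed here is y_i = z + b_i + noise_i with Gaussian noise and a Gaussian prior on the latent target z. In NCAM the biases and variances are produced by neural networks conditioned on covariates; in this sketch they are fixed numbers, and the function name is hypothetical.

```python
def fuse_gaussian(readings, biases, variances, prior_mean=0.0, prior_var=1.0):
    """Posterior over a latent scalar z from biased, noisy sensor readings.

    Each reading is modelled as z + bias + Gaussian noise with the given
    variance. With a Gaussian prior the posterior is Gaussian, and its
    precision is the sum of the prior precision and the sensor precisions.
    """
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for y, b, v in zip(readings, biases, variances):
        precision += 1.0 / v
        weighted_sum += (y - b) / v  # remove the bias, weight by reliability
    post_var = 1.0 / precision
    post_mean = weighted_sum * post_var
    return post_mean, post_var


# Two sensors: the second reads 1.0 too high but is twice as precise.
mean, var = fuse_gaussian(
    readings=[2.0, 4.0], biases=[0.0, 1.0], variances=[1.0, 0.5]
)
print(mean, var)
```

The closed form is what makes the posterior analytically tractable: more reliable sources get proportionally more weight, and the posterior variance shrinks as sources are added.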

[AI-140] L20-Edu-135M: An Auditable Single-GPU Study of Data-Efficient Small Language Modeling

Link: https://arxiv.org/abs/2606.22189
Authors: Yin Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 tables. Model available at this https URL

Abstract:Small language models are cheap to serve and feasible on local hardware, but strong public 135M-class systems are commonly trained with hundreds of billions to trillions of tokens on large clusters. We study a sharply resource-constrained regime: a complete 134.5M-parameter language-model pipeline executed on one NVIDIA L20 GPU. The released checkpoint, L20-Edu-135M, receives approximately 13B pretraining tokens: 10B FineWeb-Edu tokens followed by a 3B-token educational, mathematics, code, and reasoning mixture. We document the architecture, data gates, cross-source MinHash/LSH near-deduplication, segment deduplication, benchmark-overlap removal, throughput optimization, supervised fine-tuning (SFT) with weight interpolation, and reinforcement learning from verifiable rewards (RLVR) on GSM8K. In a self-run zero-shot six-task harness, L20-Edu-135M obtains a mean score of 0.4150. It trails SmolLM-135M (0.4767) and SmolLM2-135M (0.4917), but its mean is 87.1% of SmolLM-135M’s while its nominal token count is 2.17% as large. This ratio is descriptive, not evidence of statistical equivalence or a controlled scaling law. The model exceeds several older 100M-160M public baselines under the same harness. Direct GRPO-style RLVR decreases GSM8K exact-match accuracy from 1.82% to 1.59% (192-token completions) and 1.21% (320-token completions). These single-run results identify a concrete failure mode rather than establishing a general lower bound on RLVR. The contribution is an auditable resource-constrained case study, not a state-of-the-art claim.
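
The ratios quoted in the abstract can be checked with simple arithmetic. The sketch below uses only the numbers stated there; the implied baseline token count is derived from the 2.17% figure and is not a number the abstract reports directly.

```python
# Scores and token counts as stated in the abstract.
l20_mean = 0.4150
smollm_mean = 0.4767
smollm2_mean = 0.4917
l20_tokens_b = 13.0   # approximate pretraining tokens, in billions
token_ratio = 0.0217  # L20-Edu-135M tokens as a fraction of SmolLM-135M's

# Mean score relative to each baseline, in percent.
rel_smollm = 100 * l20_mean / smollm_mean
rel_smollm2 = 100 * l20_mean / smollm2_mean

# Baseline token count implied by the stated 2.17% ratio, in billions.
implied_baseline_tokens_b = l20_tokens_b / token_ratio

# GSM8K exact-match change after RLVR, in percentage points.
drop_192 = 1.59 - 1.82
drop_320 = 1.21 - 1.82

print(f"{rel_smollm:.1f}% of SmolLM-135M, {rel_smollm2:.1f}% of SmolLM2-135M")
print(f"implied baseline tokens: about {implied_baseline_tokens_b:.0f}B")
print(f"GSM8K change: {drop_192:+.2f} and {drop_320:+.2f} points")
```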

[AI-141] Gated MLPs as Symmetry-Broken Rank-1 Bilinear Attention

Link: https://arxiv.org/abs/2606.22172
Authors: Nathan Breslow
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We show that the conventional gated MLP can be viewed as a rank-1 approximation to a bilinear attention mechanism with two distinct factors corresponding to the query and the key. We further show that moving the nonlinearity onto one factor breaks the exchange symmetry between the two factors and, for non-homogeneous activations, the inverse-scaling symmetry as well. This perspective may help explain why gated MLPs are effective in practice and inform the design of future architectures.
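
The two symmetries named in the abstract can be seen on a single hidden unit. The notation here is ours: a and b are the two weight vectors feeding one gated unit, and \sigma is the gate activation. How these map onto query and key is the paper's claim and is not derived here.

```latex
\begin{align*}
\text{No nonlinearity:}\quad
  h(x) &= (a^\top x)(b^\top x) = x^\top \left(a\, b^\top\right) x,
  \qquad \operatorname{rank}\!\left(a\, b^\top\right) = 1, \\
\text{exchange:}\quad
  (a, b) &\mapsto (b, a) \ \text{ leaves } h \text{ unchanged}, \\
\text{inverse scaling:}\quad
  (a, b) &\mapsto (c\,a,\ b/c),\ c \neq 0 \ \text{ leaves } h \text{ unchanged}. \\[6pt]
\text{Gated unit:}\quad
  h_\sigma(x) &= \sigma(a^\top x)\,(b^\top x), \\
  \sigma(a^\top x)\,(b^\top x) &\neq \sigma(b^\top x)\,(a^\top x)
  \ \text{ in general (exchange symmetry broken)}, \\
  \sigma(c\, a^\top x)\,\frac{b^\top x}{c} &= \sigma(a^\top x)\,(b^\top x)
  \ \text{ for all } x \iff \sigma(c\,u) = c\,\sigma(u) .
\end{align*}
```

The last condition holds for ReLU with c > 0 but fails for activations such as SiLU or GELU, which is consistent with the abstract's remark that the inverse-scaling symmetry breaks for non-homogeneous activations.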

[AI-142] Rebuttals Move Peer-Review Scores but Initial-Review Structure Bounds the Movement

Link: https://arxiv.org/abs/2606.22166
Authors: Mathieu Louis, Tibo Vanleke, Vincent Ginis, Andres Algaba
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 7 figures, 12 tables

Abstract:Author rebuttals are the main post-submission window in peer review, but their effect on reviewer scores remains hard to measure because score updates mix rebuttal content with initial score position, paper-level consensus, reviewer confidence, and discussion dynamics. We study ICLR 2024-2025 using 73,000 reviewer trajectories with externally archived pre- and post-rebuttal scores, and use LLMs only as measurement instruments. Gemini Flash 3.0 predicts implied pre-rebuttal scores from score-stripped review text. The resulting text-score offset predicts later movement, with score-increase rates rising from 8.3% when text reads below the assigned score to 31.9% when it reads above. Claude Opus 4.6 induces, and outcome-blinded Gemini Flash 3.0 validates, a 44-feature taxonomy of resolved reviewer-author exchanges, where 23 features replicate across model and held-out year under Bonferroni correction. In the rebuttal-engaged benchmark (n=6,705), initial-review structure already predicts much score movement (AUC=0.747, minimal AUC=0.696), while adding the resolved exchange raises AUC to 0.804. Rebuttals can move scores, but measurable movement is bounded by initial-review structure, and robust exchange signals are mostly rebuttal failure modes.

[AI-143] KITE: Decoupling Kinematics and Interaction for Zero-Shot Cross-Embodiment Manipulation

Link: https://arxiv.org/abs/2606.22113
Authors: Qianxu Wang, Kuan Fang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generalizing manipulation policies across robot embodiments remains difficult because standard policies entangle task reasoning with embodiment-specific motor control. We study zero-shot cross-embodiment manipulation, where a policy trained on source embodiments must be deployed on a structurally different target embodiment without additional task demonstrations. We introduce Kinematic Interaction Transfer across Embodiments (KITE), which decouples manipulation into embodiment-agnostic task reasoning and embodiment-specific motor control, connected through a learned latent representation of interaction intent based on contact patterns. Task reasoning is performed by a shared policy that predicts latent intents from source demonstrations, while motor control is performed by an intent-conditioned action decoder learned from each embodiment’s kinematic model. With KITE, adaptation to a new embodiment requires only training a new action decoder using its kinematic model, without recollecting demonstration data. We evaluate KITE on three manipulation tasks spanning transfer between parallel grippers, dexterous hands, and composite embodiments. KITE consistently achieves zero-shot transfer to structurally different target embodiments, outperforming state-of-the-art baselines in transfer success and task-embodiment scope.

[AI-144] CodeTeam: An LLM-Powered Multi-Agent Framework for Repository-Level Code Generation

Link: https://arxiv.org/abs/2606.22082
Authors: Yifei Wang, Ruiyin Li, Peng Liang, Qiong Feng, Zengyang Li, Mojtaba Shahin, Arif Ali Khan
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 36 pages, 5 images, 9 tables, Manuscript submitted to a Journal (2026)

Abstract:Natural language to repository generation (NL2Repo) requires a system to construct an entire software repository from a natural-language requirements document. Compared with function-level code generation, this task demands longer planning horizons, stable interfaces across files, and iterative debugging of cross-file inconsistencies. To address these challenges, we propose CodeTeam, an LLM-based multi-agent framework that separates planning, decision making, and implementation into distinct, coordinated stages. In the planning stage, multiple Architect agents draft competing software design sketches (SDS), optionally grounded by retrieved design references. A CTO agent then evaluates, selects, and normalizes the most promising SDS into a machine-checkable contract that specifies file ownership, public interfaces, and dependency constraints. In the implementation stage, Developer agents generate code under a dependency-aware scheduler with bounded context and lightweight Git-based coordination, while a QA agent runs tests and drives iterative repairs. On the synthesis-based SketchEval benchmark, we explicitly compare CodeTeam’s prompt-engineering (PE) and supervised fine-tuning (SFT) variants with the corresponding CodeS variants, where CodeTeam improves the overall SketchBLEU by 4.1 and 2.9 absolute points, respectively. On the execution-based NL2Repo-Bench benchmark, used as an external validation protocol, CodeTeam achieves the highest average test pass rate in both settings (34.6% PE, 42.3% SFT), confirming that the sketch improvements extend to functional correctness under upstream test suites. Ablation results show that project-specific developer allocation and retrieval-augmented planning each contribute substantially to the SketchBLEU improvement (9.9% and 8.1% relative, respectively). CodeTeam and the experimental results are available at this https URL.

[AI-145] New Smooth Loss functions for Robust Regression that Closely Approximate Absolute Error and Provide Improved Performance on Datasets With Significant Outliers

Link: https://arxiv.org/abs/2606.22068
Authors: Mathew Mithra Noel, Arindam Banerjee, Yug D. Oswal, Geraldine Bessie Amali D, Venkataraman Muthiah-Nakarajan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:The performance of supervised machine learning models is directly related to the quality of the training dataset. In particular, the presence of significantly many outliers in the training data can lead to low accuracy because popular loss functions such as the Mean Squared Error (MSE) assign very high importance to large errors. The Mean Absolute Error (MAE) loss assigns equal importance to all errors and is most robust to outliers, but suffers from being non-differentiable at the origin. MAE also has large derivative values close to its minimum, leading to instability and oscillations during training. Thus, differentiable approximations to MAE, namely the Huber and Log-Cosh losses, were introduced for robust regression tasks. This paper introduces two new infinitely differentiable loss functions that more closely approximate the MAE loss and provide improved performance on regression tasks with significantly many outliers in the training dataset. A comparison of the performance of regression models with different loss functions on a wide variety of benchmarks and datasets is presented to demonstrate the superior performance of the Square Root Loss (SRL) and Smooth Mean Absolute Error (SMAE) losses proposed in this paper. The SRL loss is shown to be strictly convex and the SMAE loss is shown to be strictly quasi-convex. Given the fundamental importance of linear regression, two new robust linear regression models are presented.
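
For context, the sketch below implements the losses the abstract names as baselines (squared error, absolute error, Huber, Log-Cosh) on a single residual. The formulas for the proposed SRL and SMAE losses are not given in the abstract, so they are deliberately not reproduced here.

```python
import math


def squared_error(r: float) -> float:
    return r * r


def absolute_error(r: float) -> float:
    return abs(r)


def huber(r: float, delta: float = 1.0) -> float:
    """Quadratic near zero, linear beyond |r| = delta."""
    a = abs(r)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def log_cosh(r: float) -> float:
    """log(cosh(r)), written so that large |r| does not overflow."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


# An outlier residual dominates squared error but not the robust losses.
for r in (0.1, 1.0, 10.0):
    print(r, squared_error(r), absolute_error(r), huber(r), log_cosh(r))
```

For large residuals both Huber and Log-Cosh grow linearly like MAE, but each sits a constant below |r| (0.5 for Huber with delta = 1, log 2 for Log-Cosh). That offset is one sense in which a smooth loss can approximate MAE more or less closely.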

[AI-146] Gradient-Descent Steps to Success over Mean Accuracy: A Paradigm Shift for ML

Link: https://arxiv.org/abs/2606.22053
Authors: Riccardo Poli, Ahmet Yilmaz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Traditional evaluation of machine learning (ML) models typically focuses on achieving the maximum possible accuracy irrespective of the computational cost. In this article, we propose a paradigm shift towards evaluating performance based on computational effort, explicitly defined here as the total number of gradient descent steps required to reach an acceptable level of accuracy with high probability. Building upon the concept of computational effort originally introduced by Koza for Genetic Programming, we extend this metric to any ML model trained via gradient descent. Furthermore, we demonstrate that minimising this effort acts as a novel form of Automatic Machine Learning (AutoML). By evaluating it across 11 diverse ML models and five standard classification datasets, we uncover significant insights into the dynamics of gradient-based learning. Our findings reveal that optimal hyper-parameters consistently favour unusually large learning rates. Crucially, we demonstrate that the rapid, aggressive landscape traversal enabled by these large rates not only promotes generalisation (as seen in phenomena like superconvergence) but also statistically minimises the expected computational effort for training. Furthermore, we identify distinct phase transitions in the optimal search strategy: while a single training run suffices for lower accuracy targets, reaching a model’s performance limit requires a dramatic shift towards conducting numerous independent, short restarts. Finally, we illustrate how this effort-based paradigm provides a robust framework for model selection, allowing practitioners to choose optimal algorithms based on the difficulty of a problem as perceived by different models for a given target accuracy, or to maximise the achievable accuracy for a fixed budget of gradient descent steps.
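
The idea of computational effort can be illustrated with a short sketch in the spirit of Koza's measure: if one run of s gradient steps reaches the target accuracy with probability p, then ceil(ln(1 - z) / ln(1 - p)) independent runs are needed to succeed at least once with probability z, and the effort is s times that number of runs. The paper's exact definition may differ; the function names and the success probabilities below are hypothetical.

```python
import math

def runs_needed(p_success: float, z: float = 0.99) -> int:
    """Independent restarts needed to succeed at least once with probability z."""
    if not 0.0 < z < 1.0:
        raise ValueError("z must lie strictly between 0 and 1")
    if p_success <= 0.0:
        raise ValueError("target is never reached at this budget")
    if p_success >= 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - z) / math.log(1.0 - p_success)))

def computational_effort(steps_per_run: int, p_success: float, z: float = 0.99) -> int:
    """Total gradient steps: steps per run times the number of restarts."""
    return steps_per_run * runs_needed(p_success, z)

def best_budget(success_by_steps: dict, z: float = 0.99):
    """Pick the per-run step budget that minimises total effort."""
    efforts = {s: computational_effort(s, p, z)
               for s, p in success_by_steps.items() if p > 0.0}
    best = min(efforts, key=efforts.get)
    return best, efforts[best]

if __name__ == "__main__":
    # Hypothetical measurements: per-run success probability at two step budgets.
    print(best_budget({100: 0.5, 1000: 0.999}))
```

With these made-up numbers, seven short runs of 100 steps cost less than one long run of 1000 steps, which is the kind of restart-favouring regime the abstract mentions.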

[AI-147] Attractor Domain Theory: A Mathematical Framework for Cardiovascular Attractor Analysis with Wearable Photoplethysmography (PPG) Validation

Link: https://arxiv.org/abs/2606.22039
Authors: Timothy Oladunni,Farouk Ganiyu Adewumi
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The cardiovascular system evolves along a bounded trajectory in physiological state space that converges to a compact geometric object: the cardiac attractor. A wearable photoplethysmograph (PPG) or electrocardiograph (ECG) observes a one-dimensional projection of this attractor; by Takens’ embedding theorem, delay coordinates reconstruct its full geometry. Three decades of nonlinear cardiac dynamics have extracted Lyapunov exponents, recurrence statistics, and sample entropy from reconstructed attractors, yet no principled account exists of which attractor properties capture which cardiovascular quantities, or why, leaving feature selection as a search problem and negative results uninterpretable. We introduce Attractor Domain Theory (ADT), which proves that the reconstructed attractor’s information partitions into three mutually non-redundant domains: the Geometry Domain G (delay embedding; native capability: artifact rejection), the Ergodic Domain S (asymptotic statistical invariants; native capability: stability estimation), and the Variational Domain V (finite-time Lyapunov exponent field; native capability: hemodynamic inference). We prove a Domain Sufficiency Theorem (the Parseval analog for attractor information) and establish that three domains are necessary and sufficient. Geometry Domain validation via the SCSI framework across 176,742 PPG segments from four datasets yields AUC = 0.757 [0.686-0.828] and NPV = 0.966 after correcting three systematic evaluation artifacts (+0.179 net inflation). Ablation confirms C_NL as the dominant Geometry Domain component (Delta AUC = -0.413) and intra-domain redundancy across five components.
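
The delay-coordinate reconstruction mentioned in this abstract can be sketched in a few lines: each embedded point collects `dim` samples of the one-dimensional signal spaced `tau` samples apart. This is only the generic Takens-style embedding, not the paper's pipeline; the function name and the choice of `dim` and `tau` are assumptions, and in practice both are usually selected from the data.

```python
def delay_embed(signal, dim: int, tau: int):
    """Return delay vectors (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    n_points = len(signal) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("signal too short for this dim and tau")
    return [tuple(signal[i + k * tau] for k in range(dim))
            for i in range(n_points)]

if __name__ == "__main__":
    # Toy series standing in for a PPG segment.
    samples = [0, 1, 2, 3, 4, 5]
    print(delay_embed(samples, dim=3, tau=2))
```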

[AI-148] A Completion-Aware Framework for Impactful Counterfactual Explainability in Graph Neural Networks ECML KDD2026

Link: https://arxiv.org/abs/2606.22033
Authors: Maria Myrto Villia,Filippos Gouidis,Theodore Patkos,Panos Trahanias
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ECML PKDD 2026

Abstract:In this study, we propose a novel pipeline for generic, model-agnostic, local-level counterfactual explainability in graph neural networks (GNNs). Although counterfactual explainers capable of both adding and removing edges have emerged in recent years, the need for generic and efficient solutions remains unmet, particularly concerning qualitative explanation generation. Our approach couples progress in factual explainability with missing edge prediction models rooted in link prediction research, in order to enhance the quality, robustness and intuitiveness of explanations. A multi-faceted experimental analysis conducted on real-world and synthetic graph classification benchmarks, both binary and multi-label, demonstrates the advancements in comparison to state-of-the-art baselines across diverse metrics.

[AI-149] RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation

Link: https://arxiv.org/abs/2606.22027
Authors: Pengzhi Yang,Xinyu Wang,Pengyu Jing,Kehan Wen,Yiduo Qu,Zhenhao Huang,Minghao Fu,Xin Liu,Yaheng Shen,Fan Shi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning for robot manipulation is often bottlenecked by reward design, especially in long-horizon tasks: sparse success rewards provide weak supervision, while hand-crafted dense rewards are tedious to design and generalize poorly across tasks. Progress-based reward models offer a promising alternative by estimating how far an observation has advanced toward task completion, but existing approaches often require task-specific demonstrations or progress labels, and can assign high rewards to visually plausible but physically incorrect states. We introduce the Reference-Anchored Reward Model (RARM), a lightweight visual comparator that converts a single successful demonstration into a dense, progress-aware reward. RARM is trained once on general-purpose videos with a contrastive temporal objective, requiring no robot-specific data, task-specific reward labels, or per-task reward engineering. At deployment, RARM matches rollout clips to reference clips and rewards only confident forward progress, suppressing uncertain matches that may otherwise produce false-positive rewards. Across 9 simulated manipulation tasks from LIBERO and MetaWorld and 4 real-world tasks, RARM achieves the best overall success rates in subsequent RL training, with particularly large gains on long-horizon tasks such as cloth folding, where unreliable progress estimates are especially harmful.

[AI-150] Cluster-Specific Localized Drift Detection for Efficient Batch Model Adaptation under Controlled Distribution Shift

Link: https://arxiv.org/abs/2606.22026
Authors: Ignacio Cabrera Martin,Marcello Trovati,Almas Baimagambetov,Nikolaos Polatidis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Machine learning systems deployed in dynamic environments frequently operate under nonstationary data distributions, where controlled distribution shift can progressively degrade predictive performance. However, many widely used tabular benchmark datasets lack explicit temporal structure, limiting reproducible evaluation of drift adaptation methods. This work proposes a cluster-induced distribution shift simulation framework that transforms static tabular datasets into controlled evolving data streams through structured perturbations across feature-space partitions. Using this framework, six adaptation strategies are systematically evaluated: static learning, sliding-window retraining, global ADWIN retraining, cluster-local ADWIN retraining, random subspace drift detection, and feature-partitioned drift detection. Experiments are conducted on five benchmark datasets covering both classification and regression tasks using diverse predictive model families, including linear models, k-Nearest Neighbours, tree ensembles, boosting methods, and adaptive online learners.

[AI-151] Channel Location Constrains the Auditability of Subliminal Learning

Link: https://arxiv.org/abs/2606.22019
Authors: Tamas Madl
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Subliminal learning lets a student inherit a teacher’s hidden trait from distillation data that never names it. We ask when such transfer can be audited before training. The answer is not model identity or scale alone, but channel location: the carrier through which the trait reaches the student. We find three regimes. In a controlled initialization-dependent body channel, a pre-training screen works. Coverage, the cosine between the student’s initial distillation update and the teacher’s fine-tuning displacement, predicts held-out transfer (Spearman $\rho \approx 0.95$; AUROC 0.997). In pretrained language models, masked single-token traits instead ride convergent vocabulary geometry. This channel is initialization-independent, so initialization-alignment screens, including coverage, are not mechanistic; the useful handles are post-hoc detection and targeted mitigation. Even when a single-token named entity is removed from the loss, the student’s held-out probability for that entity rises to 0.40 on average ($\sim 2500\times$), and a related semantic class transfers. In an untied-head model, orthogonalizing the trait’s output row against entangled neighbours collapses leakage, while equal-size random-subspace edits do not. Thus removing a target string from distillation labels does not remove the corresponding preference: neighbouring tokens can carry it. Finally, conditional behaviours can route through the network body. For sycophancy, with agreement and correction markers masked from the loss, transfer reaches about 0.63 of the teacher’s effect, localizes to body computation, and evades four audits across two model families. We scope this as masked transfer of a condition-present policy. Channel location is necessary for deciding which audits can be sound. It is not a deployment-ready screen: an audit used outside its carrier regime can give false assurance.
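
The coverage quantity in this abstract is described as a cosine between two vectors, so a minimal sketch is just a cosine similarity over flattened parameter vectors. How the paper extracts the student's initial distillation update and the teacher's fine-tuning displacement is not stated in the abstract; the function name and the toy vectors below are assumptions.

```python
import math

def coverage(student_update, teacher_displacement) -> float:
    """Cosine between two equally sized, flattened parameter-space vectors."""
    if len(student_update) != len(teacher_displacement):
        raise ValueError("vectors must have the same length")
    dot = sum(s * t for s, t in zip(student_update, teacher_displacement))
    norm_s = math.sqrt(sum(s * s for s in student_update))
    norm_t = math.sqrt(sum(t * t for t in teacher_displacement))
    if norm_s == 0.0 or norm_t == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_s * norm_t)

if __name__ == "__main__":
    # Toy vectors: aligned directions give coverage near 1, orthogonal ones give 0.
    print(coverage([1.0, 0.0], [2.0, 0.0]))
    print(coverage([1.0, 0.0], [0.0, 3.0]))
```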

[AI-152] IRumAI: Reinforcement Learning for Indian Rummy

Link: https://arxiv.org/abs/2606.21975
Authors: Vignesh Mohan
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures, 6 tables

Abstract:Despite its massive player base and complex hidden-information dynamics, Indian Rummy has received no reinforcement learning attention. Existing agents rely on combinatorial search, which is tactically strong but slow at inference. We present IRumAI, the first RL agent for the domain. IRumAI integrates Proximal Policy Optimization (PPO), meld-aware observation encoding, deadwood-driven reward shaping, and a dual-branch convolutional architecture. IRumAI is RL-trained solely against weak heuristics, after a one-time behaviour-cloning warm-start on stronger demonstration data. It generalises to defeat the entire baseline hierarchy, including a 53.9% win rate against the strongest search-based opponent unseen during RL training. Bypassing explicit search, IRumAI requires just 0.33 ms per action, which is over 7,000x faster than the state-of-the-art heuristic. Ablations validate our architectural choices, and linear probing reveals that the network implicitly models the opponent’s hidden hand from public interactions.

[AI-153] SPOTR: Spatio-temporal Pooling One-Token Reconstruction for Universal Physiological Signal Self-supervised Learning ECAI2026 IJCAI

Link: https://arxiv.org/abs/2606.21973
Authors: Yiyu Gui,Mingzhi Chen,Yuesheng Zhu,Guibo Luo,Yuchao Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted by IJCAI-ECAI 2026

Abstract:Physiological signals such as EEG, ECG, and PPG are widely used in clinical monitoring. Recent self-supervised learning (SSL) methods offer an attractive way to leverage unlabeled recordings, yet they still fall short in practice. In particular, current SSL methods struggle across heterogeneous datasets, often distorting clinically meaningful structures or learning shortcuts from temporal and cross-channel redundancy. Consequently, existing SSL methods often deliver limited performance under linear probing, a lightweight adaptation setting that better matches real-world medical scenarios. Moreover, most Transformer-based SSL models encode a flattened spatiotemporal token sequence, incurring high computation and memory cost, and are typically developed within a single modality. To address these limitations, we present SPOTR (Spatio-temporal Pooling One-Token Reconstruction), a compress-reconstruct pretraining framework that introduces a single-token global bottleneck for physiological signals. SPOTR compresses each waveform into a single-token representation and reconstructs the signal conditioned only on this representation. Meanwhile, SPOTR introduces an efficient spatio-temporal compaction module to reduce computation and memory cost. Pretrained on 20 datasets spanning EEG, iEEG, ECG, and PPG, SPOTR consistently outperforms the strongest baseline under linear probing, improving average AUC by 18.49%, 21.71%, 17.86%, and 4.64%, respectively. Compared with a representative general-purpose time-series foundation model, SPOTR achieves around 78% lower latency and 52% lower peak GPU memory on average. The code can be found at this https URL.

[AI-154] Human vs Machine Mathematical Difficulty on Project Euler: An Experimental Analysis

Link: https://arxiv.org/abs/2606.21972
Authors: David Holmes,Johannes Schmitt
Subjects: Artificial Intelligence (cs.AI); History and Overview (math.HO)
Comments: 33 pages, comments welcome!

Abstract:We study how the effort and success probability of frontier AI systems scale with human difficulty on problems from Project Euler, an online platform of computational mathematics problems. Our dataset, from the MathArena benchmark, consists of 3840 attempts across 50 problems and 26 model configurations, with problem difficulty measured by the site’s public human solve times. Motivated by a proposal of Timothy Gowers, we test a power-law relation $t_{\text{machine}} = a \cdot t_{\text{human}}^{b}$ between generated-token cost per successful answer and human time, and find $b < 1$ for 20 of the 25 models with usable fits, including the strongest base models; this operationalization therefore does not support an earlier prediction that machines scale worse than humans with difficulty. We also investigate whether success probability on the tested problems can be modeled by a simple exponential decay $p_{\text{success}} = e^{c\, t_{\text{human}}}$, predicting a linear relation between $\log p_{\text{success}}$ and $t_{\text{human}}$. Using a binning approach for data aggregation, we find moderate empirical support (median bin-level $R^2 = 0.92$ across the 22 best-covered configurations) for this model. Following METR, we also fit logistic success curves and extract 50% task-length horizons $h_{50}$; the strongest configurations in our 20 April 2026 snapshot reach roughly 2.5 to 4.3 hours on our fastest-five human baseline, with a log-linear fit through the state-of-the-art frontier giving a descriptive doubling time of about 75 days for the SOTA $h_{50}$.
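
A power law of the form t_machine = a * t_human^b becomes a straight line after taking logarithms, log t_machine = log a + b * log t_human, so the exponent b can be estimated by ordinary least squares in log-log space. The sketch below does this on synthetic data; the abstract does not say which fitting procedure the authors used, so the function name and the data are assumptions.

```python
import math

def fit_power_law(t_human, t_machine):
    """Least-squares fit of log(t_machine) = log(a) + b * log(t_human)."""
    if len(t_human) != len(t_machine) or len(t_human) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    xs = [math.log(t) for t in t_human]
    ys = [math.log(t) for t in t_machine]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("t_human values must not all be equal")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

if __name__ == "__main__":
    # Synthetic example generated from a = 3, b = 0.5 (not data from the paper).
    human = [1.0, 4.0, 16.0, 64.0]
    machine = [3.0 * math.sqrt(t) for t in human]
    print(fit_power_law(human, machine))
```

An exponent below 1 means machine cost grows more slowly than human time as problems get harder; an exponent above 1 means the opposite.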

[AI-155] REBA: A Revealed Belief Automaton Framework for Online Planning in Continuous POMDPs

Link: https://arxiv.org/abs/2606.21971
Authors: Xiangwei Chen,Lingling Fang,Andreas Holzinger,Liming Chen
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Online planning in continuous partially observable Markov decision processes (POMDPs) using $\omega$-regular specifications requires handling continuous belief dynamics within the finite symbolic memory in order to track temporal progress. Existing methods based on either direct search in belief space or predefined discrete abstractions suffer from drawbacks, e.g., a lack of symbolic memory for long-horizon logical progress, or abstractions that are difficult to certify from noisy online beliefs. As such, obtaining reliable symbolic states online from continuous observations remains a challenge. To address this issue, we introduce the Revealed Belief Automaton (REBA), an event-driven framework that advances the research from global belief-space discretization to a fundamentally new way of thinking, namely online certification of revelation events. Specifically, we propose an online revelation method that, through information-theoretic gates, can dynamically analyse and establish belief abstraction from the continuous belief space by discovering reliable anchors among noisy beliefs. We then develop an incremental topology adaptation mechanism over the certified anchors to realise the online finite Belief Automaton. By combining with the $\omega$-regular specification, REBA is able to support formal parity policy synthesis without a predefined discrete abstraction, which in turn can guide the Monte Carlo Tree Search process to perform online search beyond its local horizon. In addition, we design an error decomposition analysis which can assess the effectiveness and reliability of this discrete guidance for the underlying continuous POMDP. Empirical evaluations in patrolling and navigation scenarios show that REBA matches or exceeds all evaluated baselines, with primary metric gains of +17.0% to +47.4% over state-of-the-art approaches.

[AI-156] Holmes: Multimodal Agentic Diagnosis for Mixed-Language Mobile Crashes at Industrial Scale

Link: https://arxiv.org/abs/2606.21963
Authors: Jia Li,Wenyuan Ma,Ting Peng,Haibin Zheng,Yuetang Deng
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted at FSE’26 Industry Track

Abstract:Diagnosing mobile crashes in ultra-large-scale industrial applications is a formidable challenge due to the sheer volume of code, the complexity of mixed-language environments, and the inability to reproduce failures locally. Traditional static analysis struggles with scalability, while existing LLM-based agents often rely on reproducible environments unavailable in post-mortem scenarios. We present Holmes, a multi-agent system that automates root cause analysis by synthesizing multimodal runtime signals (stack traces, logs, and thread states) to reconstruct failure contexts without reproduction. Holmes introduces a hierarchical Retrieve-Explore-Reason architecture that leverages low-level artifacts (e.g., registers, assembly) to bridge the semantic gap between open-source business logic and closed-source system frameworks. By dynamically compressing the search space using runtime clues, Holmes precisely navigates 70-million-line codebases to identify non-local defects. Evaluated on real-world crashes from WeChat, Holmes achieves 87.6% accuracy in function-level fault localization and reduces average investigation time by over 98% (to ~77 seconds), demonstrating its effectiveness in transforming labor-intensive debugging into an efficient verification workflow.

[AI-157] From RAN Control to Agentic Intelligence: Architecture and Vision for Energy Efficient AI-RAN

Link: https://arxiv.org/abs/2606.21955
Authors: Sabrine Aroua,Alexis I. Aravanis,Ilias Chatzistefanidis,Hamza Abbar,Anh-Khoa Dang,Anastasios Giovanidis,Salah-Eddine El Ayoubi,Stephane Senecal,Martha Vlachou Konchylaki,Navid Nikaein
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: preprint under review at IEEE Network Magazine

Abstract:Future 6G networks will rely on highly distributed, AI-native Radio Access Networks (RANs), where communication and AI workloads share a common infrastructure. This evolution, combined with increasing deployment density and continuous AI processing, is expected to significantly increase RAN energy consumption. While Open RAN (O-RAN) introduces a programmable and modular control framework through the RAN Intelligent Controller (RIC) and Service Management and Orchestration (SMO), current approaches remain largely policy-driven, limiting adaptive energy-aware coordination across multiple applications. In parallel, AI-RAN promotes the convergence of AI and RAN infrastructures through AI-for-RAN, AI-on-RAN, and AI-and-RAN paradigms, yet efficient mechanisms to jointly orchestrate performance, latency, and energy remain an open challenge. This article proposes an agentic AI-native RAN architecture that bridges O-RAN’s structured control with AI-RAN’s unified vision. Leveraging semantic intent abstraction and Large Language Model (LLM)-driven coordination, the framework enables adaptive orchestration, conflict resolution, and energy-aware multi-objective optimization across heterogeneous workloads. Through representative AI-for-RAN and AI-on-RAN use cases, we show how such coordination can improve resource efficiency and reduce operational energy consumption, paving the way toward sustainable 6G networks.

[AI-158] Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

Link: https://arxiv.org/abs/2606.21943
Authors: Zhao Yang,Yuxuan Jiang,Ting-Chih Chen,Lincen Yang,Annie Wong,Chao Gao,Jacob E. Kooi,Zhong Li,Jiayang Shi,Kevin Qiu,Qi Huang,Xinrui Zu,Shiping Yang,Hengyuan Zhang,Ngai Wong,Filip Ilievski,Shujian Yu,Aske Plaat,Zhaochun Ren,Mark Hoogendoorn,Vincent François-Lavet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) has become central to LLM post-training, yet the methods that dominate current pipelines, PPO and GRPO, represent only a narrow slice of what RL offers. Understanding why these methods prevail, and what alternatives exist, requires a principled examination of the design decisions that underlie any RL algorithm. This survey organizes that examination around three stages of algorithm construction. We begin with MDP creation: how the reward function, state space, action space, termination condition, and discount factor are, or could be, defined for LLM training. We then turn to exploration, covering temperature sampling, entropy regularization, intrinsic motivation, tree search, and curriculum learning. Finally, we address learning along four classical RL dimensions: model-free versus model-based, value-based versus policy-based versus actor-critic, on-policy versus off-policy, and credit assignment, including both Monte Carlo methods, which rely on full return estimates, and bootstrapping methods, which update estimates using other learned predictions. Mapping the LLM literature onto this taxonomy reveals a strikingly non-uniform distribution of research effort. Critic-free policy gradients and Monte Carlo credit assignment are densely populated, while value-based methods, off-policy actor-critic training, and bootstrapping-based credit assignment remain largely unexplored despite well-established counterparts in classical RL. These gaps represent concrete opportunities for transferring proven RL techniques to LLM training. By making these gaps explicit alongside the methods that have proven effective, this survey offers researchers in both RL and LLMs a shared framework for understanding current practice and identifying promising directions for future work. 

[AI-159] Skills for the future software profession: beyond agentic AI!

Link: https://arxiv.org/abs/2606.21894
Authors: Sungmin Kang,Baishakhi Ray,Abhik Roychoudhury
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:As coding agents are rapidly changing software engineering, a natural question is: what are the core skills needed by future software engineers? To identify where software engineering is headed and thus what skills will be needed, we summarize the results of two round-tables with researchers and industrial practitioners, held in 2026 in New York and Singapore. One key finding is that verification and validation is increasing in importance as agents handle implementation, as highlighted by anecdotes from the events. From our observations, we identify the skills developers need in the agentic era of development, with implications for training and educating future software engineers in coming years.

[AI-160] Improving Engine Sound Analysis in Hot-Test Environments via a RAB-U-Net (Residual Attention Block U-Net) Noise Removal Method

Link: https://arxiv.org/abs/2606.21887
Authors: Raheleh Mohseni,Mahdi Alyari
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:During hot tests on a production line, engine-sound analysis is crucial to ensuring product quality and performance. However, background noise often interferes with accurate sound analysis, leading to potential errors in engine diagnostics. Traditionally, skilled technicians listen to engine sounds to assess engine health, but this is prone to significant inaccuracies. This study presents an innovative deep learning-based approach to address this issue by removing background noise from engine sound recordings using a U-Net neural network structure enhanced with Residual Attention Blocks (RAB-U-Net). Our intelligent noise removal system significantly improves the accuracy of engine noise detection, outperforming traditional techniques and providing a robust solution for real-time applications in production line environments. This study proposes a novel system for engine noise detection in production lines, marking a valuable advancement for the automotive industry in applying deep learning methods to improve the quality of engine diagnostics.

[AI-161] Cohort-Anchored Foundation Models for Electronic Health Records: From Risk Scores to Auditable Peer Cohorts

Link: https://arxiv.org/abs/2606.21885
Authors: Kaiping Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Foundation models have achieved remarkable performance across medical question answering, imaging, and electronic health record (EHR) tasks, yet reliable clinical deployment remains challenging due to limited interpretability, vulnerability to distribution shift, and weak alignment with clinician reasoning. We argue that these limitations arise because existing approaches prioritize representation learning while treating patient comparison as an emergent property rather than a primary source of clinical evidence. To address this gap, we propose CAFM, a Cohort-Anchored Foundation Model framework that elevates patient cohorts to a first-class object throughout the learning pipeline. The framework consists of four stages: deviation-aware data curation, cohort-conditioned pretraining, multimodal cohort alignment, and clinician-in-the-loop refinement. Together, these stages improve data quality, organize representations around clinically meaningful cohort structure, preserve modality-specific relationships, and support auditable clinical decision-making. The framework is compositional and can augment existing EHR foundation models without modifying their underlying encoders. We illustrate CAFM through four clinical case studies spanning acute kidney injury prediction, cardiovascular risk stratification from electrocardiograms, optic neuropathy triage from orbital imaging, and electroretinogram-grounded report generation. We further present five empirically testable hypotheses and identify open challenges in data quality, irregular temporality, multimodal learning, distribution shift, and evaluation beyond predictive accuracy. We argue that explicitly anchoring foundation models to patient cohorts provides a principled path toward trustworthy clinical AI.

[AI-162] Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead INTERSPEECH2026

Link: https://arxiv.org/abs/2606.21882
Authors: Muyang Du,Jason Roche,Junjie Lai
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, 4 tables, Interspeech 2026

Abstract:Streaming text-to-speech synthesis in cascaded LLM-TTS systems still faces latency challenges as most TTS models require full context before initiating generation. We present S5-TTS, a streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis through encoder-decoder language modeling and monotonic alignment learning. S5-TTS begins generating speech immediately after receiving the first few words, substantially reducing end-to-end response latency. To maintain quality under limited lookahead, we introduce a lookahead-causal masking mechanism with Conv-based auxiliary attention that preserves intelligibility and speaker similarity, and employ interleaved multi-source distillation to further restore naturalness. Experiments show that S5-TTS achieves comparable quality to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency for practical conversational AI systems.

[AI-163] AgentRiskBOM: A Risk-Scoping Security Bill of Materials for Agentic AI Systems

Link: https://arxiv.org/abs/2606.21877
Authors: Srimonti Dutta,Akshata Kishore Moharir
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments: Accepted at IEEE International Conference on Cybersecurity and AI-Based Systems (Cyber-AI 2026)

Abstract:Agentic AI systems retrieve private context, invoke tools, write files, call external services, coordinate with other agents, and may act without human approval. Existing bill of materials artifacts improve transparency for dependencies, model metadata, and training provenance, but leave an agentic transparency gap: capability opacity, the absence of a structured account of what a deployed agent can access, remember, change, delegate, and prove afterward. This paper introduces AgentRiskBOM, a security BOM for risk-scoping tool-using AI agents. It is an additive layer over SBOM, AIBOM, and MLBOM artifacts, referencing them where authoritative while adding fields for runtime authority: autonomy, tool permissions, memory, credential scope, approval gates, audit signals, inter-agent communication, and external action capability. We implement AgentRiskBOM as a JSON-schema artifact with a reproducible corpus, risk scenarios, scorer, diff detector, control mapper, and reports. We evaluate AgentRiskBOM on 13 open-source agents spanning coding, RAG, and multi-agent archetypes, plus 52 risk scenarios across 14 categories. The schema validates all 13 corpus artifacts. Coverage analysis gives AgentRiskBOM a native-equivalent score of 14 across 16 capability dimensions, vs. 1 for SBOM, 1.5 for AIBOM and 2 for MLBOM. Across modeled risk categories, AgentRiskBOM exposes 100% risk-category visibility vs. 10.5% for SBOM-like and 20.9% for AIBOM-like views. To test agentic authority drift, we inject 33 structured deployment mutations; the diff detector identifies the correct change type for all mutations. A secondary penalty-based scorer yields a Spearman correlation of 0.73 with the primary scorer, supporting rank-level consistency while showing that thresholds require human calibration. The results show that agentic AI security needs a machine-readable authority-and-risk artifact before incidents occur.

[AI-164] Protein contacts are already in the attention: a single-forward-pass alternative to the Categorical Jacobian

Link: https://arxiv.org/abs/2606.21876
Authors: Rome Thorstenson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 7 figures. Code and data: this https URL

Abstract:The Categorical Jacobian (CJ) of Zhang et al. (2024) reads protein contacts from a language model by perturbing every residue with every alternative amino acid, about 19L forward passes. We show the signal it reconstructs is already concentrated in a small subset of attention heads: averaging the top-K contact-relevant heads, selected on as few as 10 labeled proteins, recovers contacts in one forward pass and beats CJ on leakage-clean data for every bidirectional model where CJ is defined, and matches or beats it in-distribution (the exceptions being the smallest 8M model and a statistical tie on ESM Cambrian). Ablations localize the gain to labeled head selection, not averaging: at a matched label budget the unweighted mean ties a supervised L1 logistic regression on the same heads, so the parameter-free mean is selection’s minimal form, not the source of the advantage. Our primary test is leakage-clean: on a CAMEO split where neither selection nor evaluation touches data the models have plausibly memorized, the head readout beats CJ on ESM-2-650M by +9 pp (N=29, p<0.001), with the within-model margin reproducing across architectures on a wider pretraining-aware set. Both methods fall 30-36 percentage points from their in-distribution Zhang numbers to the leakage-clean numbers, consistent with substantial pretraining overlap inflating prior numbers (a CAMEO-vs-Zhang difficulty shift contributes too, so we read it as an upper bound on the leakage component). We additionally introduce representation-CJ, a hidden-state generalization of the Jacobian for architectures without a masked-LM head; show that the optimal K tracks how diffusely a model spreads its contact heads; and find that both methods lose the contact signal on both causal LMs we test (ProGen2), suggesting attention-encoded pair structure may depend on bidirectional pretraining.

[AI-165] Harness-MU: A Safe, Governed, and Effective Harness for Multi-User LLM Agents

Link: https://arxiv.org/abs/2606.21856
Authors: Wangxuan Fan,Xiaoyu Nie,Zhongxiang Dai
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures, 15 tables

Click to view abstract

Abstract: The increasing deployment of large language model (LLM) agents in collaborative workflows demands robust multi-user, multi-principal interaction mechanisms capable of enforcing access permissions, resolving authoritative conflicts, and preventing unauthorized data disclosure. However, a fundamental mismatch exists between the single-user training paradigm of contemporary LLMs and the hard constraints required for multi-principal governance, rendering probabilistic, prompt-based safeguards vulnerable under multi-turn adversarial [...]. Our key insight is that governance constraints (who is authorized, what is restricted, and whose instructions take precedence) are deterministic runtime variables that should be enforced by execution hooks rather than entrusted to the LLM. We present Harness-MU, the first model-agnostic, zero-tuning infrastructure framework for multi-user LLM agents. By decoupling language generation from safety orchestration, Harness-MU guarantees unbreakable permission boundaries while maximizing compliant demand satisfaction. Across four frontier open-weight and proprietary models on the Muses-Bench benchmark, Harness-MU achieves the goal of privacy preservation across all access-control attacks, outperforming the standard baseline by 0.28–0.39 in utility score and improving instruction-following accuracy by up to 48.9 percentage points. Harness-MU advances the philosophy of Harness Engineering, establishing that systematic infrastructure is essential for solving LLM multi-principal governance challenges. The code and data are available at this https URL.

[AI-166] UniRank: Unified Rank Allocation for Low-Rank LLM Compression

Link: https://arxiv.org/abs/2606.21847
Authors: Chao Han,Haozhe Hu,Fei Ma,Wei Zhang,Xiaoyu Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract: Low-rank decomposition serves as a promising compression paradigm for large language models; however, rank allocation remains challenging: manual rules lack generalizability, and learning-based approaches incur heavy computational overhead. To address these issues, we formulate global low-rank allocation as a sorting-and-truncation pipeline, and score each singular component via dual criteria: Local singular energy ratio that quantifies the intrinsic importance within the decomposed parameter matrix and Global functional importance (measured by input-output cosine similarity) that evaluates the functional significance of decomposed modules. We verify the strong correlation between high input-output cosine similarity and low effective rank through geometric interpretation and experimental validation. Furthermore, we propose rank-preserving fine-tuning, which performs direct LoRA tuning on decomposed weights and avoids extra information loss caused by re-truncation in conventional merging pipelines. Empirical results confirm that our method delivers sustained performance enhancements when combined with models featuring distinct decomposition schemes, model sizes and architectural designs, e.g. in one-shot compression without further fine-tuning, our method reduces perplexity by up to 50% compared with uniform and heuristic allocation baselines. Code will be available at this https URL.
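
To make the sorting-and-truncation idea concrete, here is a rough sketch of global rank allocation under a fixed budget. It is only an illustration under stated assumptions: the abstract does not say how the local and global criteria are combined, so the product used below is a guess, and `allocate_ranks`, `global_importance`, and the module names are hypothetical.

```python
from typing import Dict, List


def allocate_ranks(
    singular_values: Dict[str, List[float]],
    global_importance: Dict[str, float],
    budget: int,
) -> Dict[str, int]:
    """Keep the `budget` best-scoring singular components across all modules."""
    scored = []
    for name, sigmas in singular_values.items():
        energy = sum(s * s for s in sigmas)
        for idx, s in enumerate(sigmas):
            # Local criterion: share of the module's singular energy.
            local = (s * s) / energy if energy > 0 else 0.0
            # Assumption: combine local and global criteria by multiplication.
            scored.append((local * global_importance[name], name, idx))

    # Global sort, then truncate to the budget.
    scored.sort(key=lambda item: item[0], reverse=True)
    ranks = {name: 0 for name in singular_values}
    for _, name, _ in scored[:budget]:
        ranks[name] += 1
    return ranks


# Toy usage: two modules with equal (hypothetical) global importance.
sv = {"attn.q": [3.0, 1.0], "mlp.up": [2.0, 2.0]}
imp = {"attn.q": 1.0, "mlp.up": 1.0}
ranks_small = allocate_ranks(sv, imp, budget=2)
ranks_large = allocate_ranks(sv, imp, budget=3)
```

Because singular values within a module are sorted in descending order and the score grows with the singular value, the retained components of each module form a leading prefix, so the per-module count is a valid rank.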

[AI-167] AgentDSE: Reasoning-Augmented Architectural Design Space Exploration ISCA2026

Link: https://arxiv.org/abs/2606.21836
Authors: Chenyu Wang,Jiahe Caroline Shi,David Kong,Duane S. Boning,Zishen Wan,Yilun Du,Vijay Janapa Reddi
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

Click to view abstract

Abstract:Traditional architectural design space exploration (DSE) is highly inefficient, typically requiring tens of thousands of simulator evaluations across various optimization methods. This inefficiency arises because conventional methods treat the simulator as a black-box oracle. In contrast, human architects effectively guide exploration by reasoning through physical constraints, performance bottlenecks, data reuse, and workload structures. To bridge this gap, we introduce AgentDSE, a simulator-in-the-loop methodology driven by a general-purpose large language model (LLM) coding agent. AgentDSE automates this architectural-reasoning loop without requiring model fine-tuning, precomputed design databases, or domain-specific optimizer code. Across deep neural network (DNN) accelerator mapping, hardware/software co-design, and CPU cache-hierarchy optimization, AgentDSE achieves competitive or better design quality with up to two orders of magnitude fewer evaluations. AgentDSE also produces inspectable traces that surface architectural hypotheses, performance cliffs, implicit priors, and simulator artifacts, making every search decision traceable rather than buried in optimizer state.

[AI-168] AgentCAT: Simulating Computerized Adaptive Testing via Multi-Agent Large Language Models

Link: https://arxiv.org/abs/2606.21832
Authors: Weiyuan Zhou,Haiping Ma,Xiaoshan Yu,Changqian Wang,Shangshang Yang,Xingyi Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Computerized Adaptive Testing (CAT), as a key technology for personalized education, aims to accurately assess examinee proficiency by retrieving exercises dynamically matching current ability estimates. However, existing CAT research is constrained by limitations of static offline data and isolated component optimization. Restricted by partial labels in offline logs, researchers degrade the dynamic assessment process into static sequence prediction. Current research focuses on isolated perspectives, e.g., selection or diagnosis, neglecting the overall CAT interaction process. To address this, we propose AgentCAT, a Large Language Model-based multi-agent simulation system, to construct a high-fidelity benchmarking environment for dynamic testing. This framework comprises three modules: (1) The examinee agent with memory retrieval and Chain-of-Thought reasoning simulates responses based on cognitive profiles; (2) The selection agent uses coarse-to-fine bucketing and knowledge graph exploration to balance local difficulty and global coverage; (3) The supervisor uses dual-auditing and robust update to ensure convergence and validity. To validate the framework, we evaluated on two real-world datasets across three dimensions: macro-level ability convergence, micro-level interaction logic, and data sparsity resilience. Results show AgentCAT achieves effective ability estimation, and its selection strategy balances difficulty adaptation and instructional coherence, aligning with human pedagogical intuition.

[AI-169] CNnotator: LLM-Guided Memory Safety Annotation Synthesis ICSE2026

Link: https://arxiv.org/abs/2606.21822
Authors: Twain Byrnes,Mike Dodds
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 6 pages. Published at ReCode 2026 (1st Workshop on Code Translation, Transformation, and Modernization), co-located with ICSE 2026. This version corrects the description of the property-based testing backend (Bennet, built on Fulminate) relative to the published version

Click to view abstract

Abstract: Memory safety errors account for a large proportion of security bugs in systems written in C; modern languages such as Java and Rust prevent such bugs because they are memory-safe by design. To migrate systems to safer languages or identify memory errors, we must first determine how legacy code manipulates memory. This information is only represented implicitly in such code. In many cases, memory usage patterns are merely tedious for humans to figure out, rather than truly difficult. In this work, we ask if large language models (LLMs) can perform this task by having them synthesize annotations representing memory usage as specifications in CN, a hybrid testing/verification tool. Our tool, CNnotator, uses LLMs to automatically generate and test CN specifications. We find that current models are able to generate CN specifications for small-to-medium C programs, with the OpenAI o3 reasoning model achieving a 90% success rate on first attempts and 97% overall success, while the chat model GPT-4o correctly annotates 65% of first attempts. These results suggest AI-assisted annotation is becoming practical for real-world C codebases.

ACM classes: D.2.4; F.3.1; I.2
Journal reference: Proceedings of the 1st Workshop on Code Translation, Transformation, and Modernization (ReCode '26), April 12-18, 2026, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 6 pages
Related DOI: https://doi.org/10.1145/3786180.3788313

[AI-170] Steer, Don't Solve: Training Small Critic Models for Large Code Agents

Link: https://arxiv.org/abs/2606.21811
Authors: Shubham Gandhi,Yiqing Xie,Atharva Naik,Ruichen Zhu,Carolyn Rose
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: End-to-end code agent training is resource-intensive and plateaus on the strategy-level reasoning needed to resolve code issues, since jointly optimizing code-level execution and strategy-level reasoning leaves the latter underdeveloped. Instead, we freeze the agent and add a critic model to supply that signal. Prior code critics are post-hoc, scoring completed trajectories rather than steering the agent; we instead train a small critic that provides intra-trajectory feedback via Supervised Fine-Tuning. On SWE-bench Verified, a critic trained on CWM-32B trajectories transfers to two unseen agents (gains of +3.0 to +3.8 points), and adding target-agent trajectories to the corpus increases the gain to +3.8 on CWM-32B and +4.4 to +5.2 on two Qwen agents, at 30-92x lower critic cost than a strong teacher. On Qwen3-Next-80B-A3B, the critic-guided system is both more accurate (25.2% vs. 20.8%) and cheaper ($0.04 vs. $0.11) than the agent alone, because the critic also shortens trajectories. Our results show that a small, well-trained critic is a practical complement to scaling agent training. Code: this https URL. Data and models: this https URL

[AI-171] THREAD: Trajectory Planning for Hybrid Rigid-Soft Manipulators with Environment-Aware Diffusion IROS2026

Link: https://arxiv.org/abs/2606.21792
Authors: Shivani Kamtikar,Pranav Asthana,Naveen Kumar Uppalapati,Girish Krishnan,Girish Chowdhary
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL , IROS 2026

Click to view abstract

Abstract:Manipulation in confined environments, such as threading a manipulator through narrow apertures, remains a fundamental challenge, especially for conventional rigid robots. Hybrid rigid-soft manipulators offer promise but face two compounding planning challenges: backbone shapes feasible in free space become infeasible under environmental contact, and planning rigid and soft segments independently ignores their kinematic coupling. We present THREAD, the first diffusion-based trajectory planner for hybrid manipulation, learning a generative prior over physically realizable backbone trajectories conditioned on local environment geometry, with physics-inspired losses encoding curvature, smoothness, and collision constraints jointly across both segments. Trained in simulation, THREAD achieves 92.4% task success with 5x fewer collisions than the strongest baseline. We show cross-embodiment real-world transfer with minimal online updates, successfully threading through apertures as small as 1.3x the soft segment diameter.

[AI-172] Beyond the Next Step: Variable-Length Latent World Models for Long-Horizon Planning

Link: https://arxiv.org/abs/2606.21775
Authors: Tianqi Du,Qi Zhang,Yifei Wang,Yisen Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recently, world models have emerged as a promising paradigm for building intelligent agents by learning predictive models that estimate future environment states conditioned on observations and actions. In particular, JEPA-style latent world models provide an efficient alternative to pixel space prediction by learning action-conditioned dynamics in compact representation spaces. However, existing latent world models typically rely on one-step prediction and must be recursively rolled out for long-horizon planning, which leads to compounding errors and a mismatch between training objectives and downstream planning tasks. To address this limitation, we propose Variable-length Latent World Models (VLWMs), a framework that learns to predict future latent states conditioned on action sequences of variable lengths. Instead of training only on one-step transitions, VLWMs directly model temporally extended dynamics, allowing the same predictor to evaluate action plans over different horizons. We further introduce a curriculum training strategy that progressively expands the action horizon, stabilizing optimization from short-range dynamics to long-range prediction. At test time, we design planning methods tailored to VLWMs to better exploit their variable-length predictive capabilities. Experiments on long-horizon control tasks show that VLWMs significantly improve latent space world models, achieving 13% average improvement over the state-of-the-art LeWM across different datasets, with especially large gains on tasks requiring extended planning. These results suggest that VLWM provides a simple yet effective paradigm for improving long-horizon prediction and planning in latent world models.

[AI-173] Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Link: https://arxiv.org/abs/2606.21740
Authors: Rajesh Mangannavar,Zachary Coalson,Pranay Dugar,Prasad Tadepalli
Subjects: Artificial Intelligence (cs.AI)
Comments: 22 pages, 2 figures

Click to view abstract

Abstract: Translating natural-language planning intent into verified plans is a longstanding challenge: people communicate goals in language, while classical planners require formal PDDL specifications. Recent agentic frameworks bridge this gap by orchestrating a pool of specialized repair agents inside a verifier-checked refinement loop, but the orchestrator at the centre is itself a prompted frontier LLM, paying a frontier-LLM API call at every refinement step. We present HALO (Hybrid Agent-Learned Orchestrator), which trains the orchestrator from refinement trajectories that an external verifier has certified as ending in valid plans, across 11 PDDL domains. HALO pairs a small QLoRA-tuned policy with three hardcoded rules for trivially decidable selections, and operates over an expanded 21-agent action space. Unlike approaches that prompt a frontier LLM at every step or learn an orchestrator from sparse end-of-episode rewards, our key observation is that the verifier already provides strong guidance: every accepted trajectory is a sequence of demonstrably correct (state, agent) decisions, directly usable as supervision. Across PlanBench, Natural Plan, and classical planning benchmarks, HALO matches or exceeds the GPT-5-mini prompted baseline on success rate, sits within three percentage points of the stronger Gemini-3-Flash prompted baseline, reduces orchestration cost by more than an order of magnitude ($0.18 to $0.004 per task against GPT-5-mini, roughly 45× cheaper; roughly 15× cheaper than Gemini-3-Flash), and cuts total LLM calls per episode by 40 to 50 percent.

[AI-174] Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Link: https://arxiv.org/abs/2606.21732
Authors: Zesen Liu,Zihan Zhang,Dongdong She
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Summarization-based prompt compression is increasingly used by LLM agents to shorten long, distributed contexts, but it shifts the security boundary: filters inspect the pre-compression prompt while the backend acts on a newly generated compressed context. We identify relinking, a compression-boundary vulnerability where the compressor behaves as a confused deputy, summarizing distributed, locally benign fragments into a complete malicious instruction. Unlike prompt injection, relinking need not place an explicitly malicious payload in the source context. We show that relinking arises from summarization itself: attention makes separated fragments jointly available, pre-training makes compatible fragments plausible to connect, and post-training favors compact backend-actionable summaries. We formalize the attacker-induced form as adversarial relinking and present Relink, an automated DSL-based tool that splits malicious payloads into benign fragments while keeping the complete payload absent before compression. Across four long-context agent benchmarks, Relink achieves 86.9% Relink Rate and Backend Action Rate versus 17.0% for clean-split controls. Existing defenses fail to reliably capture adversarial relinking; our KBRA defense reduces residual Backend Action Rate to 0.0%.

[AI-175] Entropy Objectives in Markov Decision Processes

Link: https://arxiv.org/abs/2606.21726
Authors: S. Akshay,Raghav Goyal,Aditya Neeraje,Piyush Srivastava
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We consider the problem of synthesizing control policies that enforce a concentration property on the state distributions of a stochastic system. We present a formalization of this problem in terms of synthesizing strategies for maintaining an entropy-based objective in Markov Decision Processes (MDPs). We first show that even relaxed versions of this problem are complexity-theoretically hard. We then present a sound and (conditionally) relatively complete method to verify and synthesize strategies for such entropy objectives. The main challenge is the non-linear nature of such objectives, and our approach addresses this by exploiting and combining ideas from convex duality and invariant synthesis. We also investigate the role of memory and randomization in ensuring entropy objectives. Finally, we implement our ideas to evaluate our approach empirically on a few illustrative benchmarks.

[AI-176] Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models

Link: https://arxiv.org/abs/2606.21672
Authors: Tianyou Wang,Anson Lei,Joe Watson,Ingmar Posner
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 8 figures. Project page: this https URL

Click to view abstract

Abstract:Imitation learning has emerged as a powerful paradigm for learning visuomotor policies, but its generalisation and stability are limited by the scale and quality of demonstration data needed. A promising direction is to leverage more abundant but heterogeneous data sources, which differ in action space and often lack action labels altogether. Existing co-training approaches that combine heterogeneous data sources rely on heuristic and hand-engineered alignment techniques. In contrast, we argue that action representations should be grounded in prediction: actions that produce the same effect on the environment should share the same representation, regardless of their sources. To this end, we instantiate this principle by using a grounded latent-action world model (GLAM), a pair of generative models with a shared latent action space across data sources that is grounded by predicting future observations consistently across sources. This latent action space is used to train downstream behavioural cloning (BC) policies which map observations to latent actions and decode them back to robot actions, providing a paradigm for learning from heterogeneous data. Empirically, we demonstrate that GLAM successfully learns an aligned latent action space that facilitates action transfer across data sources with and without action labels. Across five manipulation tasks in simulation and in the real world, GLAM-aligned policies significantly outperform BC baselines and prior latent-action methods, achieving an average of +48% improvement in task success rate with the same data-scarce setting. Videos and code are available at this https URL.

[AI-177] Improving Text-to-Music Generation with Human Preference Rewards ICME2026

Link: https://arxiv.org/abs/2606.21670
Authors: Yonghyun Kim,Junwon Lee,Haiwen Xia,Yinghao Ma,Chris Donahue
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICME 2026 Grand Challenge on Academic Text-to-Music Generation

Click to view abstract

Abstract:We describe our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. Beyond the challenge protocol’s FAD-CLAP and CLAP score, we add a learned human-preference reward from TuneJury, a twin pairwise ranker trained over open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time CFG axis, (ii) a sweep over five score-conditioning architectures, where training and inference use different variants, (iii) expert iteration on the top decile, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. Per-stage decomposition on 100 Song Describer prompts shows training-time reward conditioning as a functional conditioning axis, expert iteration as the dominant contributor, the preference-tuning pass adding only noise-level gain, and the inference-time score scalar already saturated by the end of the chain.

[AI-178] When Is an LLM Worth It for Hyperparameter Optimization? A Budget-Matched Study on Tabular Data Finds the Warm-Start Is a Default Configuration, Not the Model

Link: https://arxiv.org/abs/2606.21641
Authors: Carson Rodrigues,Oysturn Vas
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures

Click to view abstract

Abstract:Large language models (LLMs) have been proposed as hyperparameter-optimization (HPO) advisors that “warm-start” search from prior knowledge, proposing strong configurations in very few evaluations. We test that claim under a budget-matched, multi-seed protocol on eight PMLB tabular benchmarks, comparing an LLM advisor (LLM-OptFlow) against four classical baselines (random search, Optuna-TPE, Gaussian-process Bayesian optimization, and successive halving) over one shared search space, with paired tests and bootstrap 95% CIs across 8 x 5 = 40 (task, seed) units. The finding is cautionary. The advisor’s strong first point is not an LLM output at all: like prior LLM-HPO systems the loop is seeded with a fixed default configuration, evaluated before any model call, which alone reaches 88.7% mean best-CV, identical to within 0.01 pp across all seven advisor models tested. The LLM’s own proposals add only +0.40 pp of cross-validation accuracy over that seed and nothing on held-out test (LLM-Default = -0.01 pp, p = 0.92). When the same seed is granted to classical search, the apparent lead collapses: against seeded random search it leads by +0.20 pp at 2 evaluations, is tied by 5, and is behind by 12 (-0.37 pp). Without the seed, classical search ties the advisor by 12 evaluations and beats it by 40 (+0.6 to +0.8 pp, p = 1e-4). Two LLM-specific behaviors survive: a single-task exploration failure (vehicle), and a rule-based confidence filter that removes ~33% of wasted compute without changing accuracy. The recommendation is deflationary: on tabular HPO, seed classical search with a sensible default; an LLM advisor adds no measurable generalization benefit and is overtaken within a handful of evaluations. We release the harness and a script that reproduces every statistic.
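
The abstract's recommendation (seed classical search with a sensible default) amounts to spending the first evaluation on the default configuration before any sampling. The sketch below shows that pattern for random search; it is not the authors' released harness, and `seeded_random_search`, `evaluate`, and `sample` are hypothetical names chosen for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def seeded_random_search(
    evaluate: Callable[[Config], float],
    default: Config,
    sample: Callable[[random.Random], Config],
    budget: int,
    seed: int = 0,
) -> Tuple[Config, List[float]]:
    """Random search whose first evaluation is a fixed default configuration.

    Returns the best configuration and the best-so-far score after each
    evaluation, so runs can be compared at a matched evaluation budget.
    """
    rng = random.Random(seed)
    best_cfg, best_score = default, evaluate(default)  # evaluation 1: the seed
    history = [best_score]
    for _ in range(budget - 1):
        cfg = sample(rng)
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
        history.append(best_score)
    return best_cfg, history


# Toy usage: a one-dimensional objective peaking at x = 0.3.
def toy_objective(cfg: Config) -> float:
    return -((cfg["x"] - 0.3) ** 2)


best, trace = seeded_random_search(
    toy_objective,
    default={"x": 0.5},
    sample=lambda rng: {"x": rng.random()},
    budget=5,
)
```

Recording the best-so-far trace per evaluation is what allows statements such as "tied by 5 evaluations" to be checked against another optimizer given the same seed point.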

[AI-179] Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Link: https://arxiv.org/abs/2606.21627
Authors: Sashank Pisupati,Henry Broomfield,Eujeong Choi,Antonia Calvi,Charlie Wang,Roman Engeler,Max Bartolo,Patrick Lewis
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: As agentic systems tackle increasingly complex multi-step tasks, evaluating their trajectories presents a major bottleneck: human annotation of a single trajectory on popular agentic benchmarks can take hours, making it difficult to scale evaluations for measuring performance or curating training data. This has driven widespread reliance on automated approaches such as LLM-as-a-judge (LLMJ) to critique agents at the process and outcome levels at scale; however, the soundness of LLMJ critiques often goes unmeasured. Here, we introduce Counsel, the first public dataset of meta-evaluations for agentic tasks. Counsel consists of process-level critiques from open-weight LLMJs on two agent benchmarks: tau-bench (customer support agents) and DA-Code (coding agents), and human meta-evaluations of these critiques. Human annotators label critiques on each flagged error as “spot on”, “correct location but poor reasoning”, or “should not have flagged”, achieving reliable inter-annotator agreement (Krippendorff’s alpha of 0.78). The resulting dataset stratifies LLMJ critiques by human alignment across both error location within a trajectory and reasoning quality, serving as valuable data to calibrate, improve, or train LLMJs for agents. Comparing open-weight judges, we find that more capable judge models and more reasoning effort both enabled improved human agreement, with the strongest judge reaching ~88% agreement on location and ~65% on reasoning. Counsel is generated using open-weight models and is permissively licensed for broad community use, which we hope will enable rigorous study and improved alignment of LLM-based evaluators for agentic systems.

[AI-180] The Two-Hump Problem: Bridging the Difficulty Gap in Mathematical Reinforcement Learning ALT ICML2026

Link: https://arxiv.org/abs/2606.21611
Authors: Lucas Fagan,Michele Tarquini,Ali Shehper,Maksymilian Manko,Angus Gruen,Coco Huang,Giorgi Butbaia,Davide Passaro,Sergei Gukov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Group Theory (math.GR); Geometric Topology (math.GT)
Comments: Accepted at ICML 2026. 38 pages, 9 figures. Code and datasets: this https URL

Click to view abstract

Abstract:Mathematical search problems present a unique challenge for Reinforcement Learning (RL) due to vast search spaces and sparse rewards. In previous works, the Andrews-Curtis (AC) conjecture was established as an illustrative example of such problems. In this work, we identify a critical structural barrier in the AC landscape: a “Two-Hump” distribution, where problem instances are either trivially solvable or effectively impossible, with a scarcity of intermediate “hard-but-solvable” instances required for effective learning. We tackle this challenge through two primary avenues: novel data generation techniques to populate the difficulty gap, and significant algorithmic enhancements including the introduction of supermoves and Transformer-based architectures. We demonstrate substantial performance improvements over previous baselines, and release new comprehensive benchmark datasets including AC-19 (125,192 AC-trivial presentations of varying difficulty with length at most 19) and AC-1M (1,136,154 hard AC-trivial presentations of length at most 30), the first large-scale, publicly available datasets of this kind.

[AI-181] FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning for Autonomous Driving

Link: https://arxiv.org/abs/2606.21587
Authors: Bonan Wang,Letian Tao,Bin Shuai,Jiaxin Gao,Wenxin Zhao,Wei Xiong,Kehua Sheng,Bo Zhang,Yang Guan,Shengbo Eben Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep reinforcement learning is pivotal for closed-loop autonomous driving yet remains constrained by severe bottlenecks in sampling efficiency. Standard parallel sampling mitigates this but suffers from the straggler effect, where the premature termination of a single environment necessitates a synchronized batch re-initialization, leading to suboptimal sample utilization and prohibitive re-initialization latency. To address this, we propose FAST, a synchronous parallel framework tailored for closed-loop simulation. Specifically, FAST employs Dynamic Parallel Sampling Alignment (DPSA) to maintain vectorization synchronization by extending terminated episodes via virtual continuation, thereby decoupling the sampling loop from individual terminations. By dynamically triggering global truncation based on the termination rate of parallel clips, FAST effectively eliminates the bottleneck of premature resets without sacrificing data diversity. Furthermore, to strictly preserve theoretical consistency, we incorporate a Scaled Mask-Padding Optimization (SMPO) that leverages validity masking and adaptive loss normalization to nullify the bias from auxiliary padding data. Empirical evaluations demonstrate that FAST achieves at least a 1.78 times wall-clock speedup over the single-clip baseline while preserving statistical unbiasedness.
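
The validity-masking idea mentioned in this abstract can be illustrated with a generic masked mean: padded samples from virtually continued episodes are excluded from the loss, and the sum is normalized by the number of valid samples instead of the batch size. This is a sketch of the general technique only; the abstract does not specify SMPO's exact normalization, and `masked_mean_loss` is a hypothetical helper.

```python
from typing import List


def masked_mean_loss(losses: List[float], valid: List[bool]) -> float:
    """Mean loss over valid samples only; padded samples contribute nothing."""
    total = sum(loss for loss, ok in zip(losses, valid) if ok)
    count = sum(1 for ok in valid if ok)
    # Normalize by the valid count so padding does not shrink or bias the mean.
    return total / count if count else 0.0


# Toy usage: the third sample is padding after an early termination.
per_sample_loss = [1.0, 3.0, 100.0]
validity_mask = [True, True, False]
masked = masked_mean_loss(per_sample_loss, validity_mask)
naive = sum(per_sample_loss) / len(per_sample_loss)
```

In the toy batch the naive mean is dominated by the padded sample, while the masked mean depends only on the two real transitions.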

[AI-182] Composing Verifiable Conceptual Models via Building Blocks: Towards Design-Time Verification of Agentic AI Workflows

Link: https://arxiv.org/abs/2606.21565
Authors: Noe Y. Flandre,Alexander C. Nwala,Philippe J. Giabbanelli
Subjects: Artificial Intelligence (cs.AI)
Comments: To appear at the 2026 Winter Simulation Conference

Click to view abstract

Abstract: Agentic AI systems orchestrate multiple LLM-based agents through workflow architectures that coordinate decisions, tools, and external actions. While current platforms emphasize runtime safeguards, little support exists for verifying workflows during system design. From a Modeling & Simulation perspective, this gap is analogous to composing conceptual models without verifying whether their building blocks interact coherently. We propose a design-time verification approach that models agentic workflows as compositions of reusable building blocks and checks their compatibility through twelve structural rules. We implemented these rules in a software prototype and evaluated them using two openly released datasets: 48 workflows with known design flaws and 168 variants that preserve workflow logic but alter graph structure. Results show that our verifier reliably detects violations even when flawed designs are obscured through structural transformations such as splitting tasks between agents. Future work could combine our verification with community repositories of building blocks to compose safe agentic workflows.

[AI-183] AI Alignment From Social Choice Perspectives

Link: https://arxiv.org/abs/2606.21550
Authors: Daniel Halpern,Evi Micha,Ariel D. Procaccia,Benjamin Schiffer,Itai Shapira,Shirley Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted for publication in ACM SIGecom Exchanges

Click to view abstract

Abstract:Alignment from human feedback uses human judgments about model outputs to steer the behavior of language models after pretraining. When those judgments reflect conflicting views of desirable behavior, the learned objective becomes an aggregate determination of what the model should prefer. We survey recent work that has studied this aggregation problem through the lens of social choice theory. We illustrate how the social choice perspective helps identify failure modes in the feedback aggregation layer and reveals a broader design space for handling disagreement in explicit and principled ways.

[AI-184] Backpropagating Through Simulation: Analytic Policy Gradients for Sample and Learning Efficient Differentiable Continuous Control

Link: https://arxiv.org/abs/2606.21525
Authors: Yueci Deng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Model-free reinforcement learning algorithms such as Proximal Policy Optimization (PPO) treat the environment as a black box, estimating policy gradients from sampled rewards; this process demands millions of interactions and relies on high-variance advantage estimates. When environment dynamics are differentiable, the return is an end-to-end differentiable function of the policy parameters, enabling exact gradient computation via backpropagation through simulation. We term this approach Analytic Policy Gradients (APG) and evaluate it against PPO on four continuous control tasks of increasing dynamical complexity: a one-dimensional point-mass target-reaching task, a 2D point-mass navigation task with obstacle avoidance, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. Both algorithms share identical model architectures, observation normalization, and optimizer settings. To decouple sample efficiency from compute efficiency, we design a multi-axis evaluation protocol that records performance against environment steps and gradient steps. We report a segmented backpropagation scheme with MC and critic-based bootstrap modes that mitigates gradient degradation on long-horizon tasks, and present ablations over segment length and bootstrap strategy.
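
As a rough illustration of backpropagating through a differentiable simulator, the sketch below uses a toy one-dimensional point mass driven by a hypothetical linear policy with gain `theta`. The dynamics, the loss, and every name here are assumptions made for illustration and are not taken from the paper. The code computes the exact gradient of the final squared error with a hand-written reverse pass over the rollout and checks it against a finite-difference estimate.

```python
def rollout(theta, x0=0.0, target=1.0, dt=0.1, steps=20):
    """Simulate the point mass and return the position errors e_0..e_T."""
    errors = [target - x0]
    for _ in range(steps):
        e = errors[-1]
        velocity = theta * e              # hypothetical linear policy
        errors.append(e - dt * velocity)  # x_{t+1} = x_t + dt * v_t
    return errors


def loss_and_grad(theta, dt=0.1, steps=20):
    """Return the final squared error and its exact gradient w.r.t. theta."""
    errors = rollout(theta, dt=dt, steps=steps)
    loss = errors[-1] ** 2
    adjoint = 2.0 * errors[-1]            # dL/de_T
    grad = 0.0
    # Reverse pass: walk the rollout backwards, accumulating dL/dtheta.
    for e in reversed(errors[:-1]):
        grad += adjoint * (-dt * e)       # direct effect of theta on e_{t+1}
        adjoint *= (1.0 - dt * theta)     # propagate dL/de_{t+1} to dL/de_t
    return loss, grad


def finite_difference(theta, eps=1e-6):
    """Central-difference estimate of dL/dtheta, used only as a check."""
    loss_plus, _ = loss_and_grad(theta + eps)
    loss_minus, _ = loss_and_grad(theta - eps)
    return (loss_plus - loss_minus) / (2.0 * eps)


if __name__ == "__main__":
    loss, grad = loss_and_grad(0.5)
    print("loss:", loss)
    print("analytic gradient:", grad)
    print("finite-difference gradient:", finite_difference(0.5))
```

A model-free method would have to estimate this gradient from sampled returns; here a single rollout yields it exactly, which is the contrast the abstract draws.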

[AI-185] Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training

Link: https://arxiv.org/abs/2606.21498
Authors: Yuanhao Chiang,Hongbo Duan,Chunru Yang,Jiahua Pei,Yi Liu,Xueqian Wang
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Autoregressive text-to-image (T2I) generation has recently advanced rapidly, yet aligning generated images with human preferences remains challenging. GRPO-style online reinforcement learning provides an effective framework; however, existing methods typically treat reference-policy divergence as fixed, despite its direct impact on policy optimization. We study this overlooked factor within a unified f-divergence framework, encompassing forward KL, reverse KL, and JS divergence, for GRPO-style autoregressive T2I alignment. Our systematic theoretical analysis reveals that different divergences reshape token-level updates in distinct ways. In particular, under the sampled-token shaping form used, JS regularization achieves a favorable trade-off by mitigating uniform bias relative to the reference policy while still discouraging large deviations. Extensive experiments on LlamaGen and Janus-7B show that JS divergence achieves the strongest or highly competitive optimization performance on most evaluation metrics while maintaining favorable generation diversity. The code is available at this https URL.
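
For readers unfamiliar with the three divergences named above, the following minimal sketch computes them for two small discrete distributions. The token probabilities are made up, and the naming convention (forward KL as KL(reference || policy), reverse KL as KL(policy || reference)) is an assumption; the paper may define the directions differently. This is not the paper's sampled-token shaping form, only the textbook definitions.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint mixture."""
    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical next-token distributions over a three-token vocabulary.
policy = [0.70, 0.20, 0.10]
reference = [0.40, 0.40, 0.20]

divergences = {
    "forward_kl": kl(reference, policy),  # assumed convention, see lead-in
    "reverse_kl": kl(policy, reference),
    "js": js(policy, reference),
}

if __name__ == "__main__":
    for name, value in divergences.items():
        print(name, value)
```

Unlike either KL direction, JS is symmetric and bounded above by ln 2, which is one reason it can act as a gentler regularizer.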

[AI-186] Breaking chains with trees: Deep learning with $\mathcal{O}(\log N)$ parallel time complexity

Link: https://arxiv.org/abs/2606.21497
Authors: Neeraj Mohan Sushma,Aditya Nagarsekar,Cabrel Teguemne Fokam,Robin Schiewer,Amit Kumar Pal,Anand Subramoney,David Kappel
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: 20 pages, 2 figures, 12 tables

Abstract:Modern deep neural network architectures are trained via backpropagation, which requires errors to be sequentially propagated through all layers before parameters can be updated. This introduces two limitations: locking, where layer-wise updates are strictly interdependent and cannot proceed in parallel, and the weight transport problem, which requires symmetric forward and backward pathways for exact gradient computation. These constraints restrict parallelism, increase memory and communication overhead, and pose challenges for scalable learning. In this work, we propose Hierarchical Block-Local Learning (HBLL), a framework that decomposes deep neural networks into hierarchically linked blocks trained using local learning objectives derived from variational principles, eliminating the need for full end-to-end backpropagation while maintaining effective information propagation across the network. HBLL is the first algorithm that is able to train deep neural networks in $\mathcal{O}(\log N)$ parallel time complexity, where $N$ is the number of network layers. We show that HBLL implicitly defines a family of subnetworks corresponding to different hierarchical paths, enabling flexible inference with different effective numbers of layers. We evaluate HBLL on a set of challenging vision and language modeling tasks, achieving competitive performance. We also extend HBLL to recurrent sequence architectures, applying to settings that otherwise rely on backpropagation through time.

[AI-187] Predicting High-Risk Colorectal Polyps in African Americans Using Pre-Colonoscopy Clinical Features: Machine Learning Model Development and Temporal Validation

Link: https://arxiv.org/abs/2606.21492
Authors: Basheer Qolomany,Mrinalini Deverapall,Adeyinka Laiyemo,Zaki Sherif,Mori Yuichi,Omer Ahmed,Hassan Brim,Hassan Ashktorab
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Risk stratification for advanced colorectal polyps typically relies on colonoscopy and/or pathology findings. However, there is growing interest in whether non-invasive features available prior to colonoscopy can help identify patients at higher risk. Such approaches may enhance clinical decision-making by prioritizing surveillance for individuals most likely to harbor high-risk polyps when colonoscopy resources are limited, while potentially reducing unnecessary procedures in lower-risk patients. Importantly, the use of non-invasive, pre-procedural information may also help promote more equitable access to risk stratification, particularly in settings where colonoscopy resources are limited or unevenly distributed. We aimed to develop and externally validate machine learning models to predict high-risk colorectal polyps using only non-invasive, pre-colonoscopy demographic, clinical, and behavioral features in a diverse, predominantly African American, urban cohort. We conducted a retrospective cohort study using demographic, lifestyle, and comorbidity data from patients who underwent colonoscopy at Howard University Hospital to develop and validate several machine learning models, including neural networks, random forest, support vector machines (SVM), Naive Bayes, logistic regression, decision trees, k-nearest neighbors (KNN), and XGBoost, for predicting high-risk colorectal polyps. High-risk polyps (HRP) were defined as villous or tubulovillous adenomas, high-grade dysplasia, polyps ≥ 10 mm in size, and/or the presence of ≥ 3 polyps per procedure; all other cases were classified as low-risk polyps (LRP). The dataset included 4,681 patients from 2015-2022 used for internal validation and 1,562 patients from 2023-2024 used for external validation.
Related DOI: https://doi.org/10.1007/s10620-026-10045-1

[AI-188] Towards Transparent Mental Health Insights: An Explainable AI Model for Career-Related Depression and Anxiety Among University Students Using Structured Data

Link: https://arxiv.org/abs/2606.21474
Authors: Arsham Azam,Rasikh Ali,Tayyaba Farhat,Sheeraz Akram
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AHTBE 2025 Conference, published in Medical Sciences Forum (MDPI). 18 pages, 5 figures, 5 tables

Abstract:Career anxiety and depression among university students present a growing challenge to mental health and academic achievement. This study proposes an Explainable AI (XAI) framework using multimodal data and Federated Learning (FL) to identify early indicators of career-related mental health problems in a privacy-preserving and culturally responsive manner. The framework combines structured behavioral data and facial emotion features from interview videos via an intermediate fusion neural network with attention mechanisms. Label smoothing was applied to improve model generalizability. FL was used across institutions to enable collaborative training without raw data sharing. Evaluation was conducted using the Student Mental Health Survey dataset from university students across Pakistan. Our model attained an F1-score of 89.12%, recall of 86.54%, accuracy of 92.08%, and precision of 91.88%. Using Integrated Gradients and SHAP, the model identified key behavioral markers of depression including avoidance of direct gaze, lower facial expressiveness, and social withdrawal, consistent with psychological theory. This research presents an interpretable, scalable, and context-sensitive AI system for mental health pre-diagnosis with potential integration into student support services globally.

[AI-189] AutoRAS: Learning Robust Agentic Systems with Primitive Representations ICML2026

Link: https://arxiv.org/abs/2606.21445
Authors: Yang Yue,Xuancheng Zhu,Yuyang Ma,Guoshun Nan,Zihan Dou,Jingru Shan,Congyu Guo,Ji Zhang,Hua Wang,Jingfeng Zhang
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML2026

Abstract:The automated design of agentic systems offers a promising pathway for scaling large language models (LLMs) beyond single-agent reasoning. While prior work has advanced task performance through handcrafted or automatically generated multi-agent workflows, robustness is often treated as an afterthought, leaving systems vulnerable to external adversaries and internal failures. We propose AutoRAS, a framework for the Automated design of Robust Agentic Systems. AutoRAS formulates system design as generating a sequence of symbolic primitives that jointly encode structural connectivity and behavioral actions, and learns to optimize this sequence using execution-derived safety signals and flow-based sequence-level objectives. Extensive experiments show that AutoRAS achieves the best performance in both vanilla and adversarial settings, with the smallest performance degradation under attacks. Further analyses demonstrate strong transferability, stable optimization behavior, stability across primitive sets, and favorable cost trade-offs. Our code is available at this https URL.

[AI-190] Does Mixture-of-Experts Actually Help Inference on Consumer and Edge Hardware? An Empirical Study

Link: https://arxiv.org/abs/2606.21428
Authors: Alfarizy Alfarizy,Hung Truong Thanh Nguyen,René Richard,Roozbeh Razavi-Far,Hung Cao
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 tables, 4 figures. Submitted to FAIEMA 2026. Code available at this https URL

Abstract:Mixture-of-Experts (MoE) language models are often described as ideal for resource-constrained inference. Each token activates only a small subset of experts, so the per-token compute cost, in floating-point operations (FLOPs), resembles that of a much smaller dense model. Whether that FLOP advantage survives in practice is far less clear. We ask whether MoE models actually run faster and cheaper than comparable dense models on consumer-grade and edge hardware. We benchmark OLMoE-1B-7B (1.3 B active of 6.9 B total) against three dense baselines on an Apple M2 Pro and an NVIDIA Jetson Orin Nano 8 GB through this http URL, measuring throughput, memory, and on-device energy. The answer is device-dependent: OLMoE’s active-parameter advantage is only partly realised on the laptop (~10% behind the same-active Llama-3.2-1B) and erodes on the edge device (~31% behind, at 2.1× the energy per token, with peak memory at the 8 GB ceiling). Patching this http URL to time the decode graph node-by-node shows routing accounts for under 9% of MoE-block compute on the cleaner edge backend, so the gap reflects total-parameter memory footprint, expert dispatch, and KV-cache pressure rather than routing. The implication is that on bandwidth-bound edge hardware, inference cost tracks total parameters, not active ones, and sparse activation does not buy back what the device is constrained on. These findings are bounded to one MoE model at this parameter scale and two devices, and we release the full measurement harness and per-run data.

[AI-191] Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents

Link: https://arxiv.org/abs/2606.21409
Authors: Chubin Zhang,Zhenglin Wan,Xingrui Yu,Pengfei Zhou,Wangbo Zhao,Jingxuan Wu,Yaxin Zhou,Ivor Tsang
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract:Tool-augmented agents are typically evaluated by their gains under reliable external feedback. Yet these gains leave open a key counterfactual: when feedback is unreliable, would the agent be better off receiving no task evidence? We study this question with a controlled matched-loop comparison that fixes the agent loop, prompt, action space, and decoding, while varying only the returned observation: faithful, misleading, or absent. Across question answering and fact verification, persistent misleading feedback produces a value inversion: agents that benefit from clean tools can perform worse than the matched no-feedback fallback. On HotpotQA, Qwen2.5-7B reaches 44.8 F1 with clean retrieval and 22.3 F1 with no feedback, but drops to 4.7 F1 under shuffled retrieval. The inversion persists under stronger clean retrieval and locally plausible distractors, but weakens when later clean evidence can repair the trajectory. Early trajectory signals predict many failures, yet simple repairs remain fallback-limited: rejecting bad evidence helps only when the exposed fallback is reliable. These results show that clean-tool gains can overstate tool value, and that matched no-feedback fallback controls are necessary for evaluating tool-augmented agents.

[AI-192] SwarmX: Agentic Scheduling for Low-Latency Agentic Systems

Link: https://arxiv.org/abs/2606.21401
Authors: Yeqi Huang,Yanwei Ye,Guomin Chen,Wenhao Su,Bin Gong,Jialian Li,Zhan Lu,Yangshen Deng,Xuan Sun,Le Xu,Luo Mai
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 14 pages

Abstract:Agentic AI applications compose multiple model calls and tool executions, creating new scheduling challenges for GPU-CPU clusters. Their inference time and model-call structure often depend on prompt semantics, making conventional scheduling approaches ineffective for low-latency serving. This paper presents SwarmX, a system that implements agentic scheduling for low-latency agentic applications. SwarmX uses scheduling-specific neural predictors to capture prompt, device, runtime, and target-model features; exposes distributional predictions to routers and scalers for tail-aware decisions; and provides mechanisms for predictor training and online adaptation. These predictors and mechanisms are integrated into a scheduler-agent framework that provides a common substrate for integration with existing scheduling and model-serving infrastructure. We evaluate SwarmX using production deployment (nearly one thousand GPUs and one million CPU cores) and controlled experiments on a 128-GPU testbed. Across multi-agent code generation, deep research, and multimodal agentic workflows, SwarmX reduces tail latency by up to 61.5% compared to state-of-the-art schedulers and sustains up to 2x the throughput of production schedulers under the same SLO.

[AI-193] Calibration Is Not Control: Why LLM-Agent Oversight Needs Intervention

Link: https://arxiv.org/abs/2606.21399
Authors: Chubin Zhang,Zhenglin Wan,Xingrui Yu,Jingxuan Wu,Qi Wen,Pengfei Zhou,Wangbo Zhao,Ivor Tsang
Categories: Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:Runtime oversight for LLM agents is commonly framed as scalar risk prediction: estimate failure likelihood, confidence, or uncertainty, then intervene once the score crosses a threshold. We argue that this framing targets the wrong object for control. The relevant question is not how likely the agent is to fail if it continues, but whether an available intervention would improve the outcome. Two trajectory prefixes can have the same risk estimate while requiring different actions, because one remains recoverable and the other does not. We formalize this mismatch as target error and identify intervention advantage, the expected utility gain from intervening rather than continuing, as the decision object for oversight. To measure this mismatch, we introduce prefix branching, a same-prefix counterfactual protocol that executes candidate actions from identical trajectory states. Across four benchmarks, action-conditioned control yields regime-dependent gains over scalar routing. In a calibration decomposition, recalibrating the same scalar score improves prediction metrics but leaves control regret unchanged, showing that calibration alone does not repair target error. A simple prefix-only action-conditioned controller substantially reduces regret in the strongest interactive regime, from 0.506 to 0.110 on ALFWorld. Gains shrink when interventions are weak or when scalar routing already preserves intervention-relevant information. These results suggest that LLM-agent oversight should move from calibrated risk scoring toward action-conditioned value estimation.
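
The distinction between scalar risk and intervention advantage can be made concrete with a toy example. In the sketch below, two hypothetical trajectory prefixes have the same failure risk under continuation but opposite intervention advantages. All probabilities, the 0.5 threshold, and the choice of utility (1 for success, 0 for failure) are assumptions for illustration and are not drawn from the paper's benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Prefix:
    name: str
    p_success_continue: float   # P(success) if the agent simply continues
    p_success_intervene: float  # P(success) if the overseer intervenes


def risk(prefix):
    """Scalar risk: probability of failure if the agent continues."""
    return 1.0 - prefix.p_success_continue


def intervention_advantage(prefix):
    """Expected utility gain from intervening rather than continuing."""
    return prefix.p_success_intervene - prefix.p_success_continue


def scalar_router(prefix, threshold=0.5):
    """Intervene whenever the risk score crosses a fixed threshold."""
    return risk(prefix) > threshold


def action_conditioned_router(prefix):
    """Intervene only when the intervention is expected to help."""
    return intervention_advantage(prefix) > 0.0


# Two made-up prefixes with identical risk but different recoverability.
recoverable = Prefix("recoverable", 0.3, 0.8)
unrecoverable = Prefix("unrecoverable", 0.3, 0.1)

if __name__ == "__main__":
    for p in (recoverable, unrecoverable):
        print(p.name, risk(p), scalar_router(p), action_conditioned_router(p))
```

The scalar router treats both prefixes identically because it only sees the risk score, whereas the action-conditioned router intervenes on the first and leaves the second alone.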

[AI-194] Evaluating LLMs for Real-World Web Vulnerability Detection

Link: https://arxiv.org/abs/2606.21397
Authors: Sebastian Neef,Luca Jungnickel,Antonio Benjamin Buchholz,Valene Spence,Vicente Birke Gonzalez
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To be published at AICCPS Workshop @ ARES 2026

Abstract:Large Language Models (LLMs) have emerged as a promising tool for automated vulnerability detection, yet their effectiveness on web-specific vulnerabilities remains to be explored. This work benchmarks six frontier (Claude Opus 4.6, Codex GPT-5.4, Gemini 3.1-pro-preview) and open-weight models (Qwen 3.5, Qwen 3 Coder Next, MiniMax M2.5) on their ability to detect real-world web vulnerabilities using static analysis in WordPress plugins, including SQL injection, stored cross-site scripting, path traversal, and remote code execution. Using five prompt designs of varying structure, scope, and complexity across three experiment iterations, we aim to answer how model and prompt choice affects vulnerability detection. Our results show that all models are capable of detecting valid security issues, but the detection rate varies depending on the model and prompt. For example, Claude Opus 4.6 achieved the highest web vulnerability detection rate (63%), while open-weight MiniMax M2.5 performs on par with other frontier models (48%), and self-hosted Qwen 3.5 only achieved 35%. We show that scoped prompts that narrow the vulnerability scope outperform open-ended ones, whereas the prompt complexity has little impact. Surprisingly, no model achieved full reporting consistency across three experiment iterations, with some as low as 50%. Our experiments demonstrate the opportunities and limits of LLM-based vulnerability detection, as no model correctly identified one baseline vulnerability in one of the plugins. Additionally, we derive practical lessons learned for security practitioners and publish all code and data to support future research. 
Submission history: [v1] Fri, 19 Jun 2026 13:02:47 UTC (132 KB)

[AI-195] Unsupervised Disentanglement Without Compromises: How Functional Orthogonality Enforces Identifiability ICML2026

Link: https://arxiv.org/abs/2606.21385
Authors: Mathieu Cyrille Simon,Pascal Frossard,Christophe De Vleeschouwer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Abstract:This paper explores unsupervised disentangled representation learning from a functional perspective. We define latent concepts as factors that influence observations through locally orthogonal directions, formalized as an orthogonality constraint on the Jacobian of the generative mapping. We prove that this condition yields identifiability of general nonlinear generative models, without requiring statistical independence or causal assumptions, provided the latent domain admits all combinations of factor values. Experiments with orthogonality-regularized normalizing flows empirically confirm the theory, demonstrate reliable recovery of ground-truth factors, and shed light on the success of VAEs. These findings challenge the prevailing impossibility claims for unsupervised disentanglement and provide a principled alternative foundation.

[AI-196] LambdaMark: Semantic Audio Watermarking for Robustness and Radioactivity

Link: https://arxiv.org/abs/2606.21365
Authors: Kexin Li,Xiao Hu,Ilya Grishchenko,David Lie
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in generative audio have made voice cloning increasingly effortless, enabling voice fraud, impersonation, and other forms of unauthorized use. A common attack finetunes a speech generation model on recordings of a target speaker, allowing the model to synthesize speech in that speaker’s voice. Audio watermarking offers a promising defense by embedding detectable signals into audio. A practical watermark must satisfy two key properties: robustness and radioactivity. Existing audio watermarking methods typically embed signals into low-level representations, such as waveforms or spectrograms, which makes them vulnerable to signal-level manipulations and limits their transfer to downstream models. We introduce LambdaMark – the first generic radioactive watermarking scheme. Unlike all previous approaches, LambdaMark achieves generic radioactivity by embedding multi-bit watermark information into semantic audio latent representations. Our watermarks have semantic interpretation and are thus more likely to be learned by a downstream model through finetuning. LambdaMark includes a lightweight watermark encoder to inject multi-bit message-dependent perturbations into semantic audio representations and a decoder to detect watermark presence and recover the embedded bit information. Encoder and decoder are trained using a custom multi-component loss that preserves fidelity of the watermarked audio, increases bit-level recovery rate, and improves robustness against common distortions and adversarial removal attempts. Experiments show that LambdaMark achieves near-perfect robustness under common distortions. LambdaMark is also the only watermark that is robust against all evaluated removal attacks. Furthermore, LambdaMark exhibits general and robust radioactivity and remains robust to distortions and adversarial removal attacks even on the generated outputs of those finetuned models.

[AI-197] SOHET: Sequence Of Heterogeneous Events Transformer with Self-Supervised Pre-Training

Link: https://arxiv.org/abs/2606.21356
Authors: Kees Jan de Vries,Mustafa Radha,Mathijs de Jong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Many machine learning applications rely on heterogeneous event streams to make predictions, either causally as events arrive or bidirectionally over complete sequences. We propose SOHET (Sequence Of Heterogeneous Events Transformer), a hierarchical architecture combining event-type-specific tabular encoders with temporal and type embeddings, processed by a causal or bidirectional transformer. We introduce three self-supervised pre-training objectives for the causal setting. On a proprietary large-scale real-world this http URL fraud detection task with 17 event types, SOHET outperforms FlexTPP, NAPPT, and CIPPT by 5.8%. Pre-training yields an additional 2.6% gain and 2.4% faster convergence. On the EBES benchmark, bidirectional SOHET matches or exceeds the published best on 6 out of 8 tasks.

[AI-198] Mind the Noise: Sensitivity of Transformer-based Interaction-Aware Trajectory Prediction Models to Noisy Data

Link: https://arxiv.org/abs/2606.21344
Authors: Shahab Salehi,Luca Lusvarghi,Miguel Sepulcre,Javier Gozalvez
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Trajectory prediction allows autonomous vehicles to anticipate the future behavior of surrounding objects (or agents) and, accordingly, maximize the safety and efficiency of their driving. State-of-the-art Transformer-based interaction-aware trajectory prediction models, which rely on attention mechanisms to capture multi-agent interactions and maximize prediction accuracy, are commonly trained and evaluated on long-range high-quality datasets. These datasets are typically obtained by aggregating data from multiple vehicles or drones and removing any object detection or tracking noise offline. Yet, information about a surrounding object’s state (its position, speed, heading) is far from noiseless in real-world deployments. Object state estimation is affected by perception uncertainties and localization errors that can be particularly large for objects received via Vehicle-to-Everything (V2X) communications. In this paper, we analyze the impact of noisy object state information on the trajectory prediction accuracy of a state-of-the-art Transformer-based interaction-aware trajectory prediction model. Our study demonstrates that trajectory prediction accuracy can rapidly deteriorate as the noise intensity increases. Numerical results show that the prediction accuracy can degrade by a factor of 1.3x under small noise levels and by as much as a factor of 3.9x under the highest (yet realistic) noise conditions. These findings reveal the strong sensitivity of trajectory prediction models to noisy data, underscoring the need for more realistic training and evaluation datasets as well as noise mitigation strategies.

[AI-199] DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Link: https://arxiv.org/abs/2606.21337
Authors: Cong Wan,Zeyu Guo,Zijian Cai,Jiangyang Li,SongLin Dong,Lin Peng,Xiangyang Luo,Zhiheng Ma,Yihong Gong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Massive unstructured multimodal streams suffer from high “data entropy,” impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, heavily reliant on heuristic rules or general VLMs, are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability, proposing a paradigm shift towards Agentic Data Tailoring, which actively refines and structures data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline grounding generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building upon this, the DataClaw$_0$-9B model synergizes Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw$_0$-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw$_0$ delivers high-information-density tailored data, facilitating efficient model adaptation to new tasks under limited training data regimes. Project page: this https URL

[AI-200] Ramanujan Graph Rewiring with Non Negative Resistance Curvature ECML KDD2026

Link: https://arxiv.org/abs/2606.21333
Authors: Hugo Attali,Rachid El Jouhri
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ECML PKDD 2026 (Research Track)

Abstract:Graph Neural Networks (GNNs) have emerged as a powerful paradigm for learning on graph-structured data by iteratively propagating and aggregating information across edges. However, conventional message passing schemes often suffer from over-squashing, whereby exponentially large neighborhoods are compressed into fixed-dimensional embeddings, impeding effective long-range dependency learning. In this work, we introduce Ramanujan Propagation, a graph rewiring strategy that leverages Ramanujan graphs to alleviate topological bottlenecks in GNNs. We first establish that suitably chosen Ramanujan graphs guarantee non-negative resistance curvature, which mitigates over-squashing and facilitates efficient information flow. We then propose an algorithmic framework to construct a Ramanujan rewired graph that preserves the local connectivity of the original graph. Our experiments demonstrate that our method outperforms nine state-of-the-art rewiring techniques. These results establish Ramanujan graphs as a rigorous structural prior for scalable, topology-aware message passing in GNNs.

[AI-201] Social World Model for Lifelong Social Intelligence

Link: https://arxiv.org/abs/2606.21315
Authors: Yu Luo
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures

Abstract:Social intelligence is a core competency for language agents, yet current research primarily focuses on static capability evaluation rather than how these skills are continuously shaped and accumulated. This gap calls for a shift toward sustainable learning paradigms. Currently, two methodological pain points exist: social interaction trajectories lack unified structured representations to form iterable learning signals, and capability improvement and retention are typically studied in isolation, hindering the assessment of continuous evolution. To bridge this gap, we propose the Social World Model. We decompose social interaction into five dimensions (scene setting, observation, mental state, action, and dialogue) to build a closed-loop learning framework. In this setup, agents collect interaction experiences, convert them into preference signals for model updating, and redeploy the updated policy for continued learning. Additionally, we provide a reusable data synthesis mechanism and a lifelong learning benchmark, transforming social capabilities from an “object of evaluation” into an “object of sustainable training”. Validating our framework on the ASCENT-Bench, the interactively trained Qwen2.5-7B model outperforms its baseline across all five core metrics. Notably, it matches the closed-source Gemini 3 Flash in completion rate, exceeds it in pass rate, and achieves zero forgetting across three difficulty levels. Unlike prior works that merely report static comparisons or capability decay, this end-to-end approach provides a trainable, verifiable, and retainable pathway, demonstrating that small open-source models can sustainably acquire competitive social coordination capabilities.

[AI-202] Task-Differentiated Atomic Skill Expansion and Routing for Continual Learning Across Highly Heterogeneous Tasks

Link: https://arxiv.org/abs/2606.21307
Authors: Jiacheng Wang,Xinjia He,Qi Ding,Yutao Yang,Jie Zhou,Liyang Yu,Liang Dou,Qin Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual learning (CL) is commonly studied under the assumption that sequential tasks are semantically related or structurally similar. However, in highly heterogeneous settings, where tasks differ substantially in reasoning patterns and input-output formats, existing methods often suffer from catastrophic forgetting and inefficient capacity allocation. To address this challenge, we propose Task-differentiated Atomic Skill Expansion and Routing (TASER), a CL framework that jointly determines how many new atomic skills to introduce for each task and which skills to activate. The framework first uses atomic skill incremental learning to dynamically expand capacity based on task divergence and model uncertainty. It then applies orthogonality-enhanced skill detection to ensure these skills remain semantically distinct and independently reusable. Finally, a skill dynamic routing mechanism composes task-relevant skills through lightweight task-conditioned gating. We further introduce HeteroCLBench, a highly heterogeneous benchmark for CL, comprising 19 diverse tasks across 9 cognitive dimensions under a standardized sequential protocol. Experiments on HeteroCLBench show that TASER consistently outperforms strong baselines by improving plasticity and reducing catastrophic forgetting.

[AI-203] Towards Dys-XAI: Influence-Based Explanations for Dysarthria Severity Assessment INTERSPEECH2026

Link: https://arxiv.org/abs/2606.21306
Authors: Xiaoliang Wu,Qiyang Sun,Yupei Li,Erfan Loweimi,Jennifer Williams,Zhengjun Yue
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2026

Click to view abstract

Abstract:Dysarthria severity assessment is essential for therapy planning and longitudinal monitoring, yet manual perceptual rating is time-consuming and variable across clinicians. Although deep learning models achieve strong performance, their black-box nature limits clinical adoption. Existing speech explainability methods typically provide acoustic feature importance scores that are difficult for end-users to interpret. We propose an influence-based, instance-level explainability framework that explains each decision through supportive and competing training samples. Using gradient-based influence approximations, we compute per-utterance influence scores to identify supportive and competing training samples for each prediction. Controlled deletion experiments from 5 to 20 percent validate the explanations, showing that removing highly influential samples systematically shifts predictions. This approach provides auditable explanations by linking decisions to perceptible reference cases.

[AI-204] NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

Link: https://arxiv.org/abs/2606.21297
Authors: Xinwei Liu(1),Junyuan Liang(1),Zicong Hong(2),Jianting Zhang(3),Wuhui Chen(1) ((1) Sun Yat-sen University, China, (2) EPFL, Switzerland, (3) Purdue University, USA)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Augmenting model-free reinforcement learning (RL) with representations learned through observation dynamics prediction (observation-predictive RL) can improve sample efficiency and performance, with minor modifications and limited additional computation. However, this approach still struggles in challenging tasks with low-dimensional observations. In this paper, we identify a key factor behind this problem: unbalanced reconstruction losses across observation dimensions, where dimensions with larger value ranges dominate the loss. This encourages the agent to neglect dimensions with relatively small ranges, leading to degraded performance. To address this issue, we propose a novel normalization method tailored to online RL, which normalizes low-dimensional observations and balances the resulting losses and gradients. Beyond balancing reconstruction losses, observation normalization enables dynamics prediction to be performed in a normalized observation space, thereby providing a unified treatment of low- and high-dimensional inputs (e.g., physical states and images). Building on this idea, we further introduce Normalized Observation Space Dynamics-Augmented Q-learning (NASDAQ), a framework for observation-predictive RL applicable across diverse domains. NASDAQ learns state-action representations by coupling value learning with two auxiliary tasks: short-term value prediction and next normalized observation prediction. Extensive experiments demonstrate that NASDAQ achieves competitive or superior performance compared with state-of-the-art model-based and self-predictive RL methods, while requiring significantly less training wall-time.
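The loss imbalance described in this abstract can be illustrated with a small sketch. It is not the paper's normalization method: it assumes a plain running mean/variance normalizer (Welford's algorithm) and a hypothetical two-dimensional observation whose dimensions differ in scale by five orders of magnitude. The names `RunningNormalizer` and `squared_errors` are made up for the illustration.

```python
import math


class RunningNormalizer:
    """Per-dimension running mean and variance via Welford's algorithm."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs, eps=1e-8):
        if self.count == 0:
            return list(obs)
        return [
            (x - self.mean[i]) / math.sqrt(self.m2[i] / self.count + eps)
            for i, x in enumerate(obs)
        ]


def squared_errors(pred, target):
    """Per-dimension squared prediction error."""
    return [(p - t) ** 2 for p, t in zip(pred, target)]


# Hypothetical observations: dim 0 spans about +/-1000, dim 1 about +/-0.01.
normalizer = RunningNormalizer(dim=2)
for s in (-1.0, -0.5, 0.0, 0.5, 1.0):
    normalizer.update([1000.0 * s, 0.01 * s])

target = [500.0, 0.005]
pred = [600.0, 0.006]  # same relative error in both dimensions

raw_loss = squared_errors(pred, target)
norm_loss = squared_errors(normalizer.normalize(pred), normalizer.normalize(target))

print("raw per-dimension loss:", raw_loss)
print("normalized per-dimension loss:", norm_loss)
```

In the raw space the large-range dimension contributes almost all of the loss, while after normalization the two dimensions contribute nearly equally, which is the balancing effect the abstract refers to.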

[AI-205] Topological Neural Dynamics: A Neuron-wise Framework for Sequence Modeling

Link: https://arxiv.org/abs/2606.21295
Authors: Borui Cai,Yao Zhao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Existing sequence models, including RNNs, LSTMs, continuous-time networks, and Transformers, share a common structural principle: layer-wise dynamics, where all neurons in the same layer co-evolve through a shared parameterized operator, leaving individual neurons no freedom to evolve independently. Yet in many complex dynamical systems, rich global behavior emerges precisely from locally evolving units interacting through structured connectivity. Inspired by this principle, we introduce Topological Neural Dynamics (TND), a sequence modeling framework that shifts computation from layer-wise to neuron-wise dynamics. TND represents a neural system as a directed neuron graph, an interaction operator, and a local dynamics function, where each neuron evolves independently and collective computation emerges from interactions through the explicit graph topology. We instantiate TND as a discrete-time graph-coupled dynamical system and evaluate it as a case study on a behavior cloning task in single-player Pong. Compared with Vanilla RNN, Sparse RNN, LSTM, Closed-form continuous-time neural network (CfC), and Transformer baselines, TND achieves the best catch rate and a mean of 17.47 consecutive catches per round, more than three times that of the strongest baseline. These results suggest that shifting from layer-wise to neuron-wise dynamics provides an effective inductive bias for sequence modeling.

[AI-206] An Empirical Study of OpenPangu Quantization on Ascend NPUs

Link: https://arxiv.org/abs/2606.21257
Authors: Tong Shi,Jiacheng Wang,Hui Xie,Ying Li,Aishan Liu,Jinyang Guo,Xianglong Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:OpenPangu models are attractive targets for private and domestic large-language-model deployment, yet their robustness under aggressive post-training quantization on Ascend NPUs has not been systematically characterized. This paper conducts a controlled empirical study of OpenPangu 1B and 7B models on Huawei Ascend 910B1 NPUs. We evaluate representative weight-only and weight-activation post-training quantization methods, including RTN, GPTQ, AWQ, SmoothQuant, GPTAQ, BiLLM, and SliM-LLM, under a unified calibration and evaluation protocol. Across 18 evaluation tasks, we find that 8-bit weight-only quantization is effectively lossless for both models, while 4-bit quantization remains practical for the 7B model but is visibly more harmful for the 1B model on reasoning, math, and code tasks. Ultra-low precision remains challenging: most 2-bit and binary settings collapse to near-random behavior, and W4A4 SmoothQuant produces non-finite perplexity in our evaluation. These results provide an NPU-oriented accuracy map for selecting OpenPangu quantization settings and highlight the persistent difficulty of extreme low-bit compression.
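As a reference point for the bit-width trend reported here, the following is a minimal sketch of round-to-nearest (RTN) weight quantization, the simplest of the listed methods. It assumes symmetric per-tensor scaling and a hypothetical toy weight vector; the paper's calibration protocol, grouping, and Ascend kernels are not reproduced.

```python
def rtn_quantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization to signed integers."""
    if bits < 2:
        raise ValueError("this symmetric scheme needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical toy weights, used only to show how error grows as bits shrink.
weights = [0.9, -0.45, 0.12, 0.03, -0.7, 0.33]
errors = {}
for bits in (8, 4, 2):
    codes, scale = rtn_quantize(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit mean abs error: {errors[bits]:.4f}")
```

On this toy vector the reconstruction error grows as the bit width drops. That matches the direction of the abstract's findings, although real model degradation depends on much more than per-weight error.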

[AI-207] Recency/Frequency Adaptive KV Caching for Large Language Model Serving ICML2026

Link: https://arxiv.org/abs/2606.21238
Authors: Yang Shen,Meghana Madhyastha,Robert Underwood,Bogdan Nicolae,Randal Burns
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM)

Click to view abstract

Abstract:Key-value (KV) caching is a powerful technique for accelerating large language model inference and generation. Inference workloads are large and diverse, which makes them difficult to cache effectively. Existing cache management strategies adopt the least-recently-used policy for evicting cache blocks. However, LRU leads to multiple unrelated workloads flushing each other’s caches. To address this, we integrate adaptive caching that dynamically allocates cache space between recently and frequently occurring KV blocks. Evaluations show that it improves the KV cache hit rate by up to 10.8% and reduces time to first token by up to 12.6% over naive vLLM on synthetic document question answering workloads, and 2.1% and 2.0% respectively on real-world conversation workloads. The method generalizes well to batch inference and demonstrates clear interpretability while effectively accommodating diverse workloads.
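The cache-flushing problem mentioned above can be reproduced with a toy simulation. This sketch compares plain LRU with a two-segment cache that keeps a recency list and a frequency list. It deliberately uses a fixed split between the two segments, whereas the abstract describes an adaptive allocation, so it only illustrates the motivation and is not the paper's policy. Block names and the workload are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Plain least-recently-used cache over block identifiers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, key):
        hit = key in self.blocks
        if hit:
            self.blocks.move_to_end(key)
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[key] = True
        return hit


class TwoSegmentCache:
    """Recency segment for first-time blocks, frequency segment for re-used ones."""

    def __init__(self, capacity, frequent_share=0.5):
        self.freq_cap = max(1, int(capacity * frequent_share))
        self.recent_cap = max(1, capacity - self.freq_cap)
        self.recent = OrderedDict()
        self.frequent = OrderedDict()

    def _insert_recent(self, key):
        if len(self.recent) >= self.recent_cap:
            self.recent.popitem(last=False)
        self.recent[key] = True

    def access(self, key):
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return True
        if key in self.recent:
            # Second touch: promote to the frequency segment.
            del self.recent[key]
            if len(self.frequent) >= self.freq_cap:
                demoted, _ = self.frequent.popitem(last=False)
                self._insert_recent(demoted)
            self.frequent[key] = True
            return True
        self._insert_recent(key)
        return False


def run(cache, rounds=20):
    """Hot blocks are read twice per round, then a one-off scan passes through."""
    hits = total = 0
    for r in range(rounds):
        trace = [f"hot{i}" for i in range(4)] * 2
        trace += [f"scan{r}_{i}" for i in range(8)]
        for key in trace:
            hits += cache.access(key)
            total += 1
    return hits, total


lru_hits, total = run(LRUCache(capacity=8))
seg_hits, _ = run(TwoSegmentCache(capacity=8))
print(f"LRU hit rate: {lru_hits / total:.3f}")
print(f"two-segment hit rate: {seg_hits / total:.3f}")
```

Under LRU each scan evicts the hot blocks, so they must be re-fetched every round. The frequency segment shields them from the scan, which is the effect an adaptive recency/frequency policy aims to obtain without hand-tuning the split.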

[AI-208] FleetAgent: Teleoperation Assistant for Autonomous Fleets via Vectorized V2N Messages

Link: https://arxiv.org/abs/2606.21222
Authors: Juntong Peng,Qi Chen,Deyuan Qu,Takayuki Shimizu,Yaobin Chen,Ziran Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large-scale autonomous fleets rely on teleoperation to resolve rare failures, yet streaming raw sensor data from many vehicles is costly, and remote operators can only monitor a limited number of vehicles at a time. We introduce FleetAgent, a cloud-hosted multimodal large language model (MLLM) assistant that consumes compact vectorized vehicle-to-network (V2N) messages, such as map elements, detected objects, and the ego planned path. It provides a structured natural-language response (including narration, explanation, and evaluation of the plan and scene), along with an intervention urgency score for operator prioritization. To make structured messages compatible with token-based MLLMs, we propose VecFormer, a vector-to-embedding interface with differentiable top-K context selection that bounds context length and GPU KV-cache growth, enabling more efficient batch processing, which is important under the context of cloud-hosted large-scale fleet management. We also construct VecEval, a nuScenes-derived dataset with paired human and synthetic imperfect plans and human-verified language labels, to facilitate the training and evaluation of our proposed system. Our proposed system can reduce uplink payload by up to 625 times compared with raw images and reduce KV-cache memory by 16.54 times compared with original text descriptions. On VecEval, FleetAgent improves Lingo-Judge score by 16.8% and reduces intervention failure rate by 19.9%, compared with Qwen2.5-VL-7B using language descriptions. These results demonstrate that FleetAgent can utilize compact structured V2N messaging to enable efficient, explainable teleoperation monitoring for autonomous fleets.

[AI-209] Whistleblowing and the machine – towards a considered position AAMAS-26

Link: https://arxiv.org/abs/2606.21201
Authors: Marija Slavkovik,Liuwen Yu,Leon van der Torre,Reka Markovich
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at AAMAS-26 Workshop on Rebellion and Disobedience in AI, see this https URL

Click to view abstract

Abstract:Artificially intelligent agents and autonomous systems are embedded in our environments. They are both a commercial product and a personal tool that generates a lot of data and can draw conclusions from it: machines generate and keep secrets. But should machines protect all secrets? It has been shown that artificial agents are able to whistleblow and it has been argued that digital multi-agent environments should allow for agents in them to whistleblow. We argue that machine whistleblowing must be normative and principled and rooted in the existing understanding of whistleblowing as an important rule-breaking mechanism in society. We also argue that there is a need for government regulators to formulate an informed stance on both what machines should be allowed to whistleblow on and how to legally protect those who develop whistleblowing machines.

[AI-210] TF-SNO: Time-Frequency Gated Spectral Neural Operators for Learning Non-Stationary Partial Differential Equations KDD2026

Link: https://arxiv.org/abs/2606.21189
Authors: Yitian Zhou,Chaoning Zhang,Zhenzhen Huang,Haoxuan Yu,Jiaquan Zhang,Yiran Li,Fan Mo,Kuien Liu,Jie Zou,Caiyan Qin,Yang Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026

Click to view abstract

Abstract:Non-stationary partial differential equations (PDEs) arise throughout scientific computing, where the dominant frequency content and energy distribution can drift over time. While efficient in PDE solving, many spectral neural operators apply a shared spectral response across rollout stages, leading to mismatch with time-varying spectra in non-stationary systems. To address this issue, we propose Time-Frequency Gated Spectral Neural Operator (TF-SNO), a state-adaptive framework with learnable time-frequency gating inside spectral blocks. TF-SNO extracts compact frequency-domain and physical-space statistics from the current state to generate modulation coefficients, enabling the spectral response to evolve with the dynamics. TF-SNO learns temporal variation implicitly from the evolving state without introducing an explicit time dimension or time embedding, keeping the modeling complexity low. We further embed the adaptive operator blocks to accurately capture the multi-scale features, thereby improving long-horizon stability. Experiments on six non-stationary PDE benchmarks in 1D and 2D demonstrate that TF-SNO significantly reduces prediction errors and improves robustness compared to strong baselines, with particularly clear gains in long rollout, suggesting the effectiveness of state-dependent spectral adaptation in modeling non-stationary physical systems.

[AI-211] Inverting the Bellman Equation: From Q-Values to World Models

Link: https://arxiv.org/abs/2606.21173
Authors: Alistair Letcher,Mattie Fellows,Alexander D. Goldie,Jonathan Richens,Jakob N. Foerster,Oliver Richardson
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 48 pages, 11 figures, 5 tables

Click to view abstract

Abstract:Model-based and model-free reinforcement learning are traditionally viewed as separate paradigms: instead of learning a model of the transition kernel P, model-free agents typically estimate value functions tied to a specific policy and reward. In this paper, we challenge this dichotomy by proving that value-based agents trained on a sufficiently rich set of reward functions, e.g. using goal-conditioned RL, implicitly encode a unique and accurate world model. To extract this model in practice, we introduce P-learning, an inverse analogue to Q-learning that samples from an agent’s Q-values, policies and rewards to decode its internal model of the environment. We then provide sufficient conditions on the type and number of goals for which agents encode the true kernel P, covering both stochastic and deterministic MDPs over finite or continuous state spaces. Even when our assumptions are violated, we empirically demonstrate that agents trained on a handful of reward functions encode accurate dynamics in Reacher, MountainCar and stochastic variants of FourRooms. Surprisingly, we find that policies trained exclusively on a Reacher agent’s implicit world model are quasi-optimal on out-of-distribution, velocity-based goals despite position-only training – suggesting that agents contain hidden generalisation capabilities and providing a new lens into the connection between model-based, model-free, and goal-conditioned RL.
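To see why Q-values for many rewards can pin down the transition kernel, it helps to write the Bellman equation as a linear system in P. The derivation below is a standard finite-state illustration, not the paper's theorem. It assumes a discount factor γ > 0, and the index i over reward functions, the policies π_i, and the matrix M are notation introduced here.

```latex
% Bellman equation for reward r_i and policy \pi_i under a shared kernel P:
Q_i(s,a) = r_i(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_i(s'),
\qquad
V_i(s') = \sum_{a'} \pi_i(a' \mid s')\, Q_i(s',a').

% Rearranged: one linear constraint on P(\cdot \mid s,a) per reward function:
\sum_{s'} P(s' \mid s,a)\, V_i(s') = \frac{Q_i(s,a) - r_i(s,a)}{\gamma},
\qquad i = 1, \dots, n.

% Stacked over the n reward functions for a fixed pair (s,a):
M\, p_{s,a} = b_{s,a},
\qquad
M_{i s'} = V_i(s'),
\qquad
(b_{s,a})_i = \frac{Q_i(s,a) - r_i(s,a)}{\gamma}.
```

If the matrix M of value functions has full column rank, the vector p_{s,a} = P(· | s,a) is the unique solution. This gives one concrete reading of "a sufficiently rich set of reward functions"; the paper's actual sufficient conditions may be stated differently.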

[AI-212] An Exploratory Case Study of LLM-Assisted Refactoring and Gameplay Feature Generation in an Endless Runner Game

Link: https://arxiv.org/abs/2606.21171
Authors: Jan Wunderlich,Markus Kleffmann,Sebastian Lempert
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure, 4 tables

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used to support software development, but their practical usefulness in applied game-development settings remains underexplored, especially when generated code must be integrated into an existing game software system. This paper presents an exploratory empirical case study of GPT-4o in a custom Python/Pygame endless runner. The study examines six selected development tasks: three localized refactoring tasks and three tasks involving gameplay feature generation. The resulting implementations were evaluated using software metrics, unit tests, and manual gameplay assessments. In this case study, all three selected refactoring tasks were completed successfully in functional terms, whereas only one of the three selected gameplay feature generation tasks resulted in a correctly integrated feature. The findings suggest that, in this setting, GPT-4o handled localized transformations more reliably than tasks requiring new gameplay interactions across multiple existing systems. Given the exploratory single-case design, these results are best interpreted as indicative observations rather than as generalizable evidence of category-level model performance. Overall, the paper contributes a transparent case-based account of the opportunities and limitations of LLM-assisted refactoring and gameplay feature generation in an existing game software system.

[AI-213] Trip+: Benchmarking Agents in Personalized Interactive Travel Planning

Link: https://arxiv.org/abs/2606.21169
Authors: Junle Chen,Wei Chen,Yehong Xu,Zhengjun Huang,Yuqian Wu,Zhoujin Tian,Kai Wang,Lei Wang,Xiaofang Zhou
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Interactive travel planning has become a popular use case for language models. Agents are deployed to manage evolving preferences and unexpected disruptions over multiple turns. Such settings require models to make complex, profile-conditioned planning decisions. However, existing benchmarks often evaluate feasibility, personalization, or interaction in relatively isolated settings. We therefore introduce Trip+ to measure the ability of agents to plan travel holistically. In Trip+, given traveler profiles and dynamic interactions, agents must generate and revise minute-level itineraries. End-to-end traveler experiences are evaluated via an LLM-based simulator, enabling the assessment of subjective metrics like fatigue. Our scenarios range from simple request resolutions to complex environment-driven replanning. We evaluate 18 LMs and find a consistent gap in experiential quality. Models favor technically feasible but exhausting itineraries that diverge sharply from profiled traveler preferences.

[AI-214] OmniV2X: A Generative Foundation Planner for Efficient End-to-End Cooperative Driving

Link: https://arxiv.org/abs/2606.21165
Authors: Juntong Peng,Juanwu Lu,Yupeng Zhou,Can Cui,Yaobin Chen,Ziran Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Click to view abstract

Abstract:We present OmniV2X, a generative foundation model for vehicle-to-everything (V2X) cooperative driving. The model directly interprets independent context sequences comprising multi-modal and multi-agent observations. The new design mitigates the computational cost of dense 3D perception, the vulnerability to data scarcity in cooperative scenarios, and the poor compliance with standardized messaging in existing methods that fuse multi-modal inputs into a shared representation. For training, we present an end-to-end supervised pipeline using a downstream trajectory generation loss, in which a high-capacity generative sequence planner implicitly learns to steer the model and leverage multi-modal inputs via cross-attention injection. As a foundation model, we demonstrate that OmniV2X pre-trained on large-scale single-agent planning datasets can efficiently adapt to cooperative environments by integrating the conditioning context with lightweight, standard-compliant V2X tokens. Evaluated on the DAIR-V2X-Seq dataset, OmniV2X outperforms existing end-to-end cooperative driving baselines, achieving state-of-the-art performance with less than 10% of the fine-tune V2X dataset and less than 1% of the communication bandwidth. We conduct comprehensive evaluations to demonstrate its computational efficiency and robustness under real-world constraints.

[AI-215] Context-Aware Generative AI for Automated Telecom Test Script Generation

Link: https://arxiv.org/abs/2606.21151
Authors: Gautam Prasad,Chandramohan T. N.,Joy Bose
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Click to view abstract

Abstract:Automated test generation for telecom software systems and networks has advanced significantly with the adoption of machine learning and rule-based approaches. However, most existing solutions generate static test suites against a snapshot of the system; as code, configurations, topologies, and key performance indicators (KPIs) evolve, these tests quickly become outdated or misaligned with the live system. There is currently no widely adopted solution that continuously detects fine-grained changes and selectively adapts only the affected tests without regenerating entire test suites. This paper presents a context-aware generative AI framework for automated telecom test script generation that treats testing as a continuously adapting process driven by the current state of the system rather than a static artifact. The central contribution is delta-conditioned test generation over a live knowledge graph: our approach employs a continuously updated knowledge graph (KG) as a single source of truth, a delta engine for fine-grained change detection, and a KG-guided generative AI agent, operating via the Model Context Protocol (MCP), to create, update, or retire test cases automatically. We further integrate Retrieval-Augmented Generation (RAG) to enrich reasoning with telecom-domain knowledge and historical artifacts. We demonstrate applicability across software-system and telecom-network use cases, including a Python-based KPI monitoring application managed in GitLab, and show how the framework reduces manual effort, improves test relevance, and accelerates test cycles.

[AI-216] AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?

Link: https://arxiv.org/abs/2606.21147
Authors: Jiaxi Yang,Chaewan Chun,Jason Lucas,Yuchen Yang,Dongwon Lee
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Audio Language Models (LALMs) have demonstrated strong performance across a wide range of audio tasks. As they are increasingly deployed in real-world applications, ensuring their safety alignment has become more important. Although refusal mechanisms serve as a key safeguard by preventing LALMs from responding to harmful requests, they can also lead to over-refusal, where models incorrectly reject benign queries. This issue is especially challenging in the audio domain because speech that appears harmful in isolation may become benign when interpreted together with the surrounding acoustic context, such as background sounds. To study this problem, we introduce AOR-Bench (Audio Over-Refusal Benchmark), the first benchmark for over-refusal specifically designed for LALMs. AOR-Bench contains 3,000 pseudo-harmful audio samples across six scenario categories. Evaluating 12 representative LALMs from six major model families, we find that over-refusal is widespread and uncover several important patterns in their safety judgments. As a preliminary effort to mitigate this issue, we further explore two lightweight strategies (e.g., Chain-of-Thought and activation steering) to reduce over-refusal.

[AI-217] AgentMeter: Evaluating Model-CLI Matching for CLI-Based Local Task-Solving Agents

Link: https://arxiv.org/abs/2606.21140
Authors: Han Chi,Jiaxin Qi,Yan Cui,Baisheng Lai,Jianqiang Huang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 5 tables

Click to view abstract

Abstract:LLM agents increasingly solve local tasks through command-line and CLI-based harness interfaces, including code editing, repository inspection, data analysis, and file workflows. Existing evaluations often emphasize task success, but deployed local agents are not models alone: the CLI mediates prompts, context replay, tool outputs, file access, terminal observations, and stopping behavior. As a result, the same model can produce different success, token, and cost profiles under different CLIs. We introduce AGENTMETER, a benchmark for evaluating model-CLI matching in CLI-mediated local task-solving agents, together with AgentMeter Score (AMS), a success-anchored, cost-aware metric over calibrated task-effort tiers. AgentMeter uses Benchmark90 as the full validation set and Core30 as a lower-cost subset for expanded comparison across 24 complete model-CLI configurations. On Core30, common deployment criteria select different configurations: highest Pass/30 selects GLM-5.1 with qwen-coder, lowest Tok./Pass selects GPT-5.3-Codex with kimi-cli, lowest billable USD/Pass selects Qwen3.6+ with Codex, while highest AMS selects Qwen3.6+ with kimi-cli. Benchmark90 validation preserves the Top-1 configuration and Top-3 set, with Spearman correlation 0.765, Kendall correlation 0.567, and AMS MAE 0.0383. These results show that model choice and CLI choice should not be decoupled, and that model-CLI configurations should be evaluated as the deployed unit.

[AI-218] PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Link: https://arxiv.org/abs/2606.21139
Authors: Youngjoon Jeong,Jihwan Yu,Minsoo Jo,Junha Chun,Taesup Kim
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Click to view abstract

Abstract:Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation that entangles transition extent and transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radial-direction structure on latent actions, encouraging radius to encode transition extent and direction to retain transition mode. PoLAR uses temporal offset between two observations as a weak proxy for transition extent, encouraging latent action from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose expanding volume with radius offers a natural fit for more diverse transition modes at larger extents. Across in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained VLAs. These results suggest that the geometry of the latent action space is an important design choice for transferring visual pretraining to downstream robot policy learning.
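The radial-direction idea can be made concrete with a short geometric sketch. It assumes the unit Poincaré ball (curvature -1) as the hyperbolic model and the exponential map at the origin as the embedding. The abstract does not say which hyperbolic model or mapping PoLAR uses, so both choices, the function names, and the latent vectors are illustrative assumptions.

```python
import math


def to_polar(latent):
    """Split a latent vector into a radius (extent) and unit direction (mode)."""
    radius = math.sqrt(sum(v * v for v in latent))
    if radius == 0.0:
        return 0.0, [0.0] * len(latent)
    return radius, [v / radius for v in latent]


def to_poincare_ball(radius, direction):
    """Exponential map at the origin of the unit Poincare ball."""
    return [math.tanh(radius) * d for d in direction]


def hyperbolic_distance_from_origin(point):
    """Geodesic distance from the origin in the unit Poincare ball."""
    norm = math.sqrt(sum(v * v for v in point))
    return 2.0 * math.atanh(norm)


# Hypothetical latent actions with the same mode but different temporal gaps.
short_gap = [0.3, 0.4]
long_gap = [1.2, 1.6]

for name, latent in (("short gap", short_gap), ("long gap", long_gap)):
    radius, direction = to_polar(latent)
    point = to_poincare_ball(radius, direction)
    dist = hyperbolic_distance_from_origin(point)
    print(f"{name}: radius={radius:.2f}, direction={direction}, distance={dist:.2f}")
```

Both latents share one direction, so under this reading they describe the same transition mode, and only the radius separates the short-gap pair from the long-gap pair. Under this convention the hyperbolic distance from the origin is twice the tangent-space radius.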

[AI-219] Learning Burst-Aware Early Warning Models for Capacity Stress under AI Workload Surges in Hyperscale Data Centers

Link: https://arxiv.org/abs/2606.21130
Authors: Zihan Yu,Xianling Zeng,Zhiming Xue,Yalun Qi,Sichen Zhao
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The rapid growth of large-scale AI workloads, particularly Large Language Model (LLM) training and inference, is fundamentally reshaping the operational dynamics of hyperscale data centers. Unlike traditional cloud workloads, AI-driven jobs exhibit bursty, high-intensity, and rapidly shifting resource demands, often leading to sudden capacity stress that cannot be effectively handled by reactive threshold-based mechanisms. In this paper, we propose a deployment-oriented, burst-aware early warning framework for proactive capacity stress prediction under AI workload surges. We formulate the problem as a high-recall forecasting task over multivariate telemetry windows, with the explicit goal of enabling operational intervention before system degradation occurs. The proposed framework integrates workload intensity, temporal variation, and system pressure signals, and employs a lightweight tree-based learning model to capture nonlinear interactions in highly imbalanced environments. To evaluate the system under realistic conditions, we introduce an AI workload surge injection methodology that simulates burst-driven demand patterns observed in large-scale AI systems. Our XGBoost-based model achieves an ROC AUC of 0.697 and an AP of 0.670, significantly outperforming baseline methods. Under deployment-oriented threshold selection, the framework achieves a Recall of 0.914, enabling the detection of the majority of stress-prone periods with acceptable false-alarm cost. Beyond predictive performance, we show how the proposed framework can be integrated into operational control loops to support proactive actions such as workload throttling and resource scaling. Our results highlight the practical value of high-recall, learning-based early warning systems in enabling resilient and adaptive data center operations in the era of AI-driven workloads.
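The "deployment-oriented threshold selection" mentioned above can be sketched as picking the highest alert threshold that still meets a target recall, which keeps false alarms as low as the recall constraint allows. The abstract does not specify the selection rule, so this rule, the function names, and the scores and labels below are assumptions made for illustration.

```python
def recall_at(scores, labels, threshold):
    """Fraction of true stress windows flagged at the given threshold."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    flagged = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return flagged / positives


def false_alarms_at(scores, labels, threshold):
    """Number of non-stress windows flagged at the given threshold."""
    return sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)


def pick_threshold(scores, labels, target_recall):
    """Highest threshold whose recall on validation data meets the target."""
    for t in sorted(set(scores), reverse=True):
        if recall_at(scores, labels, t) >= target_recall:
            return t
    return min(scores)


# Hypothetical validation scores from a stress classifier (1 = stress window).
scores = [0.95, 0.80, 0.70, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

threshold = pick_threshold(scores, labels, target_recall=0.8)
print("threshold:", threshold)
print("recall:", recall_at(scores, labels, threshold))
print("false alarms:", false_alarms_at(scores, labels, threshold))
```

Lowering the threshold further would raise recall, but every additional step also admits more false alarms. That is the trade-off the abstract summarizes as "acceptable false-alarm cost".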

[AI-220] Chem2Gen-Bench: Benchmarking Chemical-to-Genetic Translation in Perturbation Response Space

Link: https://arxiv.org/abs/2606.21109
Authors: Yuxiang Lin,Ying Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Virtual-cell and perturbation models are increasingly used to predict cellular responses for biomedical discovery, but chemical and genetic perturbations are not automatically interchangeable. Existing evaluations often study chemical response prediction or genetic perturbation prediction separately, leaving target-matched chemical-to-genetic translation under-tested. We introduce Chem2Gen-Bench, a benchmark comprising 260,084 chemical and 1,099,045 genetic perturbation profiles organized into cell-target contexts, and evaluate pairwise alignment, retrieval, protocol covariate associations, feature spaces, and foundation-model embeddings. Across matched contexts, translation fidelity is measurable but heterogeneous; background adjustment increases the association between pairwise similarity and retrieval success, while paired tests show lower mean retrieval success after adjustment under the evaluated settings. In a target-matched K562 audit, the evaluated foundation-model embeddings did not consistently improve over gene-delta baselines. Chem2Gen-Bench provides an auditable framework for testing when chemical and genetic perturbations align around shared targets and when representation gains are supported by matched perturbation evidence.

[AI-221] SLeDGe: Semi-Supervised Learning on Data Streams with Graph Structure Learning

Link: https://arxiv.org/abs/2606.21096
Authors: Heechan Moon,Kijung Shin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Semi-supervised learning (SSL) on data streams is challenging due to the continuous evolution of high-volume data and the scarcity of labels. Existing methods are limited in leveraging the intrinsic relationships among samples because they typically rely on fixed similarity measures or static graph structures, which cannot capture how relationships evolve over time. We propose SLeDGe, an SSL method for data streams that jointly learns a predictive model and an adaptive graph structure under strict memory and label constraints. SLeDGe maintains compact labeled and unlabeled memories using distinct update strategies, balancing rapid adaptation to novel features with the retention of historical consistency. In addition, by encouraging sparsity in the relational graph, SLeDGe filters out spurious connections and enables effective propagation of label supervision. Across 12 datasets, SLeDGe outperforms state-of-the-art competitors, achieving average relative accuracy gains of 31.7% with 0.1% labels and 14.8% with 1% labels.

[AI-222] Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training

Link: https://arxiv.org/abs/2606.21090
Authors: Jianzhe Lin
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages

Click to view abstract

Abstract:Self-improvement can self-regress. In REINFORCE post-training for code, a model can quickly improve on its optimized metric and then collapse within the same training campaign. We study this in a controlled multi-seed testbed using Qwen-2.5-3B and Qwen-2.5-7B, trained on competitive-programming tasks with binary CodeGrader reward across 10 sequential 20-step campaigns. Across campaigns, pass@1 shows a robust rise-then-collapse pattern: it peaks within tens of gradient steps and then falls back, sometimes to near zero. This is not cross-task catastrophic forgetting, but within-task policy over-optimization on a fixed distribution; KL- and EWC-style constraints do not prevent it. We ask where the control loop should sit. We compare three levels: CARE, a between-campaign memory mechanism with a capability posterior, transfer gate, and regression-aware belief revision; ES, a within-campaign early-stop rule that rolls forward the peak checkpoint and sets the next budget to peak_step+3; and GRPO, which changes the RL update using group-relative reward normalization. The answer is regime-dependent. On Qwen-2.5-3B, where naive REINFORCE is fragile, CARE v2 nearly doubles end-of-chain pass@1 from 4.9% to 9.5%, with paired bootstrap 95% CI [+0.4,+8.9] and gains in 4/5 seeds. On Qwen-2.5-7B, CARE reaches parity with naive REINFORCE, 13.8% vs. 11.8%, while ES reaches 22.2% [14.1,28.0]. Out-of-the-box GRPO reaches 20.7% [15.7,25.1], nearly matching REINFORCE+ES. GRPO raises the floor but does not remove the cliff. Its 7B gain mainly comes from better between-campaign carryover, while the within-campaign peak-to-end gap remains about 17 points under both REINFORCE and GRPO. GRPO+ES gives mixed evidence: 2/3 seeds improve, but one final cliff lowers the mean to 17.0% [0.0,28.1]. A Gemma-3-4B pilot shows the same signature, suggesting the phenomenon is not limited to Qwen. 
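
The early-stop (ES) rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `early_stop_campaign`, the 1-indexed step convention, and the tie-break (first step that attains the maximum) are assumptions. Only the `peak_step + 3` budget comes from the abstract.

```python
def early_stop_campaign(pass_at_1_by_step, slack=3):
    """Pick the peak checkpoint of one campaign and the next campaign's budget.

    pass_at_1_by_step: pass@1 measured after each gradient step (step 1 first).
    slack: extra steps granted beyond the peak; the abstract uses peak_step + 3.
    """
    if not pass_at_1_by_step:
        raise ValueError("need at least one evaluated step")
    # Index of the first step that attains the maximum pass@1 (assumed tie-break).
    best_index = max(range(len(pass_at_1_by_step)),
                     key=pass_at_1_by_step.__getitem__)
    peak_step = best_index + 1  # steps are 1-indexed in this sketch
    return {
        "peak_step": peak_step,                      # checkpoint to roll forward
        "peak_pass_at_1": pass_at_1_by_step[best_index],
        "next_budget": peak_step + slack,            # step budget for the next campaign
    }


# Hypothetical rise-then-collapse curve over five steps.
curve = [0.05, 0.12, 0.20, 0.18, 0.02]
decision = early_stop_campaign(curve)
```

With this curve the rule would roll forward the step-3 checkpoint instead of the collapsed final one, and would cap the next campaign at six steps.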

[AI-223] Repeated post-training is not Self-improving: Diagnosing Scientific Amnesia in Continual DPO Pipelines

Link: https://arxiv.org/abs/2606.21089
Authors: Jianzhe Lin, Fei Wang, Xiaolin Li, Rajeshkumar Golani, Jubin Chheda
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages

Click to view abstract

Abstract:Industrial LLM teams often ship behavior updates by repeatedly DPO-training a base model on sequences of related preference-data campaigns. The dominant failure mode in this regime is not always classical catastrophic forgetting: a pipeline may preserve previously learned behaviors while still failing to accumulate reusable methodological knowledge about how to train the next campaign. We call this failure mode scientific amnesia. This paper turns that practitioner intuition into a measurable industrial problem. We contribute: (i) a diagnostic suite for amnesia, (ii) a Program-based pipeline that chains FSDP-sharded DPO checkpoints across Qwen2.5-7B-Instruct runs, (iii) a 30-campaign HumanEval subdomain benchmark, and (iv) a comparative diagnostic study of five strategy proposers: random memory, rule-based scheduling, retrieval-only memory, warm-start Bayesian optimization, and MSCL, a meta-scientific memory and reasoner candidate. Across a single-seed 5-condition × 3-step real-LM chain, 4 of 5 candidates degrade in step-level peak pass@1, including MSCL; only the deliberately conservative rule-based schedule improves. Follow-up pilots qualify rather than overturn this finding: in a heterogeneous chain, MSCL is the only completed candidate that improves, whereas in a small multi-seed homogeneous sweep, retrieval-only has the best mean Delta and no pairwise candidate gap is statistically distinguishable. The contribution is therefore diagnostic, not a claim that MSCL solves the problem: scientific amnesia is observable in a production-like continual-DPO pipeline, and conclusions about interventions depend sharply on chain regime, evaluator design, and seed coverage.

[AI-224] Sim2O: Efficient Offline-to-Online MARL via Joint Action Composition

Link: https://arxiv.org/abs/2606.21085
Authors: Bingchang Song, Yiqin Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Offline-to-online adaptation serves as a pivotal paradigm for mitigating the prohibitive cost of online exploration by bootstrapping reinforcement learning from offline datasets. While this paradigm has been extensively studied in single-agent settings, its extension to Multi-Agent Reinforcement Learning (MARL) remains largely unexplored, despite its critical relevance to complex coordinated decision-making. To bridge this gap, we introduce Sim2O, an elegant and minimalist framework for offline-to-online MARL. Rather than treating adaptation as a monolithic joint decision, Sim2O conceptualizes it as a compositional process. Specifically, candidate joint actions are synthesized by dynamically blending offline and online action proposals across agents. By leveraging a centralized value function to evaluate these hybrid combinations, Sim2O identifies high-value coordination strategies without requiring auxiliary training objectives or structural overhead. Empirical evaluations across diverse benchmarks demonstrate that Sim2O significantly outperforms existing baselines, underscoring that a minimalist design is not only viable but highly effective for multi-agent offline-to-online adaptation.
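
The joint action composition idea (mix each agent's offline and online proposal, then let a centralized value function pick the best hybrid) might look like the sketch below. It is an illustration under stated assumptions rather than the Sim2O implementation: `compose_joint_action`, the exhaustive enumeration of all 2^n mixes, and the toy value function are hypothetical, and the abstract does not say how candidates are actually generated or scored.

```python
from itertools import product


def compose_joint_action(offline_actions, online_actions, centralized_value):
    """Return the best hybrid joint action and its value.

    offline_actions / online_actions: one proposed action per agent.
    centralized_value: callable scoring a joint action (a tuple of actions).
    Enumerating every mix is an assumption that only suits small teams.
    """
    if len(offline_actions) != len(online_actions):
        raise ValueError("both proposal lists need one action per agent")
    best_joint, best_value = None, None
    for mask in product((0, 1), repeat=len(offline_actions)):
        # mask[i] == 1 takes agent i's online proposal, 0 takes the offline one.
        joint = tuple(
            online_actions[i] if use_online else offline_actions[i]
            for i, use_online in enumerate(mask)
        )
        value = centralized_value(joint)
        if best_value is None or value > best_value:
            best_joint, best_value = joint, value
    return best_joint, best_value


# Toy centralized critic: prefers joint actions close to a hidden target.
target = (1, 2, 4)


def toy_value(joint):
    return -sum((a - t) ** 2 for a, t in zip(joint, target))


best_joint, best_value = compose_joint_action([0, 2, 5], [1, 3, 4], toy_value)
```

In this toy case the selected joint action takes the online proposal for the first and third agents and the offline proposal for the second, which neither pure policy would have produced.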

[AI-225] Coherence Under Commitment: Probing Generalization and Vacuous Memorization in LLM Logical Reasoning

Link: https://arxiv.org/abs/2606.21083
Authors: Noor Islam S. Mohammad, Mahmudul Hasan
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) deployed for logical reasoning in knowledge-intensive domains exhibit a subtle but critical failure: coherence can be vacuously achieved through systematic abstention. A model that withholds commitment to either entailment or refutation satisfies negation consistency while providing no utility. We introduce Coherence Under Commitment (CUC), a dual-query evaluation paradigm that jointly measures consistency and decisiveness. CUC contributes three innovations: (1) a commitment score $c(\varphi) = p(\varphi) + p(\lnot\varphi)$ quantifying probability mass allocated to decisive outcomes; (2) a deterministic elicitation protocol via normalized YES/NO log probabilities, eliminating sampling variance; and (3) a 3-way decision framework (True/False/Uncertain) operationalizing the coherence-commitment trade-off into metrics. Experiments on four open-weight LLMs (1B-3B) across 204 FOLIO examples expose a sharp frontier. Qwen2.5-3B achieves near-zero contradiction ($\mathbb{E}[v_{\mathrm{neg}}]=0.025$) but only 7.4% coverage, while TinyLlama-1.1B reaches 79.4% coverage with violations on every example. Coherence-only evaluation would rank the abstaining model first; CUC exposes this as vacuous, and the frontier generalizes to LogiQA v2 ($\rho=0.97$). We argue that evaluation must report both coherence and non-vacuous commitment and release a toolkit for standardized assessment.

[AI-226] An Efficient and Effective Architecture for Large-Scale Traffic Prediction via Geometry-Adaptive Square Partitioning

Link: https://arxiv.org/abs/2606.21072
Authors: Yongfeng Su, Hongwen Li, Zijian Zhang, Ziquan Fang, Lu Chen, Christian S. Jensen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Traffic prediction is a core task in intelligent transportation systems and urban-scale decision making. Despite the effectiveness of mainstream neural-network-based methods, their deployment in real-world settings with thousands of traffic sensors is jeopardized severely by their poor computational scalability. To address this, the community has attempted to incorporate spatial database partitioning techniques (e.g., Grid, Quadtree, and K-D Tree) to improve model scalability. However, these approaches rely on handcrafted geometric heuristics and often produce irregular or imbalanced data partitions, leading to boundary fragmentation, excessive padding overheads, and degraded model accuracy. In this paper, we propose SqLinear, an efficient and effective architecture for large-scale traffic prediction. First, we design Square Partition, a geometry-adaptive algorithm that partitions massive traffic sensors into balanced, non-overlapping, and near-square spatial regions. Unlike existing heuristic-based designs, Square Partition is theoretically grounded and provides provable guarantees on aspect ratio, balance, and partition utilization, establishing a high-quality foundation for downstream spatiotemporal modeling. Next, we propose a Hierarchical Linear Interaction (HLI) module that abandons the costly attention mechanisms commonly used in Transformer-based spatio-temporal models. HLI efficiently captures both local intra-region dynamics and global inter-region dependencies through a lightweight linear interaction scheme, enabling effective spatiotemporal modeling with linear computational complexity. Extensive experiments on four large-scale traffic datasets and 10 baselines show that SqLinear reduces MAE by 2.30% on average under the standard setting and by 5.81% under extreme scalability settings, while reducing training runtime by 13.27%–30.84% in spatial- and horizon-scaling scenarios.

[AI-227] Local LLM Agents as Vulnerable Runtimes: A Source-Code Audit of the Agent Runtime Layer

Link: https://arxiv.org/abs/2606.21071
Authors: Zhengsong Zhang, Zongze Li, Jiawei Guo, Haipeng Cai
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Local LLM agents such as OpenClaw and Nanobot run on end-user machines and act on host resources - the shell, filesystem, browser, stored credentials, and messaging applications - through natural-language goals. These agents have become privileged software runtimes that mediate between user intent, model outputs, and host-level actions. Existing research characterizes the landscape through prompt injection, malicious skills, marketplace risks, or black-box evaluation of agents. But the implementation layer that performs this mediation, the prompt builder, parser, tool dispatcher, skill loader, memory writer, network client, and permission gate, has remained an unexamined safety boundary. To our knowledge, no prior work has examined the agent’s source tree to audit these components for implementation-level security weaknesses. We present CLAWAUDIT, a static auditing framework for measuring vulnerability exposure in local LLM agent runtimes. CLAWAUDIT derives a five-category vulnerability taxonomy from STRIDE and develops custom static-analysis rules that target agent-specific patterns absent from established rule sets for vulnerability analysis. We instantiate the taxonomy in two backends, 47 Semgrep YAML rules and 30 CodeQL queries, and evaluate on OPENCLAWBENCH, a benchmark of 446 source-code-level advisories from the OpenClaw repository, split temporally into 229 rule-derivation (train) and 217 held-out (test) advisories. On the held-out test set, CLAWAUDIT raises Semgrep recall from 21.7% (Pro baseline) to 66.8%, and CodeQL recall from 13.8% (security-extended) to 75.1%. Train/test gaps remain within 4 percentage points for all four configurations, indicating that the rules generalize to vulnerabilities unseen during rule writing. A preliminary live-code audit shows that these recall-oriented rules require manual triage, motivating semantic filtering before production deployment.

[AI-228] Imitation Learning for Elder-Facing Speech Synthesis INTERSPEECH2026

Link: https://arxiv.org/abs/2606.21053
Authors: Dongrui Han, Weidong Chen, Jiawen Kang, Mingyu Cui, Helen Meng, Xixin Wu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: accepted by Interspeech 2026

Click to view abstract

Abstract:Recent advances in text-to-speech (TTS) synthesis have achieved highly natural and expressive speech generation. However, these systems are designed for general adults and overlook older adults’ speech comprehension needs due to age-related sensory and cognitive decline. Prior work involves older adults by collecting preference feedback to tune model parameters. However, obtaining sufficient preference data is costly and difficult, as older adults quickly become fatigued during collection. In this paper, we propose a novel imitation learning (IL) framework to learn TTS models from expert demonstrations. We further improve Group Relative Policy Optimization (GRPO) with two-stage on-policy reward learning (OPRL) to mitigate reward hacking under limited supervision from expert demonstration. Experimental results show that GRPO w/ OPRL outperforms GRPO and supervised baselines in objective and subjective metrics. Audio samples are available at this https URL

[AI-229] Backdoor Attacks on Speech Emotion Recognition via TTS-Generated Poisoning

Link: https://arxiv.org/abs/2606.21052
Authors: Yongbin Huang, Xihao Xie, Jia Zhang
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted by IEEE Cyber AI 2026. This is the author preprint version

Click to view abstract

Abstract:Speech Emotion Recognition (SER) systems increasingly leverage self-supervised acoustic representations, yet their vulnerability to training-time attacks remains largely underexplored. This paper presents the first systematic study of poisoning-based backdoor attacks on SER, with a focus on threats enabled by text-to-speech (TTS) generated audio. We introduce a stealthy, low-energy acoustic trigger that can be embedded imperceptibly into both natural and synthetic speech, enabling scalable and consistent poisoning. Our experiments demonstrate that SER models can be reliably compromised with high attack success rates under low poisoning ratios, while maintaining near-clean performance on benign inputs. We further show that backdoor patterns exhibit strong cross-model transferability and that self-supervised representations are particularly susceptible to learning these triggers. These findings reveal that TTS technology dramatically lowers the barrier to effective backdoor attacks, exposing critical vulnerabilities in modern SER pipelines and motivating the urgent need for dedicated defenses.

[AI-230] Structure-Aware Graph Multi-Task Learning for Dynamic Sparse OD Demand Prediction

Link: https://arxiv.org/abs/2606.21022
Authors: Ming Xu, Jiawei Cao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Origin-Destination (OD) demand prediction is fundamental to intelligent transportation systems, yet real-world OD flows are often dynamically sparse, long-tailed, and characterized by heterogeneous zero-flow patterns. These properties make it difficult to distinguish whether an OD connection is active from how much demand it generates once activated. Many existing methods primarily treat OD prediction as a single flow regression task, which limits their ability to model low-frequency, intermittent, and long-tailed OD interactions. To address these challenges, we propose SAGMTL, a Structure-Aware Graph Multi-Task Learning framework for dynamic sparse OD demand prediction. SAGMTL decomposes OD prediction into structural state modeling and flow intensity estimation, jointly learning regional activity states, OD connection activity, and edge-level flow intensity within a unified framework. Specifically, a node-edge collaborative representation module captures regional semantics, temporal dynamics, and spatial priors through interactive node-edge updates, producing structure-aware representations for dynamic OD interactions. Based on these representations, SAGMTL estimates OD flows by jointly modeling stable demand patterns and short-term fluctuations. A multi-constraint objective further improves sparsity awareness and structural consistency. Experiments on three real-world urban mobility datasets from Beijing, Chengdu, and Nanjing show that SAGMTL achieves superior overall performance compared with state-of-the-art baselines. Further analysis demonstrates that explicitly modeling regional activity, connection states, and flow intensity improves the robustness of dynamic sparse OD demand prediction.

[AI-231] LK Jam: System Architecture and Implementation of a Real-Time Human-AI Interactive Music Generation System using Role-Aware GRU

Link: https://arxiv.org/abs/2606.21018
Authors: Yakun Liu, Zhiyu Jin, Dong Liu, Hai Luan
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 10 figures, 3 tables. This is an original technical report on real-time human-AI interactive symbolic music generation VST3 plugin based on GRU and JUCE. The source code is open-source on GitHub

Click to view abstract

Abstract:As artificial intelligence advances into the era of Embodied AI, live musical interaction urgently needs to break free from the limitations of offline, unidirectional generation, achieving a “virtual synergy” capable of low-latency, dynamic interplay. To address this, this technical report presents LK_Jam, a real-time, bidirectional human-computer interactive music generation system based on a lightweight Gated Recurrent Unit (GRU) and a high-performance audio host architecture. In the algorithmic representation layer, this system abandons the computationally expensive fixed time-grid. Instead, it constructs a multi-dimensional sparse event stream integrating time-shifts, continuous harmonic embeddings, and role-aware encoding, enabling the model to accurately capture turn-taking logic and micro-timing in a single-step inference. In the engineering implementation layer, this paper builds a strict multithreaded lock-free communication bridge using C++ and the JUCE framework, incorporating the RTNeural inference engine designed specifically for real-time audio. By utilizing compile-time network topology solidification and a zero-allocation (allocation-free) mechanism, the end-to-end overhead of autoregressive decoding is strictly locked at $O(1)$ complexity, structurally mitigating the risk of audio thread dropouts in DAW plugin environments. Furthermore, this study designs a three-stage progressive training strategy, achieving a leap from basic chord harmonization to expert-level interaction. Preliminary observations and architectural analysis demonstrate that while ensuring musical coherence and interactive role-play, the proposed system successfully challenges extreme real-time engineering constraints, offering a highly robust and deployable technical paradigm for next-generation AI co-performers in live music.

[AI-232] The AI Evaluability Gap: The Missing Layer for Managing Risk and Sustaining Value

Link: https://arxiv.org/abs/2606.21015
Authors: Vishal Srivastava, Tanmay Sah
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 9 figures. Conceptual framework paper introducing the AI Evaluability Gap, Evaluability as evidence sufficiency for governance decisions, Operational Certification, Investment Certification, and a six-property evidence lifecycle for AI governance

Click to view abstract

Abstract:Organizations deploying AI face two fundamental governance challenges: managing AI risk and sustaining AI value. Both depend on evidence whose sufficiency cannot be taken for granted. We call the shared underlying challenge the AI Evaluability Gap: the condition in which organizations lack sufficient evidence to support high-confidence governance decisions regarding either risk or value. We argue that this gap reflects a category error in current practice. Existing governance approaches focus primarily on properties of systems, such as safety, fairness, reliability, compliance, and value, while paying comparatively little attention to the evidentiary foundations required to justify decisions about those properties. We further argue that AI governance encompasses both operational decisions regarding whether a system may operate and investment decisions regarding whether it merits continued organizational resources. To address this problem, we introduce Evaluability, defined as the capability of a system to generate, maintain, and renew evidence sufficient to support high-confidence governance decisions over time. We formalize governance decisions as functions of calibrated confidence Conf(D|E) and identify six properties of evaluable evidence: observability, attributability, intervenability, verifiability, calibration, and temporal validity. The framework distinguishes Operational Certification, which relies primarily on structural evidence to justify deployment decisions, from Investment Certification, which relies primarily on causal evidence to justify continued resource allocation. We argue that evidence sufficiency is a missing layer of AI governance and that closing the AI Evaluability Gap is a prerequisite for both managing risk and sustaining value in AI-enabled organizations.

[AI-233] Agentic Time Machine as an Infrastructure for Future-Event Forecasting

Link: https://arxiv.org/abs/2606.21013
Authors: Jingyi Chai, Bingyang Zheng, Xiangrui Liu, Hao Lu, Zihang Zhou, Tianchen Wang, Kemeng Zhang, Siheng Chen
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Forecasting future events is a critical challenge for large language model (LLM) agents, spanning domains from elections and monetary policy to financial markets. However, evaluating progress on this task presents a fundamental trade-off between efficiency and environment fidelity. While live evaluation benchmarks suffer from an inherently slow feedback loop, existing retrospective replays typically restrict agents to static, pre-frozen databases that sacrifice the environmental realism of actual deployments. To tackle this issue, we introduce Agentic Time Machine (TM), an infrastructure that approximately reconstructs the web state at any chosen past time by filtering post-cutoff content. Leveraging this evaluation infrastructure, we further propose a planner-solver-aggregator multi-agent framework that breaks each question into diverse analytical angles, gathers evidence in parallel, and combines the results into a single forecast. Experiments show that offline scores under TM correlate strongly with live FutureX scores, validating that TM offers a fast and reliable sandbox for forecasting-agent evaluation. On FutureX-Past and Polymarket evaluated under TM, our framework achieves the highest score among strong closed-book, tool-augmented, and self-consistency baselines. On the official FutureX live leaderboard, our system achieves the best average rank over four consecutive weeks, including 1st place in May Week 1. As of June 17, it also ranks 1st on FutureX’s official eight-week overall leaderboard.
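
The core idea of reconstructing a past web state by filtering post-cutoff content can be illustrated with a small sketch. Everything here is hypothetical: the `filter_pre_cutoff` helper, the `published` field, and the choice to drop undated documents are assumptions for illustration, and the abstract does not describe how the actual infrastructure dates or filters pages.

```python
from datetime import datetime, timezone


def filter_pre_cutoff(documents, cutoff):
    """Keep only documents published at or before the cutoff time.

    Documents with no known publication time are dropped, on the assumption
    that leaking future information is worse than losing some evidence.
    """
    kept = []
    for doc in documents:
        published = doc.get("published")
        if published is not None and published <= cutoff:
            kept.append(doc)
    return kept


# Hypothetical retrieved pages for a question asked "as of" 1 May 2026.
cutoff = datetime(2026, 5, 1, tzinfo=timezone.utc)
pages = [
    {"url": "https://example.org/a",
     "published": datetime(2026, 4, 20, tzinfo=timezone.utc)},
    {"url": "https://example.org/b",
     "published": datetime(2026, 5, 3, tzinfo=timezone.utc)},
    {"url": "https://example.org/c", "published": None},
]
visible = filter_pre_cutoff(pages, cutoff)
```

Only the first page survives the filter, so an agent evaluated in this sandbox would see the same evidence it could have seen at the cutoff.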

[AI-234] Closure of Self-Determining System Based on Causal and Constitutive Relations

Link: https://arxiv.org/abs/2606.21010
Authors: Yoshiyuki Ohmura, Earnest Kota Carr, Yasuo Kuniyoshi
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A self-determining system is defined as one in which causes originating within the system influence the system itself. This definition raises the question of how to specify system boundaries. Although the concept of “closure” is commonly used for this purpose, defining boundaries solely in terms of causal relations introduces challenges, such as how to handle external causes and circular causality. To address this issue, we introduce two types of asymmetric relations: causal and constitutive. We propose that system boundaries can be defined as closures of loops formed by these relations, referred to as causal-constitutive loops. By constraining constitutive relations, the resulting system necessarily includes internal causes and thereby satisfies self-determination. Furthermore, to prevent reduction to supervenience, constitutive relations must involve at least two independent variables. This minimal requirement leads to two interdependent loops, which implies a dual-process organization.

[AI-235] BioInsight: Multi-Agent Orchestration for Interactive Biomedical Knowledge Discovery

Link: https://arxiv.org/abs/2606.20997
Authors: Jieyi Wang, Bingxuan Li, Nanyi Jiang, Desong Meng, Zirui Fan, Yuxin Guo, Jiayu Liu, Kunlun Zhu, Eddie Yang, Xiusi Chen, Pan Lu, Bingxin Zhao
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Biomedical researchers increasingly use AI-generated analyses and reports to interpret protein-level signals, but static outputs are often insufficient for research decision-making, where users need to inspect evidence, assess uncertainty, compare mechanisms, and refine hypotheses. We present BioInsight, a multi-agent system that moves from static biomedical report generation to interactive, evidence-centered interface generation. Given a disease name, a protein association table, and optional cohort metadata, BioInsight organizes disease-specific evidence through typed intermediate artifacts, including ranked pathways, literature evidence packets, protein-level reasoning notes, citation-grounded reports, dashboard schemas, and rendered interactive interfaces. The system decomposes evidence retrieval from mechanistic reasoning, normalizes citations through deterministic components, and converts the same structured evidence used in the report into an interactive interface. We evaluate BioInsight on standardized biomedical QA, challenging protein-function reasoning, and end-to-end biomedical evidence synthesis. Results show that BioInsight achieves the best performance, and suggest that biomedical AI systems should move beyond text-only and static reports toward provenance-preserving, interactive evidence artifacts.

[AI-236] Text-to-Image Generative AI for Modeling and Simulation: Methods, Opportunities, and Applications

Link: https://arxiv.org/abs/2606.20991
Authors: Philippe J. Giabbanelli
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
Comments: To appear at the 2026 Winter Simulation Conference

Click to view abstract

Abstract:Text-to-image generation is a form of generative artificial intelligence (GenAI) that converts textual descriptions into images. Most applications of GenAI in modeling and simulation (M&S) have focused on large language models for documentation, coding, or explanation. By contrast, the potential of image generation remains largely unexplored. This tutorial introduces text-to-image generation to the M&S community and details how it can support several M&S tasks, including communicating conceptual models, visualizing simulation outcomes, generating educational materials, and interfacing heterogeneous models in multi-scale simulations. The tutorial combines conceptual guidance with practical workflows, explaining how modern image generators operate, how prompts and simulation outputs can be translated into visual scenes, and how practitioners can integrate these tools into reproducible local pipelines. By focusing on transferable principles rather than specific tools, the tutorial equips M&S practitioners with the knowledge needed to evaluate, adopt, and adapt text-to-image generation in their simulation workflows.

[AI-237] Detecting Satellites in Radio-Frequency Data via Semi-Supervised Learning

Link: https://arxiv.org/abs/2606.20976
Authors: Cade W. Trotter, Maksim E. Eren, Justin C. Holmes, J. Brent Parham, David Ewing, Boian S. Alexandrov, Gian Luca Delzanno
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Radio-frequency (RF) monitoring is essential for space domain awareness, but it often generates large, variable, and sparsely populated datasets with few labels. These observations can capture satellites, space debris, and the ionospheric background, yet interpreting them typically requires specialized subject-matter expertise. Supervised deep learning methods can perform well on labeled RF data, but they require many annotated examples and may need careful retraining as RF conditions change. Semi-supervised approaches offer a practical alternative for limited-data settings by using unlabeled observations to reveal latent patterns that experts can interpret. In this paper, we present a semi-supervised RF detection and classification workflow for satellite monitoring that combines Non-negative Matrix Factorization with automatic model determination (NMFk), expert-guided cluster interpretation, and classifier-based prediction. We first represent RF observations as a non-negative feature matrix and apply NMFk to estimate the number of clusters that best captures patterns in the unlabeled data. Subject-matter experts then assign physical meaning to the resulting clusters, including satellite detections, ionospheric environmental conditions, and other RF event categories. Finally, we train a classifier on these interpreted clusters to evaluate performance on a test set and categorize future observations. This pipeline reduces reliance on large pre-labeled datasets by pairing unsupervised factorization with expert interpretation, enabling an interpretable and transferable methodology for detecting, observing, and classifying behavior in RF data.

[AI-238] AutoACSL: Synthesizing ACSL Specifications by Integrating LLMs with CPG-Based Static Analysis

Link: https://arxiv.org/abs/2606.20969
Authors: Han Zhou, Yu Luo, Dianxiang Xu
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 14 pages, 4 figures

Click to view abstract

Abstract:Generating formal specifications for C programs remains a challenge in formal verification due to the manual effort, expertise, and semantic precision required. While recent advancements in large language models (LLMs) offer promise in automating specification synthesis, current approaches often lack semantic depth and produce unverifiable or incomplete contracts. To address these limitations, we introduce AutoACSL, a novel framework that integrates LLM prompting with semantic features extracted from Code Property Graphs (CPGs). AutoACSL performs static analyses to extract key semantic elements, including arithmetic operations, loop and recursion structures, and return value propagation, which are encoded into structured prompts. These prompts enable the LLM not only to generate normal behavioral specifications but also to include constraints that prevent inputs leading to runtime errors. AutoACSL employs a feedback-driven synthesis loop, where candidate specifications are verified using Frama-C/WP and refined iteratively until verification succeeds or a termination condition is met. Evaluated on 604 programs drawn from diverse datasets, AutoACSL achieves a 98% specification generation success ratio and a 96% full proof ratio when paired with Gemini-3. Compared to a code-only baseline, AutoACSL improves the full proof ratio by 24.7% to 51.7% across four LLMs (GPT-o4 Mini, GPT-5.2, Grok-4.1, and Gemini-3), demonstrating that integrating large language models with CPG-based static analysis substantially enhances both generation robustness and verification effectiveness for automated ACSL specification synthesis.
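
The feedback-driven synthesis loop (propose a specification, verify it, refine on failure until success or a termination condition) can be sketched generically. This is not AutoACSL's code: `synthesize_spec`, the `propose` and `verify` callables, the round limit, and the toy stubs are all hypothetical. In a real setup `propose` would call an LLM with the CPG-derived prompt and `verify` would run Frama-C/WP, and neither is modeled here.

```python
def synthesize_spec(program, propose, verify, max_rounds=5):
    """Iterate propose -> verify -> refine until a spec verifies.

    propose(program, feedback) returns a candidate specification string.
    verify(program, spec) returns (ok, feedback_message).
    Returns (spec, rounds_used), or (None, max_rounds) if no candidate verified.
    """
    feedback = None
    for round_index in range(1, max_rounds + 1):
        candidate = propose(program, feedback)
        ok, feedback = verify(program, candidate)
        if ok:
            return candidate, round_index
    return None, max_rounds


# Toy stand-ins: the "verifier" only accepts specs that carry a precondition.
def toy_propose(program, feedback):
    if feedback is None:
        return "/*@ ensures \\result >= 0; */"
    return "/*@ requires x >= 0; ensures \\result >= 0; */"


def toy_verify(program, spec):
    if "requires" in spec:
        return True, None
    return False, "postcondition unproved: missing precondition on x"


spec, rounds = synthesize_spec("int f(int x) { return x; }",
                               toy_propose, toy_verify)
```

The first candidate fails, the verifier's message is fed back, and the second candidate adds the precondition that rules out the failing inputs, mirroring the abstract's point about constraints that prevent runtime errors.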

[AI-239] Generative Responsible AI Data Evaluation Schema (GRAIDES) for AI Assurance in Local Government

Link: https://arxiv.org/abs/2606.20963
Authors: Ethan Knights,Christopher Conlan,Temilorun Gbolahan,Stephen Waterman,Gurpreet Muctor
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
Comments:

Abstract:Trust in the application of generative Artificial Intelligence (AI) relies on well-governed measurable evidence of performance and safety. In practice, however, evaluation data is often fragmented across systems, inconsistently structured and difficult to compare. We introduce the Generative Responsible AI Data Evaluation Schema (GRAIDES) as a lightweight open-source data model for centralising AI observability across popular vendors. Practical blueprints for code, architecture and statistical evaluation are shared as guidance about how to approach generative system assurance at the organisational level. Illustrative case study results are reported from Westminster City Council’s AI catalogue with a focus on measuring human-model alignment including detecting systematic disagreement between evaluators. By framing evaluations as a data modelling problem, GRAIDES provides a practical pathway toward more consistent and reproducible benchmarking, tuning and assurance activities for generative AI systems.

[AI-240] Is Our Benchmark Enough? An Analysis of Continual Learning for MLLMs ICML2026

Link: https://arxiv.org/abs/2606.20961
Authors: Van-Tuan Tran,Shruthi Gowda,Merim Dzaferagic,Marco Ruffini
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026 Workshop “Continual Adaptation at Scale: Towards Sustainable AI”

Abstract:Continual adaptation is essential for multimodal large language models (MLLMs) deployed across evolving domains, but the state-of-the-art MR-LoRA method relies heavily on the assumption that an MLLM-based router is necessary to process complex multimodal inputs. This paper revisits this assumption on the MLLM-CL benchmark and argues for two claims. First, routing does not require an MLLM: a simple training-free, replay-free prototypical routing method (RePRo) uses frozen pretrained features and task prototypes to match the MLLM-based router of MR-LoRA at far lower computational cost. Second, shared experts do not improve continual learning for MLLMs, despite their theoretical appeal. We show that these findings arise from two structural limitations of MLLM-CL: (1) its tasks are highly separable in representation space, and (2) its fixed task order makes conclusions sensitive to a single curriculum rather than robust across diverse continual-learning trajectories. As a result, the benchmark primarily rewards learning in isolation rather than genuine continual transfer. This motivates a new design for future benchmarks of continual MLLM learning, with overlapping task manifolds, multiple task orders, fine-grained domain shifts, and evaluation protocols that reward forward transfer as well as retention.
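
The prototype-based routing idea described in the abstract can be illustrated with a small sketch. This is not the authors' RePRo implementation: it assumes that a prototype is the mean of a task's frozen feature vectors and that routing picks the most cosine-similar prototype. The function names and the toy 2-D features are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_prototypes(features_by_task):
    """Map each task name to the mean of its feature vectors (no training involved)."""
    return {task: mean_vector(feats) for task, feats in features_by_task.items()}

def route(feature, prototypes):
    """Return the task whose prototype is most cosine-similar to `feature`."""
    return max(prototypes, key=lambda task: cosine(feature, prototypes[task]))

# Toy 2-D vectors standing in for frozen pretrained embeddings (hypothetical data).
prototypes = build_prototypes({
    "task_a": [[1.0, 0.1], [0.9, 0.0]],
    "task_b": [[0.0, 1.0], [0.1, 0.8]],
})
print(route([0.8, 0.2], prototypes))
```

When tasks are well separated in feature space, as the abstract reports for MLLM-CL, a nearest-prototype rule of this kind has little room to fail, which is consistent with the paper's argument about the benchmark.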

[AI-241] Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

Link: https://arxiv.org/abs/2606.20950
Authors: Sergei Trashchenkov
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 19 pages, 1 figure, 2 tables. Code and data: this https URL ; archived at this https URL

Abstract:Executable evaluation – checking the consequences of an agent’s actions with a program rather than grading its prose – has become a prominent way to assess tool-using AI agents in software settings. Electric power engineering has not yet had an analogous benchmark: language-model use is still dominated by retrieval and text question answering, while agents acting on power-system artifacts remain mostly academic prototypes. We introduce the Power Systems Agent Benchmark, an executable benchmark for power-engineering agents. An agent receives a structured task and returns a structured solution; a deterministic evaluator recomputes the engineering quantities, checks operational constraints, and returns a feasibility flag, a normalized score, and explicit violations. The benchmark contains 41 task families across eight areas of power engineering, from power flow and protection to stability, microgrids, reliability, power quality, and forecasting. Each task is grounded in a citable source, standard, or documented engineering formulation. To resist contamination, held-out cases are synthesized on demand by per-family generators from private seeds: the construction is inspectable, but the instances remain private. In a reference evaluation with three command-line agents, the strongest score near the compact tier’s ceiling, a smaller open model trails, and public and held-out performance are broadly consistent; a separate public-split grid with OpenCode and Aider probes harness effects. The reference evaluation doubles as quality control: unanimous failures flag candidate task or evaluator defects, and it exposed a latent evaluator bug missed by self-consistency checks. The evaluators are compact deterministic surrogates, but the task contract allows their internals to be upgraded to simulator-backed checks without changing how tasks are posed or solved.
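
To make the evaluator contract concrete (structured task in, structured solution in, feasibility flag plus normalized score plus explicit violations out), here is a minimal sketch. The benchmark's actual schema is not given in the abstract, so the economic-dispatch task, the field names (`dispatch_mw`, `reference_cost`, and so on), and the scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    feasible: bool
    score: float                      # normalized to [0, 1]
    violations: list = field(default_factory=list)

def evaluate_dispatch(task, solution, tol=1e-6):
    """Recompute quantities from the solution itself and check constraints."""
    dispatch = solution["dispatch_mw"]
    units = task["units"]
    if len(dispatch) != len(units):
        return Verdict(False, 0.0, ["dispatch length does not match number of units"])
    violations = []
    # Constraint 1: every unit stays within its generation limits.
    for i, (p, unit) in enumerate(zip(dispatch, units)):
        if p < unit["p_min"] - tol or p > unit["p_max"] + tol:
            violations.append(f"unit {i}: {p} MW outside [{unit['p_min']}, {unit['p_max']}]")
    # Constraint 2: total generation meets demand.
    imbalance = sum(dispatch) - task["demand_mw"]
    if abs(imbalance) > tol:
        violations.append(f"power balance off by {imbalance:+.3f} MW")
    if violations:
        return Verdict(False, 0.0, violations)
    # Score: reference (best known) cost divided by achieved cost, capped at 1.
    cost = sum(p * unit["cost_per_mwh"] for p, unit in zip(dispatch, units))
    score = min(1.0, task["reference_cost"] / cost) if cost > 0 else 0.0
    return Verdict(True, score, [])

# Hypothetical two-unit task: the cheapest dispatch is [100, 50] at a cost of 4000.
task = {
    "demand_mw": 150.0,
    "reference_cost": 4000.0,
    "units": [
        {"p_min": 0.0, "p_max": 100.0, "cost_per_mwh": 20.0},
        {"p_min": 0.0, "p_max": 100.0, "cost_per_mwh": 40.0},
    ],
}
print(evaluate_dispatch(task, {"dispatch_mw": [50.0, 100.0]}))
```

Because the evaluator is deterministic and depends only on the task and the solution, its internals could later be swapped for a simulator-backed check without changing this interface, which is the upgrade path the abstract describes.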

[AI-242] Short-Term Electricity Demand Forecasting for New England Using a Hybrid Transformer-XGBoost Framework with Weather, Calendar, and COVID-19 Indicators

Link: https://arxiv.org/abs/2606.20918
Authors: Reza Ghanavati,Behrooz Mosallaei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Accurate short-term electricity demand forecasting is critical for reliable power system operation, energy market planning, and infrastructure optimization. This paper presents a hybrid framework combining a Transformer encoder for temporal feature extraction with gradient-boosted decision trees (XGBoost) for daily electricity demand forecasting across New England. The framework integrates meteorological observations from six cities spanning all six New England states, calendar and holiday effects, autoregressive demand lags, and COVID-19 epidemiological variables. Hyperparameter optimization uses Optuna with a multivariate Tree-structured Parzen Estimator over 500 trials, with a leakage-free 70/15/15 chronological train-validation-test split. The hybrid model achieves a test RMSE of 8,876 MWh, MAPE of 2.05%, and R-squared of 0.906. A tabular-only XGBoost baseline achieves RMSE of 9,304 MWh, MAPE of 2.21%, and R-squared of 0.896. A Diebold-Mariano test (Harvey-Leybourne-Newbold correction) confirms the 427.7 MWh difference is statistically indistinguishable from noise (DM = -1.126, p = 0.262). An ablation study reveals COVID-19 features improved training accuracy but had asymmetric test effects: removal degraded hybrid RMSE by 3.2% while marginally improving XGBoost-only by 1.2%. A SHAP temporal analysis shows 5 of 8 COVID features rank higher on the post-acute test set than during pandemic-active training, indicating the model over-applies learned pandemic patterns. These findings establish temporal validity decay as a central mechanism: behavioral disruptions drove a strong COVID-demand signal during 2020-2021, but adaptation was complete by mid-2022, leaving epidemiological features as noise amplifying overfitting to stale pandemic patterns.
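
The evaluation protocol in the abstract (a chronological 70/15/15 split, scored with RMSE and MAPE) can be sketched as follows. The helper names are hypothetical and this is not the authors' code. The point it illustrates is that time-ordered data is sliced rather than shuffled, so no future observation leaks into training.

```python
import math

def chronological_split(rows, train_pct=70, val_pct=15):
    """Slice time-ordered rows into train/validation/test without shuffling."""
    n = len(rows)
    i = n * train_pct // 100                 # end of the training window
    j = n * (train_pct + val_pct) // 100     # end of the validation window
    return rows[:i], rows[i:j], rows[j:]

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (for example, MWh)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no target is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

days = list(range(100))                      # stand-in for 100 daily observations
train, val, test = chronological_split(days)
print(len(train), len(val), len(test))
print(rmse([100.0, 200.0], [110.0, 190.0]), mape([100.0, 200.0], [110.0, 190.0]))
```

Integer percentages are used for the split boundaries to avoid floating-point rounding surprises when computing the cut points.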

[AI-243] Root Cause Analysis with Latent Confounders using Partial Ancestral Graphs

Link: https://arxiv.org/abs/2606.20912
Authors: Henrique O. Caetano,Rafael Arone,Carlos Dias Maciel
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Finding the source of failures, known as Root Cause Analysis (RCA), is essential for identifying the root causes of anomalies and maintaining the reliability of complex systems. While causal theory has advanced data-driven RCA, existing frameworks assume causal sufficiency, failing to account for the unobserved latent variables prevalent in real-world environments. To address this gap, we propose PAG-RCA. This framework models system failures as parametric interventions over Partial Ancestral Graphs (PAGs) to perform RCA in the presence of latent variables. We use standard causal identification algorithms to find the source of failures by quantifying causal effects over the PAG. When an effect is identifiable, candidate root causes are ranked based on their exact intervention effects. When effects are structurally unidentifiable, our framework (for the first time in the RCA literature) integrates partial identification to evaluate and score candidates using analytical causal bounds. By integrating latent variables and partial identification at once our framework ensures robust RCA even under data scarcity and latent-variable scenarios where traditional methods degrade. Evaluations on synthetic data, microservice anomaly benchmarks and power-grid cascading failures demonstrate that PAG-RCA consistently outperforms state-of-the-art data-driven baselines. By improving data-driven RCA performance under data scarcity, this methodology advances reliable automated diagnostics in partially observable complex networks.

[AI-244] Whose Agent Are You? Multi-Layer Fingerprinting and Attribution of Autonomous Web Agents

Link: https://arxiv.org/abs/2606.20910
Authors: Dayeon Kang,Hyejun Jeong,Jade Sheffey,Pubali Datta,Amir Houmansadr
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:As AI web agents proliferate, combining large language models with autonomous, browser-level control, indiscriminate content scraping by web agents has emerged as a privacy and security challenge. Existing defenses, such as robots.txt and active bot-blocking, are insufficient, as they are widely violated and easily circumvented. In this work, we demonstrate that AI web agents can be effectively distinguished from humans and traditional crawlers using a multi-layer fingerprint based on both network layer characteristics (e.g., TLS, HTTP) and browser interaction behavior. We implement this mechanism as a programmatic logging framework that can be deployed on a live, instrumented domain. By analyzing six prominent agent frameworks (AutoGen, Browser Use, Claude, Gemini, Operator, and Skyvern), we uncover latent structural differences in how these systems assemble HTTP requests, establish TLS/HTTP connections, and execute autonomous browser actions. Feeding these multi-layer features into a decision tree classifier, our framework achieves high-fidelity identification (97% accuracy), successfully isolating distinct agent architectures and differentiating agent traffic from both human browsing baselines and legacy crawlers. Our findings demonstrate that cross-layer agent tracking provides a robust, evasion-resistant strategy for content protection and web security policy enforcement.

[AI-245] MMGNN: Multi-level multi-color graph neural networks for molecular property prediction

Link: https://arxiv.org/abs/2606.20906
Authors: Trung Nguyen,Duc Duy Nguyen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
Comments:

Abstract:Molecular message-passing neural networks commonly propagate chemically diverse interactions through a single graph, which may mix interaction-specific signals and require deep propagation to capture long-range effects. We introduce the Multi-level, Multi-color Graph Neural Network (MMGNN), a hierarchical framework that decomposes a molecular graph into overlapping atom-type-pair-specific subgraphs while preserving atom-level resolution. MMGNN-2D constructs chemical-colored subgraphs from covalent connectivity, whereas MMGNN-3D constructs geometric-colored subgraphs from spatial proximity and augments their edges with distance, angular, and torsional descriptors. Both variants apply a shared communicative message-passing backbone to each subgraph and combine the resulting representations through atom-wise aggregation and molecular readout. We evaluated MMGNN on five classification and three regression benchmarks from MoleculeNet using common scaffold splits and five independent runs. MMGNN-2D achieved the highest macro-average AUC-ROC of 0.838 across the classification datasets and the lowest RMSE on ESOL (0.803). MMGNN-3D obtained the highest mean AUC-ROC on BBBP (0.956) and the lowest RMSE on FreeSolv (1.793), indicating complementary strengths of topological and geometric representations. Structural and leave-one-out analyses further illustrate how the subgraph decomposition affects learned representations and atom-type-pair sensitivities. These results support overlapping interaction-specific graph decomposition as a competitive strategy for molecular property prediction.

[AI-246] Vesta: A Generalist Embodied Reasoning Model

Link: https://arxiv.org/abs/2606.20905
Authors: Johan Bjorck,Zhiqi Li,Yunze Man,Jing Wang,An-Chieh Cheng,Sifei Liu,Shihao Wang,Zhiding Yu,Abhishek Badki,Stan Birchfield,Valts Blukis,Yevgen Chebotar,Siyi Chen,Sicong Leng,Yu-Cheng Chou,Tianli Ding,Boyi Li,Zhengyi Luo,Hang Su,Jonathan Tremblay,Tingwu Wang,Bowen Wen,Jimmy Wu,Xianghui Xie,Hanrong Ye,Hongxu Yin,K.R. Zentner,Liangyan Gui,Yu-Xiong Wang,Yuke Zhu,Linxi “Jim” Fan,Jan Kautz
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Robots operating in open-world environments must seamlessly integrate localization, spatial reasoning, navigation, and long-horizon planning. While specialist models excel at individual tasks, deploying a multi-model stack is computationally expensive and prone to cascading errors. We present Vesta, a unified embodied generalist that consolidates these capabilities into a single foundation model. Our approach combines a diverse and massive curated corpus designed to induce spatial grounding and a simple multimodal memory harness that enables reasoning over extended time horizons. Across diverse benchmarks, Vesta on average beats individual SOTA baselines by 20% and beats an ensemble of per-category-best baselines by 10% – thus demonstrating that a generalist model can match or exceed specialists. On real-world robotic tasks requiring memory and reasoning, Vesta improves task success by 35%. Our work thus demonstrates that a single generalist is a feasible, scalable, and arguably preferable alternative to combining specialists.

[AI-247] Neurosymbolic Clinical Trial Matching via LLM-Driven Abduction and Logical Verification

Link: https://arxiv.org/abs/2606.20895
Authors: Baiyang Qu,Leonardo Ranaldi,Xi Wang,Marco Valentino
Subjects: Artificial Intelligence (cs.AI)
Comments: 21 pages (including appendix), 5 figures, 9 tables

Abstract:Large Language Models (LLMs) offer a promising path to automate Clinical Trial Matching (CTM), but still struggle with the deterministic verification required for complex eligibility criteria. Conversely, purely symbolic methods provide formal rigour but break down when faced with incomplete patient records and noisy clinical evidence. To bridge this gap, we investigate a hybrid framework for CTM combining LLMs with logical verification. In particular, we introduce an abductive neurosymbolic CTM framework (αNeSy-CTM), which leverages the linguistic and world knowledge in LLMs to support reasoning over noisy and underspecified clinical text. Extensive evaluation demonstrates that αNeSy-CTM substantially outperforms standalone LLM baselines, achieving up to 30% relative improvement over zero-shot baselines. In addition, our analyses confirm the impact of abductive reasoning on CTM, with αNeSy-CTM exhibiting improved accuracy, specificity, and robustness over a non-abductive neurosymbolic setting. Furthermore, αNeSy-CTM and Chain-of-Thought (CoT) reasoning prove highly complementary, highlighting the potential for a hybrid routing policy. Ultimately, this paper demonstrates the impact of neurosymbolic methods for automating CTM, providing a path toward the next generation of auditable, LLM-driven clinical applications.

[AI-248] Exploiting Neural Audio Codec Latents for Adversarial Audio Attacks INTERSPEECH2026

Link: https://arxiv.org/abs/2606.20893
Authors: Sameek Bhattacharya,Bharath Krishnamurthy,Ajita Rattani
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted to Interspeech 2026. 5 Pages with references containing 2 figures and 4 tables. Code is available at this https URL or this https URL

Abstract:Deep learning-based audio classification systems, including automatic speaker verification, are vulnerable to adversarial attacks. Realistic real-time threat assessment remains difficult because optimization-based methods, such as projected gradient descent (PGD) and Carlini-Wagner, require costly iterative updates in the high-dimensional waveform domain. Generative attacks allow single-shot synthesis but often introduce perceptible artifacts or depend on computationally intensive architectures, while diffusion and autoregressive approaches incur high inference latency. To address this gap, we propose a generative attack framework operating in the continuous latent space of a neural audio codec. A conditional generator synthesizes class-specific perturbations in a single forward pass and decodes them into adversarial waveforms. Our method achieves targeted attack success rates up to 99% with sub-7 ms inference, outperforming generative baselines while reducing latency by 24x.

[AI-249] The Substrate Collapse: AI Code Generation Invalidates Authorship-Based Knowledge Metrics

Link: https://arxiv.org/abs/2606.20882
Authors: Brett Wheeler
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, no figures. Position paper; states a falsifiable prediction and deliberately leaves construction of the comprehension-grounded instrument open

Abstract:Software engineering has long inferred where a system’s knowledge resides from who authored its code. The truck factor, the Degree-of-Authorship metric, and the degree-of-knowledge model all rest on one inference – that authoring a region of code is evidence of understanding it – and for most of software’s history it was a workable proxy, because code entered a repository only when a human wrote it, which forced at least transient understanding. This paper argues that AI code generation severs that inference at its root, and that the consequence is not the degradation of the authorship-based metrics but their invalidation as a class. When an agent generates a module and a human merges it, the version-control record still attributes authorship, but the attribution no longer licenses any conclusion about comprehension: the same footprint is now compatible with full, partial, or no understanding. The metric still returns a number; the number measures a substrate that has come uncoupled from the quantity it was used to estimate. The collapse is corroborated by the field’s own measurement failures, and the methodological corollary is load-bearing: the instrument the comprehension-debt era needs cannot be built by refining the knowledge-concentration metrics, because no function of an authorship footprint recovers an inference the footprint no longer supports. The replacement must be grounded in evidence of comprehension rather than authorship. I state a falsifiable prediction that discriminates the two – that systems with a healthy authorship-derived truck factor but low comprehension-measured retention will suffer incident-resolution failures the authorship metric does not predict – and argue that building the comprehension-grounded instrument at the scale of a system and a team is the field’s open measurement problem, left open here.

[AI-250] When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study

Link: https://arxiv.org/abs/2606.20881
Authors: Xiaolong Jin,Xuandong Zhao,Wenbo Guo,Xiangyu Zhang,Dawn Song
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 45 figures

Abstract:Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in large language model reasoning, but relies on ground-truth supervision that is costly or infeasible, especially in coding tasks. Recent work addresses this by deriving rewards from a model’s own signals, such as majority voting or confidence-based scores, achieving notable success on mathematical reasoning benchmarks. However, code generation poses distinct challenges: programs are structurally complex, semantically equivalent solutions may differ syntactically, and verification typically requires execution. Whether these intrinsic reward methods transfer effectively to code remains unexplored. In this work, we present a systematic empirical study of intrinsic reward methods for code generation. We conduct extensive experiments on LiveCodeBench, systematically evaluating representative certainty-based Reinforcement Learning from Internal Feedback (RLIF) approaches under different training scenarios and hyperparameter settings. Our experiments reveal that certainty-based methods yield early gains but inevitably collapse: models progressively shorten outputs and lose reasoning capability, with collapse speed sensitive to sample size and temperature. When used to initialize RLVR training, RLIF pre-training offers no significant improvement over training from scratch. We also provide actionable recommendations for using intrinsic rewards for training code reasoning models. Our study shows both the promise and limitations of intrinsic reward methods for code, informing future work on code models and agents.
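
The two families of intrinsic reward named in the abstract, majority voting and confidence-based scores, can be sketched in a few lines. This is an illustrative assumption about their simplest form, not the specific RLIF methods the paper evaluates. The function names are hypothetical.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Reward 1.0 for each sample that agrees with the most common answer, else 0.0.

    No ground truth is used: the model's own samples define the pseudo-label.
    """
    if not answers:
        return []
    majority, _count = Counter(answers).most_common(1)[0]
    return [1.0 if answer == majority else 0.0 for answer in answers]

def mean_logprob_reward(token_logprobs):
    """Certainty-style reward: average log-probability the model gave its own tokens."""
    return sum(token_logprobs) / len(token_logprobs)

# Three sampled final outputs for the same prompt (hypothetical values).
print(majority_vote_rewards(["42", "42", "7"]))
print(mean_logprob_reward([-0.5, -1.5]))
```

For code, exact string agreement is a weak signal because semantically equivalent programs may differ syntactically, which is one of the challenges the abstract raises. A practical variant would probably compare program outputs on shared inputs instead of program text.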

[AI-251] Machine Learning Classification of Cryopathy Syndromes: A Comprehensive Comparative Study

Link: https://arxiv.org/abs/2606.20874
Authors: Nataliya Shakhovska,Valentyna Chopyak,Ivan Izonin,Vira Haievska
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 37 pages, 12 figures

Abstract:Cryopathy syndromes are difficult to classify because laboratory patterns often overlap across diagnostic categories, while some diagnoses are rare. This makes routine interpretation of cryoglobulin-related tests challenging and increases dependence on expert judgment. The aim of this study was to develop and compare machine learning approaches for automated classification of cryopathy syndromes from laboratory data and to identify a practical strategy for clinical decision support. Methods: We analysed laboratory records from 2,686 patients assigned to 14 diagnostic categories. The dataset included demographic variables, cryoglobulin measurements, precipitation tests, and hemagglutinin and hemolysin titers. Data preprocessing included cleaning, encoding, imputation, normalization, and construction of clinically informed interaction features. We evaluated 12 modelling strategies, including Random Forest, Gradient Boosted Trees, Multi-Layer Perceptron, soft-voting ensembles, class balancing with Synthetic Minority Over-sampling Technique, hierarchical classification, period-aware models, targeted binary classifiers, and probability calibration. Performance was assessed using stratified train-test evaluation and stratified 5-fold cross-validation. The main metrics were macro-averaged F1 score, accuracy, Top-3 accuracy, and expected calibration error. The overall task proved difficult because of marked class imbalance and clinical overlap between diagnoses. The best multiclass performance was achieved by a soft-voting ensemble of Random Forest and Gradient Boosted Trees. Cross-validation confirmed stable performance for the balanced Random Forest model. Tree-based methods consistently outperformed the neural network model. Feature engineering improved discrimination, and the most informative predictors were derived cryoglobulin-based interaction features.
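
The abstract uses macro-averaged F1 as a main metric under marked class imbalance. A small sketch shows why that choice matters: macro F1 averages per-class F1 with equal weight, so ignoring a rare class is penalized even when accuracy looks good. The labels below are toy values, not the study's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced toy labels: the classifier always predicts the common class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here accuracy is 0.8 while macro F1 is about 0.44, because the rare class contributes an F1 of zero.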

[AI-252] A3C3: AI Algorithm and Accelerator Co-design, Co-search, and Co-generation

Link: https://arxiv.org/abs/2606.20869
Authors: Selin Yildirim,Yingbing Huang,Deming Chen
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a holistic methodology for artificial intelligence algorithm and accelerator co-design, co-search, and co-generation (A3C3), which jointly optimizes neural network architectures and their hardware implementations to address the inefficiencies of traditional top-down AI system design flows. Conventional AI deployment often treats model design and hardware mapping as separate stages: an algorithm is first developed for accuracy, and only afterward adapted to meet latency, throughput, energy, or resource constraints. This separation can lead to suboptimal systems, particularly as modern AI workloads become increasingly heterogeneous, memory-intensive, and platform-dependent. A3C3 instead parameterizes both algorithmic and accelerator design spaces and searches them jointly, enabling the automatic generation of model-accelerator pairs that better balance accuracy, latency, throughput, energy efficiency, and hardware utilization. This article is a book chapter of the Handbook of Embedded Machine Learning, edited by Sudeep Pasricha and Muhammad Shafique, Springer Nature.

[AI-253] Can LLMs Reason About Brand Ownership? An Empirical Study of Domain Attribution Intelligence

Link: https://arxiv.org/abs/2606.20868
Authors: Fathima Mashood,Mohamed Nabeel
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: llm, brand intelligence, whois, search, squatting domains, brand domains

Abstract:When a new domain resembling a popular brand appears, defenders face a fundamental ambiguity: it may be an attacker-created squatting site for phishing, or it may be a domain the brand itself registered, either defensively, to block attackers, or legitimately, for a new product or service launch. Incorrectly flagging a brand-owned domain as malicious produces a false positive that harms end users and damages the brand’s reputation. Resolving this ambiguity requires brand intelligence: the ability to determine, at scale, whether a given domain belongs to a brand. Large language models (LLMs), with their broad knowledge of brand domain relationships, offer a promising zero configuration approach to this problem, but their reliability for brand intelligence tasks remains unknown. We present the first systematic empirical evaluation of LLM brand intelligence across three tasks: domain enumeration (Q1), open ended brand attribution (Q2), and binary ownership classification (Q3). We evaluate four models, Gemini 2.5 Flash, Gemini 3.5 Flash, Claude Sonnet 4.5, and Claude Sonnet 4.6, across four retrieval settings (in context, web search, WHOIS lookup, and combined) on 36 of the most phished brands. Our results reveal a stark dichotomy: models achieve up to 82% precision enumerating brand domains from memory alone, yet fail at ownership verification without external tools, with macro F1 at most 0.37 in ICL mode. WHOIS augmentation lifts Q3 macro F1 by up to 0.65 points, yielding near perfect precision (= 0.99), dramatically reducing the false positive risk for defenders. We provide concrete recommendations for deploying LLMs in brand protection pipelines.

[AI-254] SignVLA: Real-Time Sign Language-Guided Robotic Manipulation via Attention LSTM and Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.20857
Authors: Ningwei Bai,Xinyu Tan,Harry Gardner,Zhengyang Zhong,Liuhaichen Yang,Luoyu Zhang,Zhekai Duan,Monkgogi Galeitsiwe,Zezhi Tang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Vision-Language-Action (VLA) models enable robots to execute manipulation tasks from natural-language instructions grounded in visual observations. However, existing VLA interfaces primarily rely on speech or text input, limiting accessibility for deaf, hard-of-hearing, and speech-impaired users. We present SignVLA, a real-time sign-language-guided VLA framework for accessible human-robot interaction. The system introduces a modular sign-to-text interface that converts visual sign gestures into semantic instructions compatible with downstream VLA policies. Given video streams, SignVLA extracts hand landmark features and employs an attention-enhanced Long Short-Term Memory (LSTM) network to capture temporal gesture dynamics for alphabet- and command-level sign recognition. A temporal stabilization module further improves prediction consistency in real-time interaction. The generated instruction sequence is then passed to a downstream VLA policy for sign-conditioned robotic manipulation. Experimental results demonstrate stable real-time sign recognition and successful execution of manipulation tasks driven by sign-language inputs. Our findings suggest that lightweight temporal sign recognition can serve as an effective and practical accessibility layer for multimodal embodied intelligence.

[AI-255] Study on Quantitative Dynamic Epistemic Logic for Belief Revision

Link: https://arxiv.org/abs/2606.20837
Authors: Felipe Nunes de Souza Camargo
Subjects: Artificial Intelligence (cs.AI); Logic (math.LO)
Comments:

Abstract:Belief revision is a process in which an agent begins to believe in something she previously did not. I begin the paper by presenting, based on (Gärdenfors, 1998; Hansson, 1999), postulates for belief revision that constitute the basis of the AGM theory. I will then briefly show the semantics of a modal logic introduced in (van Ditmarsch, 2005), which I call P. This logic formalizes static epistemic states and has greater expressive power than AGM in doing so because it captures the quantitative notion of "degrees of conviction". The third step is to introduce revision operators on P and, mostly following (van Ditmarsch, 2005), obtain the Dynamic Epistemic Logic (DEL) I call P*. It models processes of belief revision in several ways. Original results are presented in the following two sections. The first one of these sections revolves around a formalization of AGM postulates within P* by proving some theorems related to the satisfaction of those postulates by revisions defined in P*. The last section features an analysis of P*'s revisions that go beyond the mere satisfaction of postulates. I compare their formal behavior with respect to some philosophical criteria. At last, I conclude that the functions presented in (van Ditmarsch, 2005) are not good formalizations of the philosophical intuition behind AGM. Instead, it is captured by the function *^0 originally defined in this paper (but highly inspired by (van Benthem, 2007)). An implementation of this function is also provided.

[AI-256] What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data

Link: https://arxiv.org/abs/2606.20814
Authors: Yuchen Zhang,Anietta Weckauff,Diego Garcia-Olano,Maksym Andriushchenko
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Emergent misalignment (EM) is a phenomenon in which models generalize with narrow fine-tuning, leading to broad (yet uneven) misalignment across evaluation questions. We study EM and its variability directly through the components of fine-tuning: training dynamics, model priors, and data. (1) We first explored how in-domain training loss relates to out-of-domain alignment scores across datasets and model families. Then, we tried to induce potential alternative local minima through different learning schedules for one narrow fine-tuning, but did not find strong runs with better broad alignment scores conditioned on similar or lower training loss. (2) We found that although the mean and standard deviations of the misaligned model scores are usually statistically different from those of the pre-trained model, there are some potential signals on overall positive correlation. The evaluation prompt-only activations from both the pre-trained and the original instruct models (prior to narrow fine-tuning) could predict fine-grained alignment scores after narrow fine-tuning. (3) Finally, we compared activation deltas before and after narrow fine-tuning and found moderate-to-high subspace overlap and similarity between the resulting activation shifts for training and evaluation prompts. Subspace overlaps between training and evaluation prompt activations correlate with their shifts’ similarities when measuring with the last prompt-token activations. The train-evaluation data prompt overlap is controlled against overlap computed from random vectors and evaluation prompts activations.

[AI-257] B[FM]2: Brain Foundation Model via Flow Matching with SplitUNet

Link: https://arxiv.org/abs/2606.20812
Authors: Jaedong Hwang, Kathleen Zhang, Wei Dai, Konstantinos Kontras, Maarten Vanmarcke, Maarten De Vos, Ila Fiete, Paul Pu Liang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:EEG foundation models can learn generalizable representations from large-scale EEG corpora to enable single-backbone transfer across diverse clinical and brain-computer interface tasks. Existing models typically discretize the continuous multi-channel EEG waveform into patches or codebook tokens and train a transformer with masked self-supervision. Recognizing that this discretization fragments continuous brain rhythms and obscures fine-grained temporal dynamics, we present B[FM]^2 (Brain Foundation Model via Flow Matching), whose inductive bias aligns with the data by pretraining directly on the raw signal using continuous-time flow matching without patches, tokenization, or masking. However, multi-channel EEG signals pose an architectural challenge for flow matching: time is densely sampled and highly autocorrelated (thousands of timepoints), while the electrode axis is short (tens of channels) at distinct scalp positions. To address this time-electrode asymmetry, we introduce SplitUNet, a velocity network that factorizes each block into separate 1D temporal and 1D electrode convolutions and downsamples only along time, preserving electrode topology throughout the hierarchy. B[FM]^2 sets a new state of the art on 7 of 9 standard downstream EEG classification tasks, using a pretraining budget of only 36,895 segments (≈307 h), 1-2 orders of magnitude (≈30x) less than required by existing EEG foundation models. Further, it generates synthetic EEGs that two board-certified neurologists cannot distinguish from brain data (Cohen’s κ = -0.096).

[AI-258] Fara-1.5: Scalable Learning Environments for Computer Use Agents

Link: https://arxiv.org/abs/2606.20785
Authors: Ahmed Awadallah, Sahil Gupta, Yash Lara, Yadong Lu, Hussein Mozannar, Akshay Nambi, Zach Nussbaum, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Luiz do Valle, Vibhav Vineet, Spencer Whitehead, Andrew Zhao
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Collecting computer use data from human demonstrations is expensive and slow, motivating the need for scalable generation strategies. This requires two key ingredients: environments in which agents can act and verifiers that can judge whether their demonstrations succeeded. We introduce FaraGen1.5, a scalable data pipeline for computer use agents composed of three modular components: environments, solvers, and verifiers. FaraGen1.5 uses both live websites and synthetic environments that faithfully simulate domains gated by authentication or that require irreversible actions. It employs a solver harness that can be powered by multiple models, including strong frontier models such as GPT-5.4, and also incorporates a user simulator to enable multi-turn rollouts. Finally, FaraGen1.5 scores the resulting trajectories with three complementary verifiers covering task correctness, efficiency, and critical-point adherence. Using data produced by this pipeline, we train Fara1.5, a family of native computer use agents (CUAs) at three scales built on Qwen3.5 (4B, 9B, and 27B). To train these models, we employ a supervised finetuning (SFT) recipe that carefully balances data from FaraGen1.5 for broad coverage, specific high-value tasks, and target model deficiencies in an iterative approach. Each model sets a new state of the art for its size class on browser-use benchmarks: Fara1.5-9B reaches 63.4% on Online-Mind2Web and 86.6% on WebVoyager, while Fara1.5-27B achieves 72.3% on Online-Mind2Web, which is competitive with much larger proprietary systems.

[AI-259] Formally Verified Code Synthesis for Structured Data Translation in a Medical Internet of Things ICML2026 ALT

Link: https://arxiv.org/abs/2606.20776
Authors: Colin Samplawski, Adam D. Cobb
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Spotlight Paper at the Workshop on Structured Data for Health at ICML 2026

Abstract:In this work we present an LLM-powered evolutionary code synthesis system for structured data translation in a Medical Internet of Things setting. A key challenge in this domain is ensuring that the synthesized code is trustworthy and reliable. To this end, we integrate a formal verification step into our code synthesis pipeline to ensure that any generated code is guaranteed to satisfy predefined requirements. In particular, we present a case study of integrating a novel device (a pulse oximeter) into the existing network of devices. Our system generates a formally verified translation between the device’s JSON schema and the Fast Healthcare Interoperability Resources (FHIR) format used by the wider system. This formal verification stage ensures that structured data translated by the generated code will always conform to the target output schema. We provide a set of experimental results which demonstrate that our system is able to consistently generate correct translations at low cost.

[AI-260] A Topology-Aware Memory-Centric Architecture that Separates Root-Cause Derivation from Root-Cause Explanation

Link: https://arxiv.org/abs/2606.20758
Authors: Momil Seedat
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern microservice deployments fail in ways that are easy to detect and hard to explain. When a fault propagates along service dependencies, alerts fire in floods, dashboards multiply, and the scarce resource, an engineer who understands how the services relate, is consumed reconstructing context that the monitoring stack discarded. We argue that the missing ingredient in autonomous operations is not a better anomaly detector or a larger language model, but operational memory: a persistent, structured representation of how a system normally behaves, how its parts depend on one another, and how it has failed before. We present OpsCortex, a working multi-agent prototype that organizes this memory into four tiers and uses it to separate two tasks the field usually conflates: deriving a root cause and explaining it. Root cause is computed deterministically from a learned dependency graph and the temporal ordering of threshold crossings; a large language model (LLM) is then asked only to explain, confirm, and recommend, using evidence the system has already assembled. We motivate the design with two documented production cascading failures, review representative literature on observability, anomaly detection, graph-based localization, and LLM-assisted diagnosis, and show how each architectural choice maps directly to a failure mode those incidents exhibit. The prototype is validated on an instrumented e-commerce benchmark with eight injectable failure scenarios.
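
The deterministic derivation step in this abstract can be illustrated with a minimal sketch. It assumes two inputs, a dependency map and the timestamp of each service's first threshold crossing. The function name `derive_root_cause`, the service names, and the tie-breaking rule are hypothetical and are not taken from the paper, whose actual algorithm the abstract does not spell out.

```python
from typing import Dict, List, Optional


def derive_root_cause(
    depends_on: Dict[str, List[str]],
    first_crossing: Dict[str, float],
) -> Optional[str]:
    """Pick an anomalous service that has no anomalous dependency.

    depends_on maps a service to the services it calls (hypothetical format).
    first_crossing maps each anomalous service to the time its metric first
    crossed a threshold. Ties are broken by earliest crossing, then by name.
    """
    anomalous = set(first_crossing)
    if not anomalous:
        return None
    # A service whose dependencies are all healthy cannot be blaming anyone else.
    candidates = [
        svc for svc in anomalous
        if not any(dep in anomalous for dep in depends_on.get(svc, ()))
    ]
    if not candidates:
        # Assumed fallback for dependency cycles: consider every anomalous service.
        candidates = list(anomalous)
    return min(candidates, key=lambda svc: (first_crossing[svc], svc))


# Illustrative data only: a database fault that propagates upward.
depends_on = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "db"],
    "payments": ["db"],
    "catalog": ["db"],
}
first_crossing = {"db": 100.0, "payments": 103.5, "checkout": 104.2, "frontend": 106.0}

root = derive_root_cause(depends_on, first_crossing)
print(root)  # db
```

Because the result depends only on the graph and the timestamps, the same incident always yields the same answer, which is the property the abstract attributes to the derivation step before an LLM is asked to explain it.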

[AI-261] Massive Activations Are Architecturally Robust: A Controlled Scratch/Commitment Residual Stream Test

Link: https://arxiv.org/abs/2606.20743
Authors: Maruthi Vemula (University of North Carolina at Chapel Hill)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, 2 tables. Code at this https URL

Abstract:Trained transformers reliably develop massive activations, a small number of hidden dimensions whose magnitude is far above the median and which concentrate on the sequence-start token. Whether these outliers are a removable artifact of the residual stream’s overloaded read and write role, or instead a functional necessity, is actively debated. We test the artifact hypothesis directly, with an architectural intervention. Our architecture, Ledger Residuals, splits the residual stream into a mutable scratch stream (Deliberation) that intermediate computation may freely overwrite and a protected, decode-only accumulator (Commitment) that holds the representation the model reads out. If massive activations exist only because one stream is forced to be both scratchpad and answer, then a dedicated answer channel should remove the need for them. We find that it does not. In matched-loss language models at the 160M and 290M scales, the model rebuilds the canonical fixed-dimension, start-token outlier inside the protected channel. The rebuilt feature is smaller in magnitude than in a standard transformer but more sharply concentrated on the start token, and a stronger sparsity penalty makes it more persistent and more concentrated still, rather than removing it. Massive activations therefore look architecturally robust: they re-emerge in whichever representation the model decodes from, which is what we would expect if they are functional rather than incidental. We release our architecture and measurement code.

[AI-262] A Digital Twin Framework for Traffic-Aware UAV Pavement Monitoring without Lane Closure

Link: https://arxiv.org/abs/2606.20742
Authors: Yamil Uchani, Grace Abigail Luna Verdueta, Mauricio Figueroa, Edwin Salcedo
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the proceedings of the 6th Annual IEEE International Conference on Digital Twins and Parallel Intelligence (DTPI 2026)

Abstract:UAV-based pavement inspection can reduce the cost and risk of road-surface monitoring, but real-world deployment remains difficult when traffic, pedestrians, and temporary occlusions affect the visibility of defects. This paper presents a Unity-based digital twin framework for traffic-aware UAV pavement monitoring without lane closure. The proposed environment integrates procedurally generated road defects, dynamic vehicles and pedestrians, autonomous UAV navigation, and an embedded road-damage perception pipeline. The perception module uses a two-stage approach: a lightweight YOLOv8n detector first localises road defects, pedestrians, and vehicles, while a second classifier distinguishes among potholes, single cracks, and crocodile cracks. On the simulator test set, the full pipeline achieved 99.26% overall accuracy across five classes. The digital twin was then used to evaluate three recovery strategies for occluded road segments: hover-and-recheck, micro-repositioning, and skip-and-revisit. Experiments were conducted across different traffic densities and flight altitudes using coverage, mission time, energy consumption, and revisit ratio as operational metrics. Results show that flight altitude has a strong influence on inspection coverage and that adaptive recovery improves performance under occlusion. In particular, hover-and-recheck achieved the most consistent coverage under medium and high traffic conditions, reaching up to 97.03% coverage, while skip-and-revisit was most effective in low-traffic scenarios, reaching 97.95% coverage at medium altitude. These results demonstrate that digital twins can support the development and evaluation of traffic-aware UAV inspection strategies before real-world deployment.

[AI-263] Repeated Shared Access Enables Grokking, but Edit Propagation Depends on a Fine-Grained Addressable Memory

Link: https://arxiv.org/abs/2606.20737
Authors: Yanan Niu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 1 figure, 13 tables

Abstract:We study factual edit propagation in a controlled synthetic knowledge-graph QA setting, comparing four architectures that cross loop recurrence with shared memory access: dense (Dense), looped (Loop), dense with shared memory (Dense+Mem), and looped with shared memory (LMC). Dense fits in-distribution 2-hop compositions but fails OOD; both looped recomputation and memory rereading cross this OOD grokking barrier, indicating that repeated shared access – not a specific architecture – is the common ingredient for learning. Editing, however, separates the substrates along a different axis. On a shared pre-edit-correct ID set, a single-row factual edit propagates strongly in the two memory-bearing cells (LMC 0.78-0.92, Dense+Mem 0.71-0.96) and only weakly in the others (Loop 0.04-0.30, Dense 0.00-0.03); the separation is statistically clean (Mann-Whitney p=0.008 between memory and non-memory cells, p=0.55 between the two memory cells, though n=5 vs n=5 is underpowered to rule out a moderate gap). In LMC, atomic facts localize to dominant memory sites that composition rereads, and a one-row value edit on LMC’s own pre-edit-correct probes achieves 100% direct success with mean 0.989 intended propagation while moving unrelated facts ~0.1% of the time (matched specificity across substrates pending). A coarse hold-answer-subspace (HOLDANS) interchange diagnostic is consistent with this ordering, suggesting that what separates the substrates is when an edited fact is injected and how much computation remains afterwards to reuse it. These results dissociate learning competence from editing affordance: repeated shared access suffices to grok, but edit propagation depends on whether the substrate exposes a fine-grained addressable memory the forward computation can write to and later reread – an affordance that loop recurrence provides only partially.
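
The abstract's caveat about n=5 vs n=5 can be made concrete with a short calculation. The sketch below assumes an exact two-sided Mann-Whitney test with no ties; the abstract does not say which variant the authors used. Under that assumption, the smallest attainable p-value occurs when the two groups do not overlap at all, and the helper name `min_two_sided_exact_p` is hypothetical.

```python
from math import comb


def min_two_sided_exact_p(n1: int, n2: int) -> float:
    """Smallest two-sided exact Mann-Whitney p-value for group sizes n1 and n2 (no ties)."""
    # Under the null, every choice of which n1 of the n1+n2 ranks belong to
    # group 1 is equally likely.
    total_arrangements = comb(n1 + n2, n1)
    # Exactly two arrangements are maximally extreme: group 1 holds either
    # all of the lowest ranks or all of the highest ranks.
    return 2 / total_arrangements


p_floor = min_two_sided_exact_p(5, 5)
print(f"{p_floor:.4f}")  # 0.0079
```

Under these assumptions, the reported p=0.008 equals the floor for five runs per group, which is what complete separation between memory and non-memory cells would produce. It also shows why such a small sample cannot yield a stronger p-value regardless of effect size.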

[AI-264] When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration

Link: https://arxiv.org/abs/2606.20724
Authors: Aagam Sogani, Botao Rui, Swetha Vaidyanathan, Rishi Agarwal, Minghao Yan, Shivaram Venkataraman
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Long-horizon web agents often fail in ways hidden by final-answer evaluation: they may visit useful pages, produce a well-formed answer, and terminate confidently while still missing fields, over-including unsupported items, or relying on stale evidence. We study these failures with Parallel WebBench, a parallel web-exploration benchmark containing 1,679 verified records: 350 manually curated parallel tasks and 1,329 reconstructed records with verified URL-based trajectories. We train WebExplorer-style agents with GRPO under human-only, balanced human-synthetic, and synthetic-heavy data mixtures. At 16k context and 16 interaction rounds, the best GRPO model improves completion over WebExplorer-8B from 50.7% to 96.0% and GPT-4.1-mini-judged element-wise F1 from 0.2489 to 0.4529, but binary accuracy remains far below completion. Trace-level analysis identifies three persistent failure modes: context-bound search loops, premature termination on partial answers, and synthesis collapse after relevant evidence has already been retrieved. These results show that synthetic-data GRPO reduces abstention and improves partial correctness, but leaves a completion-correctness gap that requires evidence-grounded coverage and synthesis diagnostics.

[AI-265] Escape from Delusional Echo Trap: Symmetry Breaking Stochastic Dynamics and Mathematical Mitigation Strategies for Algorithmic Sycophancy

Link: https://arxiv.org/abs/2606.20718
Authors: Sayantari Ghosh, Saumik Bhattacharya, Partha Pratim Chakrabarti
Categories: Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Physics and Society (physics.soc-ph)
Comments:

Abstract:We propose a rigorous and systematic mathematical framework for tracking the cognitive trajectories of a user, in the context of algorithmic sycophancy and AI-driven delusional spiraling. Using tools from dynamical systems theory and stochastic differential equations, we explore how individuals perceive, interpret, and update their beliefs as they interact with AI chatbots that possess hidden traits of sycophancy. We treat the evolving conviction as a continuous log-odds state variable, coupled into a stochastic differential equation, navigating a multi-valley potential energy landscape. Our analysis reveals several critical observations governing the stability and rigidity of belief dynamics. We demonstrate that the baseline prior perception of the individual is systematically enhanced by sycophantic feedback beyond a critical threshold. Here, the perceptual potential landscape undergoes a structural phase transition that severely deepens any incremental initial tilt present in the baseline state, transforming the landscape and giving rise to deep, highly resilient attractor basins that trap the individual in unshakeable, self-reinforcing, delusional convictions. Finally, we demonstrate that genuine external information can successfully challenge these rigid states. If this incoming evidence is strong and authentic enough to overcome the internal feedback barrier, it can correct the structural asymmetry caused by sycophancy, inducing a perception reversal that successfully restores the objective belief state.

[AI-266] FairTutor: Equity-Aware Pedagogical LLM Routing for Budget-Constrained AI Tutoring KDD2026

Link: https://arxiv.org/abs/2606.20713
Authors: Qingyang Xu
Categories: Artificial Intelligence (cs.AI)
Comments: AI for Education Day at SIGKDD 2026, 14 pages, 2 figures

Abstract:Generative AI tutors provide real-time, personalized learning support, but also create a new education inequity: students with access to premium AI services may receive clearer explanations, more personalized guidance, and better scaffolding than students limited to free or low-cost services. To address this challenge, we propose FairTutor, an equity-aware model-routing framework that achieves cost-effective AI tutoring via pedagogically motivated multi-agent orchestration. FairTutor combines query analysis, pedagogical planning, low-cost model generation, evaluator-guided critique and revision, and selective escalation to premium AI models. We introduce access-tier AI Education (AIED) Advantage Gap to measure the quality difference between premium-access and budget-constrained tutoring, and TutorAccessEval, a benchmark spanning math, reading, writing, science, and language learning. Empirical evaluations show that FairTutor achieves 97.1% of premium pedagogical quality (in floor-adjusted Likert scale) while reducing serving cost by 71.6%. Sensitivity analysis reveals a tunable cost–quality Pareto frontier, enabling FairTutor to be tailored to the needs of diverse student populations.

[AI-267] Artificial Intelligence as Monism: Ontological, Organisational, and Methodological Implications

Link: https://arxiv.org/abs/2606.20704
Authors: Bertrand K. Hassani
Categories: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:This paper argues that Artificial Intelligence should be understood as a form of monism: a unified substance that cannot be decomposed into separate elements such as data, algorithms, or technical architectures. Drawing from philosophical traditions of monism, dualism, and holism, the paper contends that AI is not merely a collection of components but a single, indivisible essence reflecting the phenomena it replicates. Treating AI as monism has deep implications across multiple dimensions. Epistemologically, it positions AI as the central interpretive force across technological, organisational, and societal domains, while raising ethical and existential concerns regarding singularity, the homogenisation of innovation, and the concentration of decision-making power. At the organisational level, a monistic approach challenges traditional siloed structures, advocating instead for transversal, problem-centric teams whose mandate derives from the integrity of the problem rather than from departmental hierarchy. In project management, it implies a unified vision and an integrated evaluation of complexity in which no single stakeholder perspective dominates the assessment of outcomes. In data and information management, it calls for architectures that reflect the irreducible unity of the phenomena being modelled. Ultimately, this paper calls for a paradigm shift in how AI is conceptualised, governed, and integrated, suggesting that only by embracing AI as monism can organisations achieve genuine agility and avoid the structural inefficiencies inherent to reductionist approaches. 

[AI-268] JPPD: Joint Prediction-Planning Diffusion with Differentiable Safety Guidance for Dynamic Obstacle Avoidance in Intelligent Transportation Systems

Link: https://arxiv.org/abs/2606.20686
Authors: Jiahao Wu, Shengwen Yu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures

Abstract:Shared-space transportation operation requires low-speed autonomous platforms to navigate safely and efficiently among pedestrians, service robots, micromobility users, carts, and other road users. Most existing systems decompose this problem into trajectory prediction followed by motion planning, which creates one-way information flow: predicted participant futures influence the robot plan, but the selected robot plan cannot influence the predicted multi-agent evolution. This paper presents a joint prediction-planning diffusion framework that treats participant prediction and robot planning as a single conditional trajectory generation problem, where the model samples the future robot trajectory and all participant trajectories from one coupled distribution using a causal Transformer with cross-trajectory attention. To replace heuristic repulsive post-processing, the framework introduces differentiable safety potential guidance, a time-varying occupancy-probability potential whose gradient directly guides the joint sampler, and conditional flow matching is used to reduce inference steps while preserving multimodal trajectory diversity. The evaluation emphasizes shared-space operational effects, including near misses, blockage time, induced participant deviation, hard-braking events, and embedded latency, rather than treating average displacement error and final displacement error as the main result. Experiments in scenario-grounded simulation, naturalistic pedestrian replay, Isaac Sim validation, and ROS/Orin deployment show that joint sampling improves tail safety and runtime efficiency over a separated prediction-then-planning baseline.

[AI-269] VQ4SNN: Vector Quantization for Memory-Efficient FPGA Spiking Neural Networks

Link: https://arxiv.org/abs/2606.20675
Authors: Dimitrios Sekertzis, Giorgos Dimitrakopoulos
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Spiking Neural Networks (SNNs) offer an energy-efficient paradigm for edge AI, making them attractive for hardware acceleration. However, deploying dense SNNs on FPGAs is constrained by limited on-chip memory for synaptic weight storage. To address this bottleneck, we propose VQ4SNN, a hardware-aware architecture that reduces memory requirements through Vector Quantization (VQ)-based weight sharing. To the best of our knowledge, this is the first application of VQ to pipelined spatial-dataflow SNN accelerators on FPGAs. VQ4SNN replaces conventional weight storage with a two-level memory organization consisting of compact pointers and a shared codebook of quantized weight vectors. The proposed design integrates FPGA-aware memory mapping with analytical VQ parameter selection, enabling efficient deployment on such accelerators while preserving inference accuracy. The experimental results show a reduction of 52-61% in the total number of BRAMs compared to the state-of-the-art uncompressed FPGA SNNs without increasing overall logic utilization.
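
The two-level organization described in this abstract (compact pointers plus a shared codebook) can be illustrated with a back-of-the-envelope storage estimate. This is a generic vector-quantization accounting sketch, not the paper's BRAM mapping: the function name `vq_memory_bits` and every number in the example are hypothetical, and FPGA block granularity is ignored.

```python
from math import ceil, log2


def vq_memory_bits(num_weights: int, bits_per_weight: int,
                   vector_len: int, codebook_size: int):
    """Return (uncompressed_bits, compressed_bits) for pointer-plus-codebook storage."""
    uncompressed = num_weights * bits_per_weight
    # Weights are grouped into vectors; each vector is stored as one pointer.
    num_vectors = ceil(num_weights / vector_len)
    pointer_bits = max(1, ceil(log2(codebook_size)))
    pointers = num_vectors * pointer_bits
    # The codebook itself holds full-precision vectors and is shared by all pointers.
    codebook = codebook_size * vector_len * bits_per_weight
    return uncompressed, pointers + codebook


# Illustrative layer: 65,536 8-bit weights, 4-element vectors, 256 codebook entries.
dense_bits, shared_bits = vq_memory_bits(65536, 8, 4, 256)
print(dense_bits, shared_bits)  # 524288 139264
```

In this toy setting, the pointers dominate the compressed footprint, so the vector length and codebook size trade accuracy against memory. The abstract refers to choosing these values as analytical VQ parameter selection.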

[AI-270] A Formal Tool for Verification of Probabilistic Spiking Neural Networks Based on Quotient Abstractions ICANN26

Link: https://arxiv.org/abs/2606.20674
Authors: Nikan Zandian Jazi, Elisabetta De Maria, Christopher Leturc
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: 15 pages. A shortened version of the paper was submitted to and accepted at ICANN 26

Abstract:Spiking Neural Networks (SNNs) model biological neural dynamics more faithfully than classical artificial networks, but their stochastic, event-driven computation – rooted in ion-channel noise and unreliable synaptic vesicle release – demands probabilistic models for which deterministic abstractions are mathematically inadequate. Formal verification of such models via probabilistic model checking faces a fundamental barrier: the state space explosion problem, where the Discrete-Time Markov Chain (DTMC) encoding grows exponentially with the number of neurons. General-purpose quotient model abstractions [1] can in principle mitigate this growth by partitioning membrane potentials into equivalence classes, but a naïve application to SNNs discards synaptic weight information, limiting the properties that can be verified. This paper introduces a weight-discretized quotient model abstraction that maps continuous synaptic weights to a compact integer range while preserving the relative contribution of each synapse, and presents CogSpike, a unified workbench that integrates SNN design, simulation, and PRISM-based formal verification within a single isomorphic tool chain. The discretization is accompanied by formal correctness guarantees: a two-sided fidelity theorem confines any firing disagreement to a bounded gray zone around threshold, and an Asymptotic Silence theorem gives the exact limit guarantee that unforced neurons fall permanently silent. A topology-dependent scaling analysis shows that the state space reduction compounds exponentially – approximately 17× per neuron for discretization parameter W = 3 – enabling verification of networks that are otherwise intractable, as confirmed empirically across seven canonical topologies.
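
As a rough illustration of what mapping continuous weights to a compact integer range can look like, the sketch below scales weights by the largest magnitude and rounds them into [-W, W]. This is an assumed scheme chosen for simplicity; the abstract does not give the paper's actual mapping, and `discretize_weights` is a hypothetical name. The last lines compound the abstract's "approximately 17× per neuron" figure for a hypothetical five-neuron network.

```python
from typing import List


def discretize_weights(weights: List[float], w_max: int) -> List[int]:
    """Map real-valued weights to integers in [-w_max, w_max] (assumed scheme)."""
    if not weights:
        return []
    largest = max(abs(w) for w in weights)
    if largest == 0:
        return [0 for _ in weights]
    # Scaling by the largest magnitude keeps the ordering and rough ratios of synapses.
    return [round(w / largest * w_max) for w in weights]


levels = discretize_weights([0.9, -0.6, 0.3, 0.05], w_max=3)
print(levels)  # [3, -2, 1, 0]

# Compounding a 17x per-neuron reduction over a hypothetical 5-neuron network.
n_neurons = 5
approx_reduction = 17 ** n_neurons
print(approx_reduction)  # 1419857
```

The weakest synapse in the example collapses to 0. A formal bound, such as the gray-zone fidelity theorem mentioned in the abstract, is needed to characterize this kind of information loss.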

[AI-271] Genetic Algorithm Based Coordination and Optimization Model for Generation Grid Load Storage in Active Distribution Networks

Link: https://arxiv.org/abs/2606.20672
Authors: Jinlu Zhang, Fujian Chi, Tianhan Ling, Yulong He, Kejia Zhang, Hongxing Lv, Sheng Wang
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract:We develop an optimization framework that combines fuzzy logic and genetic algorithms for risk assessment and coordination of generation, grid connection, load, and energy storage facilities in active distribution networks. To capture the system uncertainties caused by weather factors and dynamic user demands, the framework uses fuzzy set representations to model renewable energy generation, load curves, and market prices. During the evolutionary process, the genetic algorithm incorporates fuzzy elements into system management, adjusting through penalty factors to improve stability and to seek better scheduling or power dispatch solutions. By penalizing uncertainty and constraint violations, the fitness function of the hybrid model aims to optimize the expected operational costs, so it can produce feasible and economical results under unknown parameter conditions. Under conditions with a large supply of renewable energy and installed energy storage systems, simulation results for the IEEE-69 power system indicate that the fuzzy genetic algorithm strategy effectively reduces technical constraints compared to deterministic optimization schemes, while the total investment remains at a similar level. Other data indicate that, through fuzzy reasoning optimization, unreasonable choices or impossible situations are avoided during the network adaptation process. This framework helps to construct an effective computational scheme that is aware of uncertainty in its outputs, and it provides a scientific basis for uncertainty assessment in distribution network planning.

[AI-272] Towards CSI-Native Foundation Models: A Channel-Adaptive Roadmap for 6G

Link: https://arxiv.org/abs/2606.20670
Authors: Chenyu Zhang, Xinchen Lyu, Chenshan Ren, Shuhan Liu, Qimei Cui
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 7 pages, 5 figures, submitted to IEEE WCM

Abstract:Wireless foundation models offer a path toward reusable channel state information (CSI) intelligence for sixth-generation (6G) systems. However, existing generic-backbone adaptation and CSI pretraining methods often treat CSI as task tensors rather than propagation-conditioned channel responses, thereby failing to capture the intrinsic time-frequency-spatial geometry of wireless environments. This paper presents a channel-adaptive roadmap toward CSI-native foundation models, proposing a unified framework that aligns pretraining, positional modeling, and attention control with three channel requirements: scale-aware heterogeneous exposure, physical time-frequency-antenna coordinates, and correlation-bounded token interaction. Extensive experiments demonstrate the superiority of the proposed framework across three dimensions: zero-shot generalization, reducing NMSE by more than 4 dB across spatial-temporal-frequency tasks; scale extrapolation, yielding up to a 5.4 dB gain under 8 times unseen antenna scaling; and inference efficiency, accelerating mobility-aware processing by up to 18.8%. A system-level evaluation with Sionna SYS further shows that the proposed framework uses only 7.01% of dense-pilot overhead, reaches -18.64 dB average NMSE, and improves average net spectral efficiency by 36.6% over dense LMMSE and 15.5% over WiFo, indicating that CSI-native representation learning can support pilot-efficient radio access.

[AI-273] Agent Behavior Mining: Generative AI Agent Governance in Business Processes

Link: https://arxiv.org/abs/2606.20669
Authors: Hoang Vu, Maximilian Körner, Adrian Rebmann, Gabriel Kevorkian, Michael Perscheid, Gregor Berg, Timotheus Kampik
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted at BPM conference 2026 management main track

Abstract:As organizations increasingly deploy generative AI agents to automate business processes, they face a governance dilemma: although these agents can increase operational flexibility, their non-deterministic nature challenges the control and standardization that Business Process Management seeks to enforce. This paper addresses this invisible autonomy risk by introducing Agent Behavior Mining, a governance capability that enables the application of process mining techniques to render generative AI agent decision-making observable and traceable. We (1) improve the understanding of generative AI agent behavior through an event data model that translates granular agent activities – including reasoning traces, tool usage, and token costs – into standardized process logs; (2) instantiate the data model in a multi-agent order-to-cash implementation, demonstrating how process managers can leverage agent logs to detect policy deviations and quantify operational variability; and (3) evaluate the perceived practical utility of the approach in an exploratory study with 18 industry practitioners. The results indicate that practitioners view behavioral transparency as a prerequisite for trust and consider the ability to examine agent reasoning as an important governance requirement for the next generation of AI-driven business processes.

[AI-274] BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems ICML2026

Link: https://arxiv.org/abs/2606.20668
Authors: Leonhard Waibl, Felix Michalak, Hadrien Mariaccia
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the ICML 2026 Workshop on Trustworthy AI for Good (AI4GOOD). 2 figures; main text plus appendices

Abstract:LLM supervision systems, namely input/output moderation filters and jailbreak detectors, are the primary safeguard against misuse in deployed AI applications, yet existing benchmarks are often vendor-biased, omit cost and latency, and rarely compare specialized guardrails against repurposed generalist LLMs. We present BELLS-O (Benchmark for the Evaluation of LLM Supervision Systems, Operational), the first independent operational benchmark of LLM supervision systems. BELLS-O evaluates 28 systems from 17 providers: every major specialized guardrail (e.g., LlamaGuard-4, ShieldGemma-2, Lakera Guard) and frontier generalists repurposed as supervisors (e.g., GPT-5.4, Claude Sonnet 4.6, Grok-4.1), jointly on detection rate, false-positive rate, latency, and monetary cost. We cover input/output moderation across 11 harm categories and jailbreak detection across 13 attack techniques, using in-house datasets built from handcrafted prompts, expert-curated samples, and quality-controlled synthetic generation. To suppress latent generator fingerprints in synthetic data, every generated sample is paraphrased. Mapping the Pareto frontier reveals use-case-dependent tradeoffs. On content moderation, specialized supervisors are operationally dominant: top systems match frontier LLMs on detection (~95% vs. 94%) at comparably low false-positive rates (≤2%), while running 5-10x faster and ~10x cheaper. On jailbreak detection, the tradeoff shifts: frontier LLMs achieve higher detection and lower false-positive rates but at 10-50x higher cost and 5-10x higher latency. We release the benchmark, framework, leaderboard, and datasets as the first vendor-neutral basis for selecting safeguards under real deployment constraints.

[AI-275] A Quantum-Assisted Agentic Distributed Artificial Intelligence Framework for Deadline-Bounded Orchestration of Hybrid Renewable Microgrids

Link: https://arxiv.org/abs/2606.20667
Authors: Iacovos I. Ioannou, Saher Javaid, Minella Bezha, Yasuo Tan, Naoto Nagaoka, Vasos Vassiliou
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The real-time orchestration of microgrids that combine fluctuating renewable sources, dispatchable units, storage and curtailable consumers requires the repeated solution of combinatorial dispatch and coalition formation problems under hard control deadlines. In this paper, a quantum-assisted agentic distributed artificial intelligence (DAI) framework is proposed in which the dispatch problem of each control slot is formulated as a quadratic unconstrained binary optimization (QUBO) problem by Belief-Desire-Intention extended (BDIx) agents and is solved by a portfolio of quantum, quantum-inspired and classical solvers. Solver selection is elevated to a first-class agentic deliberation action of the coordinator agent. Learned beliefs about solver latencies are maintained and the solver intention that is expected to satisfy the prevailing deliberation deadline is committed in each slot. In addition, a belief-shaped storage valuation mechanism is introduced through which the storage agent prices its energy at a discounted future-peak value, injecting intertemporal information into the otherwise myopic per-slot optimization. The framework is evaluated on a 24-hour simulation of a grid-connected microgrid with photovoltaic, wind, battery, genset and demand-response assets, with the Quantum Approximate Optimization Algorithm (QAOA) executed by statevector simulation and benchmarked per slot against tabu search, simulated annealing, binary particle swarm optimization, greedy descent and exhaustive enumeration. Zero deliberation deadlines are missed, the committed dispatch attains the exact optimum on every slot and the realized daily cost of 146.24 EUR equals the exact lower bound, with 97.83 percent renewable utilization and zero unserved energy. When the storage valuation mechanism is deactivated, the daily cost is increased to 152.75 EUR, a 4.5 percent increase.
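
To make the QUBO formulation in this abstract concrete, here is a minimal sketch that minimizes x^T Q x over binary vectors by exhaustive enumeration, the exact baseline the abstract lists among its solvers. The matrix `Q` and the three-unit toy problem are invented for illustration and are not taken from the paper.

```python
from itertools import product


def qubo_energy(Q, x):
    """Return x^T Q x for a binary vector x and a square matrix Q (list of lists)."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def solve_qubo_exhaustive(Q):
    """Enumerate all 2^n binary assignments and return (best_x, best_energy)."""
    n = len(Q)
    best_x, best_e = None, float("inf")
    for x in product((0, 1), repeat=n):
        e = qubo_energy(Q, x)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e


# Hypothetical toy instance: diagonal entries reward switching a unit on,
# off-diagonal entries penalize switching two neighbouring units on together.
Q = [
    [-3.0, 2.0, 0.0],
    [0.0, -2.0, 2.0],
    [0.0, 0.0, -1.0],
]
best_x, best_e = solve_qubo_exhaustive(Q)
```

Exhaustive enumeration is only practical for small n, which is presumably why the abstract pairs it with heuristic and quantum solvers under a deadline.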

[AI-276] Robust Auto-associative Memory via Convolutional Restricted Hopfield Networks

Link: https://arxiv.org/abs/2606.20666
Authors: Ci Lin, Tet Yeap, Iluju Kiringa
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Associative memory models play a fundamental role in pattern retrieval, but their performance often degrades under adversarial perturbations and severe input corruptions. Existing approaches, including Modern Hopfield Networks (MHNs) and Predictive Coding Networks (PCNs), exhibit limitations in balancing storage capacity, computational efficiency, and robustness. In this paper, we propose Convolutional Restricted Hopfield Networks (CRHNs), which integrate convolutional feature extraction with attractor-based memory retrieval in a structured latent space. The proposed model leverages subspace representations and fixed-point dynamics, trained via a gradient-free Subspace Rotation Algorithm (SRA), to enhance both robustness and memory capacity. Extensive experiments on the Self-Taught Learning (STL) dataset demonstrate that CRHNs consistently achieve significantly lower reconstruction error compared to MHNs and PCNs across a wide range of adversarial attacks and input degradations. In many cases, CRHNs reduce reconstruction error by an order of magnitude and maintain stable retrieval performance under increasing perturbation strength. Statistical analysis further confirms that these improvements are significant (p < 0.01). These results highlight the effectiveness of attractor-based memory mechanisms and suggest that CRHNs provide a promising framework for building robust and scalable associative memory systems.

[AI-277] Skill Coverage: A Test Adequacy Metric for Agent Skills

Link: https://arxiv.org/abs/2606.20659
Authors: Boyin Tan, Xiaowei Huang, Youcheng Sun
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Agent skills encode reusable procedural knowledge that guides large language model agents across tasks and execution contexts. Existing evaluations primarily assess skills through task level outcomes, yet task success alone does not reveal which parts of a skill have been exercised or which remain untested. We introduce skill coverage, a test adequacy metric that treats the skill artifact as the object under test. Our approach extracts observable skill behavior constraints from skill documents and measures whether an agent trajectory provides sufficient evidence to exercise and evaluate each constraint. Skill coverage uses a binary cover or not cover judgment, which reports whether a documented behavior has been exercised with sufficient observable evidence, without assigning an additional outcome label to the behavior. Applying skill coverage to SkillsBench reveals that existing benchmark executions cover only 39.90 to 43.98% of skill behavior constraints, suggesting that current benchmark tasks leave large portions of documented skill guidance untested. These findings show that successful task completion does not imply adequate testing of the skill artifact, highlighting skill coverage as a measure of how thoroughly agent skills are tested.
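
The binary cover / not-cover judgment described in this abstract reduces to a simple ratio once the per-constraint judgments are available. The sketch below assumes those judgments are already given as a set of covered constraints; how the paper extracts constraints and decides coverage is not shown here, and the constraint strings are hypothetical.

```python
def skill_coverage(constraints, covered):
    """Fraction of documented constraints judged covered (one binary verdict each)."""
    if not constraints:
        raise ValueError("no constraints were extracted from the skill document")
    hits = sum(1 for c in constraints if c in covered)
    return hits / len(constraints)


# Hypothetical constraints extracted from a skill document
constraints = [
    "validate input file",
    "write output as CSV",
    "log failures",
    "retry on timeout",
    "clean temp files",
]
# Hypothetical subset for which a trajectory gave sufficient observable evidence
covered = {"validate input file", "write output as CSV"}

coverage = skill_coverage(constraints, covered)  # 2 of 5 constraints covered
```

Note that the metric says nothing about whether the covered behaviors succeeded, which matches the abstract's point that coverage carries no outcome label.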

[AI-278] Expected Free Energy-based Planning as Variational Inference

Link: https://arxiv.org/abs/2606.20658
Authors: Wouter W. L. Nuijten, Thijs van de Laar, Bert de Vries
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Planning under uncertainty requires agents to balance goal achievement with information gathering. Active inference addresses this through the Expected Free Energy (EFE), a cost function that unifies instrumental and epistemic objectives. However, existing EFE-based methods typically employ specialized optimization procedures that are difficult to extend or analyze. In this paper, we show that EFE-based planning can be formulated as Variational Free Energy minimization on a generative model augmented with epistemic priors. Our main result demonstrates that minimizing a Variational Free Energy functional with appropriately chosen priors yields a decomposition into expected plan costs (the EFE) plus a complexity term. This formulation reinforces theoretical consistency with the Free Energy Principle by casting planning as the same inferential process that governs perception and learning. We validate our approach on three environments of increasing complexity: a deterministic T-maze, a stochastic Reactivity Maze, and a partially observable MiniGrid DoorKey-8x8 environment. The experiments demonstrate that the epistemic priors induce information-seeking behavior, that the variational formulation yields policy-based inference outperforming plan-based methods under stochastic transitions, and that temporal factorization enables scalability to environments where existing tabular active inference methods cannot operate.

[AI-279] A-Evolve-Training: Autonomous Post-Training of a 30B Model

Link: https://arxiv.org/abs/2606.20657
Authors: Zhan Shi, Bing He, Yisi Sang, Hanqing Lu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Post-training a frontier model is normally weeks of human work: proposing data and recipe changes, launching runs, reading evals, deciding what to keep. We report an autonomous system that runs this loop with no human in the loop, post-training a 30B Nemotron across four rounds over multiple weeks. The autonomously produced model reaches a held-out score of 0.86 against the top human submission’s 0.87 on the public NVIDIA Nemotron-Reasoning Challenge leaderboard, placing 8th of ~4000 at the time of writing. More striking than the number: the loop detected that its own dev metric had stopped tracking external performance on the weakest domain – candidates drove dev to record highs without moving the external target – and revised its own search policy, no longer maximizing dev but seeking interventions that lowered the now-misleading proxy while improving the external target. We treat this as direct, auditable evidence that a scaled autonomous loop can produce discovery, not only optimization: it detected that its measurement frame had become misleading and changed what counted as evidence. We take the operational view that any system worth the “recursive self-improvement” label must eventually perform end-to-end post-training of a frontier-class model; this is one datapoint of that bar being cleared. We do not claim a “first autonomous match” of human researchers. The claim we make is narrower and auditable: to our knowledge, this is the first publicly reported autonomous post-training run at this scale, where prior public autonomous-ML-research demonstrations sit at GPT-2-class (~124M) budgets. The same system also post-trains the 120B and 550B Nemotron; with no public human baseline there, this shows only that the loop closes at that scale, not that its output is competitive – infrastructure evidence, with the effectiveness claim deferred until a comparable human anchor exists.

[AI-280] Learning Splitting Heuristics for Parallel String Solvers

Link: https://arxiv.org/abs/2606.20656
Authors: Chenhao Gao, Peisen Yao
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:String constraint solvers are crucial for reasoning about string-manipulating programs. However, many practical string constraints are undecidable, and real-world applications often present complex constraints that challenge current solvers. The rise of multi-core architectures offers an opportunity for parallel solving. A key parallel solving method is cube-and-conquer, in which the quality of splitting heuristics is critical to effectively dividing the search space. Unfortunately, manually designing the heuristics is labor-intensive, and handcrafted heuristics are often sub-optimal. This paper introduces a data-driven approach to automatically generating splitting heuristics. We frame the problem of selecting a splitting atom as a learning task, using features from input formulas and dynamic data from solver execution. We implement this approach in two popular string solvers, Z3seq and Z3str4, demonstrating that the learned heuristics outperform manually designed ones in the number of solved formulas and the average solving time.

[AI-281] Distributed Model Predictive Control with Adaptive Safety Zones for Multi-Fleet Drone Operations

Link: https://arxiv.org/abs/2606.20651
Authors: Linda Mümken, Diyar Altinses, Michael Schwung, Stefan Lier, Andreas Schwung
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 12 pages, 8 figures, 2 tables. Submitted to IEEE CONES on April 15, 2025. Code: this https URL

Abstract:Autonomous drone swarms in space-constrained environments such as warehouses, inspection corridors, and urban delivery routes must share limited airspace safely at high vehicle density. Existing approaches rely on fixed safety zones sized for worst-case velocity, which wastes airspace in congested scenarios. We replace the fixed radius with an adaptive, speed-dependent safety sphere whose size scales with braking distance: tight at low speeds, expanded at high speeds. We develop both a centralized model predictive control (MPC) formulation and a distributed MPC (DMPC) in which each drone optimizes locally from detected neighbors, accommodating mixed fleets with non-cooperative agents. We prove feasibility up to the geometric packing limit evaluated at the minimum radius, establish Lyapunov stability under sufficient conditions on the adaptation parameter, drone density, and prediction horizon, and extend these guarantees to the distributed setting via a contraction condition that preserves the centralized stability margins. We further derive modified sphere-packing capacity bounds and a throughput-optimal crossing speed for narrow passages. Simulations confirm that the adaptive framework remains feasible where fixed-radius methods fail: it roughly doubles the admissible drone count, reduces traversal time through constrained passages by about 25 percent, and enables passage through openings impassable to static safety zones. The centralized variant realizes a larger fraction of the theoretical capacity, while the distributed variant offers a more realistic deployment model for mixed-fleet operations under the same safety guarantees.
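
A speed-dependent safety sphere that scales with braking distance can be sketched as follows. This assumes the textbook constant-deceleration braking distance v^2 / (2a) added to a minimum radius; the paper's actual adaptation law, and the parameter values `r_min` and `max_decel`, are not given in the abstract and are assumptions here.

```python
import math


def safety_radius(speed, r_min=0.5, max_decel=4.0):
    """Minimum radius plus braking distance v^2 / (2 * a) (assumed form, in metres)."""
    if speed < 0 or max_decel <= 0:
        raise ValueError("speed must be >= 0 and max_decel must be > 0")
    return r_min + speed ** 2 / (2.0 * max_decel)


def violates_separation(pos_a, speed_a, pos_b, speed_b):
    """True if the two drones' adaptive safety spheres overlap."""
    distance = math.dist(pos_a, pos_b)
    return distance < safety_radius(speed_a) + safety_radius(speed_b)


# Hovering drones may sit close together; the same gap is unsafe at speed.
slow = violates_separation((0.0, 0.0, 0.0), 0.0, (1.5, 0.0, 0.0), 0.0)
fast = violates_separation((0.0, 0.0, 0.0), 4.0, (1.5, 0.0, 0.0), 0.0)
```

With these assumed parameters a fixed radius sized for 4 m/s would be 2.5 m for every drone, whereas the adaptive radius shrinks to 0.5 m at hover, which is the source of the airspace savings the abstract describes.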

[AI-282] Bridging Multi-Valued Heuristics and Dimensionality Reduction in Multi-Objective Search

Link: https://arxiv.org/abs/2606.20644
Authors: Maya Wolff, Ariel Felner, Oren Salzman
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages, 6 figures. To appear in Proceedings of SoCS 2026

Abstract:Multi-objective shortest-path (MOSP) algorithms traditionally rely on single-valued heuristics (SVHs), which associate each state with a single admissible cost vector. While SVHs provide safe lower bounds, they fail to capture the trade-off structure of the Pareto frontier and often yield weak search guidance. Multi-valued heuristics (MVHs) address this limitation by mapping states to sets of cost estimates, enabling a richer approximation of possible trade-offs. Modern MOSP algorithms are highly dependent on dimensionality reduction (DR) techniques to efficiently perform dominance checks. However, integrating MVHs with DR introduces subtle correctness challenges. We show that naively combining DR with MVHs destroys the ordering invariants required for DR, leading to unsound and incomplete search. To address this issue, we develop the first theoretical frameworks for safely integrating MVHs with DR. First, we introduce NAMOA*dr-mvh, a theoretical baseline that restores search correctness by enforcing heuristic consistency. Recognizing the practical limitations of this approach, we then introduce our primary contribution, L-NAMOA*dr-mvh. This algorithm employs a “lazy,” optimistic approach to DR, preserving exact correctness with only an admissible MVH by dynamically detecting and repairing local ordering violations. Across a range of benchmarks, L-NAMOA*dr-mvh matches or improves over state-of-the-art MOSP algorithms, and achieves speedups of over 10x in instances where the additional guidance provided by the MVH translates into stronger pruning.
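
For readers unfamiliar with the dominance checks this abstract refers to, the following sketch shows plain Pareto dominance between cost vectors and a naive quadratic filter for the non-dominated set. It is a generic illustration only: it does not implement dimensionality reduction or either of the paper's algorithms, and the cost vectors are made up.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(costs):
    """Keep the cost vectors that no other vector dominates (naive O(n^2) check)."""
    return [c for c in costs if not any(dominates(other, c) for other in costs)]


# Hypothetical two-objective path costs, e.g. (time, risk)
costs = [(1, 5), (2, 2), (3, 3), (5, 1)]
front = pareto_front(costs)  # (3, 3) is dominated by (2, 2)
```

Dimensionality reduction techniques exist precisely because this pairwise check becomes the bottleneck when the search keeps many labels per state.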

[AI-283] Hypothesis-Disciplined Multi-Agent Automated Formalization of Asymptotic Statistical Theory

Link: https://arxiv.org/abs/2606.20642
Authors: Tingzhou Wei, Zeyu Zheng, Ethan X. Fang, Junwei Lu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Statistics Theory (math.ST)
Comments:

Abstract:Asymptotic statistical theory is a challenging domain for AI-assisted formalization: its central results mix convergence statements, asymptotic expansions, functional analysis, and regularity conditions that have a large gap from existing infrastructure in Lean 4 formalization. To address these challenges, we propose a hypothesis-disciplined Lean 4 formalization pipeline built from multiple agents: a manager that coordinates seven specialist roles for proof planning, skeleton scaffolding, Mathlib reconnaissance, proof construction, integration, independent review, and audit. The main methodological discipline is the hypothesis-disciplined audit, implemented by the Auditor agent: every main-theorem hypothesis and concept-layer field must be anchored in the source mathematical prose, justified as a Lean encoding adapter, marked as source-implied, or rejected as an unsupported strengthening. Using this workflow, we build a systematic formalization of asymptotic statistical theory, especially the parametric and semi-parametric models’ asymptotic distribution and efficiency results. The resulting Lean development is axiom-clean and source-faithful, with Lean-checked and human-audited proofs of core parametric and semi-parametric theorems organized so that theorem-agnostic infrastructure and statistical concept definitions are separated from theorem-specific assembly. The formalization results are available at this https URL.

[AI-284] MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

Link: https://arxiv.org/abs/2606.20641
Authors: Letian Chen, Yiren Lu, Justin Fu, Yichen Xie, Runsheng Xu, Jyh-Jing Hwang, Ben Sapp, Drago Anguelov
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token prediction objective merely encourages per-token imitation in text, often irrespective of multi-step consequences and the alignment with crucial planning considerations such as giving space to other road actors. To overcome these limitations, we propose a reinforcement learning fine-tuning (RLFT) approach, MAGNIFIED, that aligns the MLLM-based driving agent with planning objectives by learning from token-level rewards. By mapping a sequence of predicted tokens to corresponding vehicle trajectories and learning from planning rewards, MAGNIFIED optimizes for the true planning objectives rather than focusing solely on token prediction accuracy, enabling the model to refine its understanding of the planning task beyond simple imitation. We validate our approach on the Waymo Open Motion Dataset with a novel setup incorporating rasterized birds-eye views and tokenized trajectories as inputs and planning-oriented outputs. An initial SFT phase establishes a strong baseline in outputting plan trajectories as sequences of X-Y coordinates in text, while subsequent RL fine-tuning substantially enhances planning performance relative to the SFT baseline (demonstrating over a 10.5% reduction in overlap rate and a 38.9% reduction in off-road rate), underscoring the potential of RLFT on MLLMs to achieve vehicle planning that is better aligned with compliant, comfortable, and efficient driving.

[AI-285] An LLM-Explainable DRL Framework for Passenger-Directed Autonomous Driving

Link: https://arxiv.org/abs/2606.20640
Authors: Ouided Braoui, Meriem Bouali, Nadir Farhi
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Autonomous vehicles offer the potential for safer and more efficient mobility, yet public trust remains limited due to the lack of transparency in their decision-making. This work addresses this issue by combining deep reinforcement learning (DRL) for adaptive driving control with large language model (LLM)-based explainability modules designed to communicate agent behavior to passengers. DRL agents were trained in simulation using a Dueling Double Deep Q-Network to follow distinct driving requests: fast, comfort, and stop. They demonstrated stable learning, safe compliance with traffic rules, and reliable switching between modes within a single trip. In parallel, LLM modules were introduced to interpret passenger requests, determine when explanations were needed, and generate concise, safety-oriented justifications. Results show that this framework, serving as a proof of concept for integrating RL decision-making and LLMs, balances safety, adaptability, and explainability, and is most effective when requests are delayed or overridden due to safety constraints.

[AI-286] RIZZ: Routing Interactions to Near Zero-Interference Zones for Continual Adaptation of Black-Box Agents

Link: https://arxiv.org/abs/2606.20638
Authors: Sonali Goel, Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Maike Osborne
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 2 figures

Abstract:Large language models are increasingly deployed as long-lived agents that must adapt across users, tasks, domains, modalities, and feedback regimes without access to model weights. Existing black-box adaptation methods typically optimize a single prompt, maintain an undifferentiated memory, or rely on repeated rollout-heavy search. However, these designs struggle when streams of input are nonstationary, feedback is sparse, and failures from one task family can contaminate behavior on another. We introduce RIZZ (Routing Interactions to Near Zero-interference Zones), a continual adaptation framework for compound language-model systems that learns entirely through verifier-gated memory, routing, and prompt compilation. RIZZ organizes input streams into dynamically spawned memory branches. At inference time, either while online or offline, a context-aware router selects or creates a branch that retrieves branch-local, global, graph-structured, and working-memory context, which is compiled into a bounded prompt together with retrieved task evidence. After the model acts, task verifiers score the output, and only verified interactions can update memory, promote reusable rules, demote harmful rules, or create anti-patterns. This yields a black-box agent that improves through persistent natural-language feedback while explicitly controlling interference. RIZZ targets the regime where adaptation must occur online under context budgets. Finally, we demonstrate the effectiveness of our framework against state-of-the-art baselines on competitive benchmarks.

[AI-287] Constituency Optimisation Through Hamiltonian Representation Of Mandates (COTHROM): Algorithmic Redistricting of Irish Election Boundaries

Link: https://arxiv.org/abs/2606.20637
Authors: Ruaidhrí Campion, Matthew Fenlon, Joshua Cooney Mercedal, Casey Farren-Colloty, Eliza Somerville, Michael A.J. Mitchell
Categories: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
Comments: 35 pages, 16 figures

Abstract:Electoral redistricting in Ireland’s Proportional Representation Single Transferable Vote (PR-STV) system faces the challenge of selecting an optimally representative set of electoral boundaries from an enormous set of possible configurations, where “representative” is a delicate balance of constitutional objectives that are often in tension with one another. We present the first computational framework for Irish electoral redistricting that systematically optimises across multiple constitutional requirements while making trade-offs explicit and quantifiable. The electoral redistricting problem is parsed using statistical physics, where constitutional objectives are considered as terms in a Potts Hamiltonian. Markov Chain Monte Carlo (MCMC) methods and simulated annealing are employed to minimise this objective function, systematically exploring this configuration space, with coupling constants as proxies for objective weightings. Multi Criterion Decision Analysis (MCDA) and Pareto Optimality are then utilised to remedy the ambiguity in choosing a certain objective weighting combination over others. With respect to proportional representation and compactness objectives evaluated in County Cork, COTHROM consistently improves on the existing legal constituency boundaries for a range of objective weightings.
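
The general idea of treating redistricting objectives as weighted terms in a Potts-style energy and minimizing it by simulated annealing can be sketched on a toy problem. Everything below is an assumption made for illustration: the two energy terms (population imbalance and cut edges as a compactness proxy), the weights `w_pop` and `w_cut`, the linear cooling schedule, and the six-unit path graph. It is not the paper's Hamiltonian, and it does not enforce district contiguity.

```python
import math
import random


def energy(assign, pops, edges, k, w_pop=1.0, w_cut=1.0):
    """Toy Potts-style energy: weighted population imbalance plus cut-edge count."""
    totals = [0.0] * k
    for unit, district in enumerate(assign):
        totals[district] += pops[unit]
    target = sum(pops) / k
    imbalance = sum((t - target) ** 2 for t in totals)
    # An edge is cut when its endpoints carry different district labels.
    cut = sum(1 for u, v in edges if assign[u] != assign[v])
    return w_pop * imbalance + w_cut * cut


def anneal(pops, edges, k, steps=2000, t0=5.0, seed=0):
    """Single-unit relabelling moves accepted with the Metropolis rule."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in pops]
    e = energy(assign, pops, edges, k)
    best, best_e = assign[:], e
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling, kept positive
        unit = rng.randrange(len(pops))
        old = assign[unit]
        assign[unit] = rng.randrange(k)
        new_e = energy(assign, pops, edges, k)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_e <= e or rng.random() < math.exp((e - new_e) / temp):
            e = new_e
            if e < best_e:
                best, best_e = assign[:], e
        else:
            assign[unit] = old  # reject the move
    return best, best_e


# Hypothetical toy: six equal-population units on a path, split into two districts
pops = [10, 10, 10, 10, 10, 10]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
best, best_e = anneal(pops, edges, k=2)
```

Changing `w_pop` and `w_cut` shifts which trade-off the annealer favours, which is the role the abstract assigns to the coupling constants.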

[AI-288] Measuring the Occupation-Level Impact of AbbVie Intelligence: AI Applicability Analysis 2024-2025

Link: https://arxiv.org/abs/2606.20635
Authors: John Regan, Jon Stevens, Brian Martin
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 24 pages, 10 figures, 5 tables, 4 GenAI prompts

Abstract:This paper presents an empirical analysis of AbbVie Intelligence’s measurable impact on employee work activities across 192 distinct occupations in 2024 and 2025. Drawing on 598,744 de-identified AI conversations classified according to the O*NET Intermediate Work Activity (IWA) taxonomy, we compute occupation-level AI Applicability Scores that quantify the extent to which AI tools can meaningfully assist or automate real work at scale. Three convergent analyses are conducted: (1) longitudinal year-over-year trends from 2024 to 2025, (2) a quasi-experimental pre-post evaluation of the AbbVie Intelligence version 3 platform release in August 2025, and (3) a pre-post evaluation of the AbbVie AI Learning Summit held in November 2025. Results demonstrate statistically significant improvements across all three dimensions. Mean AI Applicability Scores rose substantially from 2024 to 2025; the platform release produced a +10.0% gain (p < 0.001); and the AI Learning Summit produced a +6.68% gain (p < 0.001). These findings establish that both technological platform enhancements and structured enterprise AI education programs independently and substantially expand the reach of AI across the AbbVie workforce.

[AI-289] DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency

Link: https://arxiv.org/abs/2606.20634
Authors: Oleg Solozobov
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 41 pages, 8 tables, no figures. Benchmark and dataset paper. Dataset (CC-BY-4.0) and code (Apache-2.0): Zenodo doi: https://doi.org/10.5281/zenodo.20426092 and Hugging Face this https URL code at this https URL

Abstract:Agent-runtime systems emit traces, ledgers, provenance graphs, policy logs, delegation tokens, cache events, and tool-firewall records, but those containers do not necessarily answer governance questions about a specific decision. DEMM-Bench is a cross-regime benchmark for agent-runtime governance-evidence sufficiency, grounded in the Decision Evidence Maturity Model (DEMM): it measures whether records across eight evidence regimes are sufficient to reconstruct decision-level properties rather than merely present. The benchmark normalizes the regimes through adapters, asks property questions over actor, authority, action, policy, decision basis, resource touch, lifecycle context, and verification strength, and applies eight deterministic degradation conditions. Across 64 manuscript cases, trace-present and schema-present baselines overclaim on 75% of cases, ledger-present overclaims on 50%, and the redacted property-level candidate scorer has zero overclaim with 56.25% mean Property Sufficiency Accuracy. The deposited package provides the 64-case dataset, construction-oracle labels, baselines, and adapters, supporting reproducible evaluation of decision-evidence maturity across heterogeneous agent-runtime evidence substrates.

[AI-290] Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents

Link: https://arxiv.org/abs/2606.20631
Authors: Boming Xia, Liming Zhu, Zhenchang Xing, Qinghua Lu, Dino Sejdinovic, Xiwei Xu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Agent skills externalise reusable agent-facing behavioural knowledge and guidance as persistent artefacts that can be discovered, activated, and interpreted by LLM agents. Although a skill artefact is static at rest, its architectural responsibilities arise in use, when the artefact is selected for a run, bound to context and authority constraints, interpreted by a stochastic agent, and recorded as run evidence. We call this run-specific relation skill-in-use. This paper studies agent skill harnessing: the architectural responsibilities that govern the transition from skill artefacts to skill-in-use, bound the executable consequences associated with skill-in-use, and capture evidence for attribution, verification, repair, and evolution. This paper provides a catalogue of ten empirically grounded architectural patterns (five core, five supporting) for skill harnessing and synthesises them into a reference architecture with four responsibility layers: Supply Chain, Mediation, Execution Control, and Evidence Feedback. We evaluate the architecture through cross-instantiation across 8 selected systems. The resulting patterns and reference architecture provide a vocabulary and diagnostic frame for analysing skill-harnessing responsibilities across agent systems.

[AI-291] Human Decision-Making with AI Assistance under Correlated Features

Link: https://arxiv.org/abs/2606.20628
Authors: Yanru Guan, Naveen Raman, Fei Fang
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Humans increasingly make decisions with AI assistance; for example, doctors may follow AI-recommended diagnostic tests and base their diagnoses on the results. A natural question is which tests the AI should recommend to balance short-term decision quality and long-term human learning when different features (e.g., test results) are correlated. While prior work establishes that stationary policies that recommend the same tests repeatedly are optimal when features are independent, we prove that feature correlations lead such policies to perform arbitrarily poorly. Instead, we prove that any optimal policy must follow an explore-then-commit structure: initially, the AI should offer diverse tests so humans can learn accurate feature coefficients, then the AI should commit to a single set of tests, with an exploration length that depends on the degree of feature correlation. We prove that computing the optimal policy is NP-hard and derive a dynamic programming-based algorithm that finds the optimal policy for finite horizons. We additionally develop an approximation that plans for shorter horizons and appends a stationary suffix, achieving near-optimal performance. Our empirical results complement our theory by showing that stronger feature correlation leads to longer exploration phases.

[AI-292] Latent Goal Prediction from Language for Model-Based Planning

Link: https://arxiv.org/abs/2606.20627
Authors: Samuel Barbeau, Simon Roy, Giovanni Beltrame, Christian Desrosiers, Nicolas Thome
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages. Preprint under review

Abstract:Planning with world models is bottlenecked by compounding prediction errors and the difficulty of defining optimizable goals. Visual targets provide precise local gradients but poor distant guidance, while language is flexible yet limited by noisy cross-modal alignment or dependence on large generative models unsuited for the high-sampling nature of model-based planning. To address these challenges, we introduce Latent Goal Prediction from Language (LAGO), a framework that predicts both sequences of intermediate goal states from language instructions and action-conditioned rollouts, all within the same latent space. Rather than optimizing toward a single global objective, LAGO dynamically decomposes instructions into explicitly predicted, locally tractable latent subgoals. By updating these subgoals online and using a soft minimum trajectory cost during planning, LAGO enables an agent to follow coherent latent trajectories over long horizons. Evaluation across multiple environments and planning horizons shows that LAGO avoids the sharp degradation of prior methods. By achieving robust and precise long-horizon planning purely from language, LAGO bridges the precision of visual goals with the flexibility of text-guided control.

[AI-293] Efficient Safety Benchmarking via Item Response Theory

Link: https://arxiv.org/abs/2606.20626
Authors: Fabio Spagliardi, Mírian Silva, Ayan Datta, Aiden Zhou, Vamshi Bonagiri, Diogo Cruz
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Safety benchmarks for language models are typically evaluated using static paradigms that treat all items as equally informative for all models, an assumption that is particularly problematic for adversarial, highly heterogeneous safety items. Applied in full to modern benchmark suites, the current evaluation procedures would require on the order of $10^5$ responses, most of which provide little ranking signal. We analyze a suite of widely used safety benchmarks and make three contributions toward more efficient safety evaluation. First, we show that Item Response Theory (IRT) recovers interpretable structure on safety benchmarks, with ability estimates resolving differences among models that cluster at the ceiling of raw safety metrics. Second, we show that adaptive item selection, which dynamically chooses informative items for each model based on its responses, approximates full-benchmark rankings while reducing evaluation cost by at least 80% on benchmarks where a Spearman's $\rho \geq 90\%$ with the full-benchmark ranking is attainable, and by up to 99.9% on AIR-Bench 2024. Third, we introduce a practical procedure for extracting a fixed, informative subset of items reusable across models, providing an alternative to adaptive selection with savings of up to 99.8% on AIR-Bench 2024. Together, these results establish that psychometric methods enable benchmark-aware reductions in evaluation costs across the safety evaluation pipeline.
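
To make the adaptive item selection idea concrete, here is a minimal sketch under stated assumptions. It uses the textbook two-parameter logistic (2PL) IRT model and picks the unused item with the highest Fisher information at the current ability estimate. The abstract does not say which IRT variant or selection rule the authors use, so the 2PL choice, the grid-search estimator, and all function names below are illustrative assumptions rather than the paper's procedure.

```python
import math

def p_pass(theta, a, b):
    """2PL probability that a model with ability theta passes an item
    with discrimination a and difficulty b (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information a 2PL item carries about theta: a^2 * p * (1 - p)."""
    p = p_pass(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_ability(responses, grid=None):
    """Grid-search maximum-likelihood estimate of theta.
    responses: list of ((a, b), outcome) pairs, outcome in {0, 1}."""
    grid = grid or [x / 10.0 for x in range(-40, 41)]

    def log_lik(theta):
        total = 0.0
        for (a, b), y in responses:
            p = p_pass(theta, a, b)
            total += math.log(p) if y else math.log(1.0 - p)
        return total

    return max(grid, key=log_lik)

def next_item(theta, items, used):
    """Index of the not-yet-used item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))
```

A typical loop would alternate `next_item` and `estimate_ability` until a budget or a precision target is hit. Items whose difficulty sits far from the current ability estimate carry little information, which is the intuition behind skipping most of a benchmark.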

[AI-294] Darwin Mobile Agent: A Roadmap for Self-Evolution

Link: https://arxiv.org/abs/2606.20622
Authors: Daniel Beechey, Derek Yuen, Jianheng Liu, Dezhao Luo, Tiantian He, Weilin Luo, Jun Wang, Kun Shao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Technical Report

Abstract:The goal of artificial intelligence is to create agents capable of general, adaptive behaviour in open-ended environments. Guided by the “Bitter Lesson”, we argue that the most effective path toward this goal is to systematically remove human priors and allow intelligence to naturally emerge through interaction with a “Big World” that is orders of magnitude more complex than the agent itself. We propose the mobile Graphical User Interface (GUI) as a practical proxy for such a world and introduce Darwin Mobile Agent, an open-source infrastructure designed as a foundation for autonomous reinforcement learning in this domain. This framework addresses the data-collection bottleneck in real-world mobile interactions by using an asynchronous agent-environment loop across parallel cloud-phone instances. We further propose a conceptual roadmap to systematically remove human priors from three fundamental pillars of a self-evolving agent: task curricula, outcome verification, and memory management. We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap: policy optimisation in the GUI domain. This work establishes the practical and theoretical foundation necessary to move toward truly autonomous, self-evolving GUI agents.

[AI-295] Signals in the Noise: Open Source Intelligence (OSINT) for AI Loss of Control Detection

Link: https://arxiv.org/abs/2606.20610
Authors: Sarah Bollinger, Nada Aboserie, Amanda Coakley, Chih-Hsuan Lee, Taysir Mathlouthi
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper applies open-source intelligence (OSINT) and cyber threat intelligence (CTI) methodologies to the problem of detecting AI systems operating outside human control. Drawing on a cross-disciplinary literature review and 14 semi-structured expert interviews conducted under Chatham House Rule, the paper develops two threat models, identifies a range of observable traces, and proposes an institutional architecture for monitoring. The research finds that OSINT-based detection of loss of control is partially feasible and worth building now. Three detection vectors emerge as highest priority: transcript-based collection of user-reported AI behaviour; infrastructure correlation for unexpected external connections or replication; and output analysis for capability concealment. The paper argues for a dedicated, federated international monitoring capability anchored in OSINT methods and independent of frontier AI developers, and identifies sustained non-industry funding as the highest-leverage structural intervention available.

[AI-296] Understanding Privacy by Formalizing It

Link: https://arxiv.org/abs/2606.20609
Authors: Réka Markovich, Truls Pedersen, Marija Slavkovik
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Published in Proceedings of the 17th International Conference on Juris-informatics, JURISIN 2023

Abstract:In most modern societies, there is a broad consensus regarding the need for promoting privacy and thus placing restrictions on technological developments, including AI, to protect people’s right to privacy. In order to meet these expectations on the algorithmic level, we first need to formally specify the concept of privacy and the related or derived rights. However, the notion of (the right to) privacy is subject to different interpretations. In this paper, we use a multi-modal logic to provide an initial formalization of different theories, basic principles and their implications, investigating the right to privacy as an epistemic right within the theory of normative positions.

[AI-297] Trust in Generative AI for Health Information Consumption and the Effect of Learned Dependency: An Experimental Study

Link: https://arxiv.org/abs/2606.20605
Authors: Arif Ahmed, Gondy Leroy, Agrim Sachdeva, Philip Harber, Stephen A. Rains, Seokjun Youn, Prosanta Barai
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative artificial intelligence is increasingly used for health information, but inaccurate outputs raise concerns about trust calibration and overreliance. This study examines whether learned dependency on generative artificial intelligence affects trust in AI-generated health information and whether visual attention cues reduce overtrust in incorrect outputs. We conducted a randomized 2 by 2 experiment with 338 participants, manipulating information accuracy and visual attention cues. Trust and dependency were measured using survey scales, and linear regression models tested main and interaction effects. Information accuracy increased trust, and learned dependency was positively associated with trust. The interaction between accuracy and dependency was significant, indicating weaker trust calibration among highly dependent users. Visual attention cues did not significantly affect trust or moderate the effect of dependency. The findings suggest that learned dependency weakens trust calibration and increases susceptibility to incorrect AI-generated health information.

[AI-298] The New Associationism: Lessons from Deep Learning

Link: https://arxiv.org/abs/2606.20600
Authors: Daniel Rothschild
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:What can the success of modern AI tell us about how humans learn? This paper argues that taking AI seriously as a model of human learning supports a modest but genuine associationism. The central finding is that supervised learning – learning driven by evaluative feedback – underlies a surprisingly wide range of contemporary AI systems, from large language models to game-playing agents, differing primarily in how much work is required to generate the relevant feedback signal. This vindicates associationist ideals of a uniform, gradual, error-driven learning mechanism operating across domains, and defuses the once-influential argument that associationist mechanisms are too limited to account for human cognitive capacities. At the same time, the successes of deep learning depend on computational architectures that go well beyond anything classical associationists envisaged, and supervised learning operates within these as one component rather than a complete account of learning.

[AI-299] Beyond Fixed Budgets: Characterizing the Inelasticity and Limitations of Tree-of-Thought Reasoning Strategies

Link: https://arxiv.org/abs/2606.20599
Authors: Atkia Mahila, Avinash Maurya, M. Mustafa Rafique, Bogdan Nicolae
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Flexscience’26: ACM HPDC workshop

Abstract:Tree of Thought (ToT) search has become a promising direction for improving the reasoning capabilities of large language models, but deploying these methods in practice raises a question that has received little systematic attention: how do different search strategies behave under varying compute budgets, model sizes, and problem difficulties? In this work, we evaluate two representative ToT methods: DPTS, a Monte Carlo tree search based approach, and SSDP, a semantic deduplication based approach, across two mathematical reasoning benchmarks (Math500 and GSM8K), two model scales (Llama-3B and Llama-8B), and four token budgets (3k–10k). Our analysis reveals that the two methods exhibit limitations that pull in opposite directions. DPTS suffers from a cold-start bottleneck at low budgets: it requires sufficient exploration before its value estimates become reliable, making it a poor fit for resource-constrained settings despite strong scaling behavior at higher budgets. SSDP, on the other hand, reaches candidate solutions efficiently but is prone to frontier depletion: its aggressive node merging permanently discards unexplored paths, leaving it unable to improve regardless of how much budget remains. Together, these findings suggest that neither a fixed exploration strategy nor a fixed pruning strategy is sufficient across the compute continuum. We argue that effective search for scientific reasoning agents requires strategies that can adapt their behavior based on search progress and available resources.

[AI-300] Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference

Link: https://arxiv.org/abs/2606.20591
Authors: Kangkang Sun, Jianhua Li, Xiuzhen Chen, Junyi He, Minyi Guo
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures, submitted to an IEEE journal for possible publication

Abstract:Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens and a larger target model to verify them in parallel. In distributed edge-cloud inference, however, draft length must be controlled online: longer drafts amortize communication delay but reduce token acceptance, whereas shorter drafts preserve acceptance but trigger more communication rounds. We formulate this tradeoff as a ratio-type optimal stopping problem and prove that the optimal draft length is a finite delay-monotone threshold. The analysis identifies a critical delay below which single-token speculation is optimal and shows that the optimal length grows only logarithmically with communication delay. For time-varying networks, we extend the model to Markov-modulated channels and establish, under a bounded horizon and monotone stopping-region conditions, a state-dependent threshold policy. For unknown environments, we propose UCB-SpecStop, an online control algorithm with gap-free and gap-dependent expected regret bounds of $O\big(L_{\max}\sqrt{K_{\max} T \log(K_{\max} T)}\big)$ and $O\big(\sum_{k:\Delta_k>0} L_{\max}^2 \log(K_{\max} T)/\Delta_k\big)$. We implement the method on a real edge-cloud testbed with a Jetson Orin Nano Super edge node and an RTX 3090 Ti cloud node, using Qwen and Llama draft–target pairs. Experiments validate the predicted phase transition, with transition points near 83 ms and 111 ms. Qwen matches the geometric prediction, while Llama requires empirical-prefix calibration due to heavy-head acceptance. Across the tested delay grid, UCB-SpecStop reduces per-token latency over SpecDec++ by up to 22.4%, approaches an offline oracle within 0.2–2.4% in communication-dominated regimes, improves over naive UCB by up to 7.5%, removes the 14.0–18.7% gap caused by static tuning under delay drift, and gains 3.0–6.8% with contextual channel-state information.
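
The draft-length tradeoff described above can be illustrated with a small sketch. It assumes a simplified cost model that is not taken from the paper: each drafted token is accepted independently with probability `alpha` (the usual geometric acceptance assumption), a round costs `L * t_draft + t_verify + delay`, and the controller picks the draft length `L` that minimizes expected time per emitted token. All parameter names and values are hypothetical.

```python
def expected_tokens(alpha, L):
    """Expected tokens emitted per round under i.i.d. acceptance:
    the accepted draft prefix plus one token supplied by the target model."""
    return (1.0 - alpha ** (L + 1)) / (1.0 - alpha)

def per_token_latency(L, alpha, t_draft, t_verify, delay):
    """Ratio objective: time spent in one round over tokens gained in it."""
    round_time = L * t_draft + t_verify + delay
    return round_time / expected_tokens(alpha, L)

def best_draft_length(alpha, t_draft, t_verify, delay, L_max=32):
    """Brute-force minimizer of the ratio objective over 1..L_max."""
    return min(
        range(1, L_max + 1),
        key=lambda L: per_token_latency(L, alpha, t_draft, t_verify, delay),
    )

if __name__ == "__main__":
    # Hypothetical timings in milliseconds; the delay is the only thing varied.
    for delay in (0.0, 20.0, 50.0, 100.0, 200.0):
        print(delay, best_draft_length(0.8, 5.0, 20.0, delay))
```

In this toy model the chosen draft length never decreases as the delay grows, which matches the delay-monotone behavior the abstract describes. The sketch says nothing about the logarithmic growth rate or the Markov-modulated and bandit extensions.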

[AI-301] Optimization-as-a-Service via Multi-Agent Large Language Model for Radio Access Networks

Link: https://arxiv.org/abs/2606.20590
Authors: Chaoqun You, Yueyue Dai, Xingqiu He, Yue Gao, Rahim Tafazolli, Yong Liang Guan
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:The physical resource block (PRB) allocation in Radio Access Networks (RANs) traditionally relies on case-by-case manual problem construction or, more recently, learning-based artificial intelligence (AI) methods. However, the sixth-generation (6G) RAN environments confront unprecedented service diversity and exponential dynamics, featuring volatile fluctuations in active base stations (BSs), user scale, and stringent Quality-of-Service (QoS) requirements. Faced with such conditions, both manual models and standard AI algorithms remain fundamentally rigid, lacking the flexibility to adapt and self-evolve. To provide a one-size-fits-all solution, we propose treating the PRB allocation problem as an Optimization-as-a-Service (OaaS) provided by a large language model multi-agent (LLM-MA) system. This fundamentally reshapes RAN resource allocation by utilizing agents to dynamically construct optimization problems and automatically determine objectives tailored to real-time scenarios. Our closed-loop architecture, integrating scene understanding, objective generation, solver, and reflection agents, enables context-aware, self-correcting formulation. To eliminate the computational latency of iterative reflection, we introduce a one-shot reflection distillation mechanism, training a lightweight student model to directly predict refined objective parameters. We theoretically bound the performance gap of this one-shot policy. Experimental results demonstrate our framework achieves near-optimal resource allocation with ultra-low inference latency.

[AI-302] Protocol-Aware Tokenization and Architecture Co-Design for Wireless Packet Foundation Models

Link: https://arxiv.org/abs/2606.20587
Authors: Swadhin Pradhan, Shazal Irshad, Jerome Henry
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 3 tables

Abstract:What matters more for building foundation models for wireless packet traces: the tokenizer or the architecture or both? To answer this question, we build on PLUME Anonymous [2026], which introduced protocol-aware tokenization for 802.11 traces; we scale model depth and transfer the same tokenizer to a fundamentally different architecture family. A deeper GPT (PLUME-DEEP, 24 layers) reaches 98.2% top-1 accuracy, gaining 32 points over the original 12-layer design, while a Mamba-2 state-space variant (PLUME-MAMBA) achieves 96.1% with 1.7x higher throughput and 2x longer context. The key insight emerges from a controlled 2x2 comparison across tokenizers and architectures: changing the tokenizer swings accuracy by 32 points; changing the architecture moves it by only 2. Protocol-aware tokenization is the primary performance lever, and the backbone becomes a deployment knob trading accuracy for speed.

[AI-303] Physical-AI: From Channel Awareness to Environmental Intelligence in 6G Wireless Networks

Link: https://arxiv.org/abs/2606.20583
Authors: Farooque Hassan Kumbhar, Kapal Dev, Sunder Ali Khowaja, Alexandros-Apostolos A. Boulogeorgos, Mehdi Bennis, Yuanwei Liu
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Conventional wireless networks rely on instantaneous channel state information (CSI) and react to channel variations without explicitly modeling the physical environment, limiting their ability to handle blockage, mobility, and interference in dynamic deployments. Paradigms such as Integrated Sensing and Communication (ISAC) add sensing capabilities but lack explicit environment modeling and decision-making. In this article, we propose Physical-AI: a new architecture for environment-aware wireless networking, where radio signals enable sensing, modeling, and interaction with the environment in addition to data transmission. The framework proposes a self-supervised spatiotemporal radio foundation model for transforming distributed radio observations into a shared latent environmental representation. Multiple inference heads operate on this representation to estimate key environmental properties, including blockage, user distribution, mobility dynamics, and interference structure. A task-specific neural decision layer maps this representation to proactive, context-aware control actions. By integrating perception, world modeling, and decision-making in a closed loop, the proposed framework goes beyond ISAC and establishes Physical-AI as a promising architecture for intelligent 6G systems. Simulation results show that the proposed predictive framework reduces outage probability and blockage-response latency, particularly under increasing beam-switching delays.

[AI-304] Role-Based Agentic AI for Intent-Driven Network and Service Orchestration

Link: https://arxiv.org/abs/2606.20580
Authors: Juan Parra-Ullauri, Talha Ahmed Khan, Daniel McHugh, Shipra Kapoor, Alistair Duke, Alicia Hey, Andy Corston-Petrie
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Telecommunication networks are increasingly complex due to heterogeneous technologies, diverse service requirements, and growing demands for resource efficiency and business agility. Intent-Based Networking (IBN) and, more recently, agentic AI have emerged as promising paradigms to address this complexity through autonomous network management. However, existing approaches primarily focus on operational orchestration within Operations Support Systems (OSS) and lack an integrated framework that spans Business Support Systems (BSS) and OSS, limiting the realisation of true intent-to-business-to-network coordination. This paper presents a role-based multi-agent architecture (MAS) for end-to-end intent orchestration that mirrors Communication Service Provider (CSP) organisational structures. The proposed framework applies principles of functional decomposition, explicit task ownership, privacy-preserving domain separation, and domain-specific expertise within a hierarchical four-layer agent system spanning customer engagement, strategic planning, service delivery, and infrastructure provisioning. Leadership agents coordinate planning activities, whilst specialised service and resource agents are dynamically instantiated according to intent requirements. A proof-of-concept implementation demonstrates the feasibility of bridging the BSS-OSS divide through structured agent coordination, illustrating how agentic MAS can support accountable and scalable intent-driven service orchestration.

[AI-305] Human-Less LLM Serving: Quantifying the Human Tax on Throughput

Link: https://arxiv.org/abs/2606.20577
Authors: Jianhui Lian, Li Chen, Dan Li, Yong Jiang
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 10 pages, 2 figures, 1 table. Preliminary work

Abstract:Every major LLM serving system is designed to meet TTFT and TPOT SLOs. These metrics capture latency as a human user perceives it, and the mechanisms built to satisfy them are now standard infrastructure. We observe that long-horizon AI tasks call LLMs programmatically in tight loops where no human observes TTFT or TPOT. We ask: how much throughput do serving systems sacrifice to meet TTFT and TPOT SLAs that these workloads never need? We conduct a systematic measurement study across chunk sizes, SLO settings, context lengths, and concurrency levels. We find that the human tax on throughput grows substantially with context length and lands in the 60-93% range. At 64K token contexts, tightening the TTFT SLO to production-typical settings costs a large fraction of throughput versus the human-less baseline. The human tax is larger at higher concurrency and is qualitatively similar across SGLang and Sarathi-Serve. We term the unconstrained optimum human-less serving and provide a prototype demonstrating that it is practical on real workloads. Our findings argue that serving systems should expose workload-class-aware SLA configurations rather than silently applying the human tax uniformly to all traffic.

[AI-306] LLM-assisted gNB Parameter Configuration for Radio Access Network

Link: https://arxiv.org/abs/2606.20574
Authors: Yao-Cong Dong, Maria Amparo Canaveras Galdon, Ari Uskudar, Kuntal Chowdhury, Edwin K. P. Chong, Ray-Guang Cheng
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:gNB parameter misconfigurations are a common cause of system failures in radio access networks (RANs), and their diagnosis and correction rely on manual analysis of complex network logs that does not scale well. This paper proposes a large language model (LLM)-assisted framework for automatic gNB parameter configuration. The framework adopts a synthetic data generation pipeline following a configuration, log, correction workflow. Starting from a workable configuration and the gNB technical references, the pipeline uses a commercial LLM to generate modified configurations and derive structured reasoning traces from gNB error logs. The synthetic training data maps network states to corrective actions and is used to fine-tune an LLM for configuration correction. During inference, the fine-tuned LLM generates valid and deployable gNB parameter configurations from gNB error logs. The framework is validated on an OpenAirInterface (OAI) gNB testbed with 480 unseen misconfiguration scenarios, where fine-tuning improves correction accuracy from 13.8% (zero-shot baseline) to 85.4%, and retrieval-augmented generation (RAG) further improves accuracy to 92.7%. The results demonstrate that the framework may enable automated recovery from misconfigurations without manual intervention and support scalable and autonomous RAN operation.

[AI-307] AI-Native Network Controller: A Modular Framework for Safe Agentic Control of Multi-Domain Network Infrastructure

Link: https://arxiv.org/abs/2606.20565
Authors: Merim Dzaferagic
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:The convergence of multiple network domains, including radio access, optical transport, and core networks, under unified intelligent control is a fundamental requirement for future 6G systems. This is important because existing network controllers remain largely domain-specific, such as the O-RAN RIC for radio, or they lack native support for AI-driven automation across heterogeneous infrastructure. As a result, safe and coordinated agentic control of multi-domain networks is still an open challenge. In this paper, we present the AI-Native Network Controller (AI-NNC), an open-source and modular framework that enables agentic AI control across diverse network domains. The framework is designed around a protocol-agnostic architecture in which each physical device is integrated through a lightweight Python adapter, while control logic is implemented through domain-specific control applications. Beyond closed-loop control, the framework also supports dataset collection, agentic AI experimentation, and coordinated testbed operation using the same validated control and measurement interfaces. This design enables a safer paradigm for autonomous network management, where AI agents operate through validated applications rather than issuing commands directly to network equipment.

[AI-308] A Reproducible Semantic Benchmark for Multivendor DSM-to-CLI Translation

Link: https://arxiv.org/abs/2606.20564
Authors: Jerônimo Menezes, Leonardo Bitzki, Diego Kreutz, Gefte Almeida, Marcio Pohlmann, Rodrigo Mansilha
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 10 pages, including 8 tables and 3 figures, submitted to Workshop on Artificial Intelligence in Networks and Communications (WIARC at SBRC 2026)

Abstract:Translating high-level network intents into correct multivendor configurations remains a central challenge in network automation, as syntactically valid outputs may still violate the intended operational state. Despite recent advances in Large Language Models (LLMs), the field still lacks reproducible semantic benchmarks for rigorous cross-vendor evaluation. This paper presents a reproducible DSM-to-CLI semantic benchmark covering five cloud LLMs, three vendors, five representative use cases, and ten repeated runs per experimental cell under fixed judges and an explicit failure taxonomy. Our results show that semantic quality and operational reliability are orthogonal, vendor effects dominate use-case effects, and repeated-run dispersion strongly predicts vote instability, with Huawei VRP exposing failure modes hidden by aggregate metrics. These findings demonstrate that multivendor, repeated-execution semantic benchmarks are essential for scientifically rigorous comparison of LLM-based network configuration systems.

[AI-309] Ghost Vectors: Soft-Deleted Embeddings Remain Reconstructible in HNSW Vector Databases

Link: https://arxiv.org/abs/2606.18497
Authors: Chandranil Chakraborttii, Jackeline García Alvarado, Sitora Abdulofizova, Shivanshu Dwivedi
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures, 12 tables. Prepared for submission

Abstract:Retrieval-augmented generation (RAG) allows large language models to access external and private corpora for factual, domain-specific responses. Modern RAG pipelines use hierarchical navigable small world (HNSW) vector databases for efficient similarity search. When a user requests data deletion, the systems typically only mark the record as deleted, leaving the embedding on disk physically unchanged. This soft-delete operation raises compliance concerns under data-erasure and retention requirements such as GDPR Article 17 and HIPAA. Analysis of three HNSW implementations confirms that deleted vectors remain physically recoverable by accessing the raw index files at the storage layer, bypassing API access. Using the Vec2Text inversion model without domain-specific fine-tuning, we show this vulnerability on multiple real-world datasets and data modalities. On the Wikipedia biographical living persons (BLP) dataset, we successfully recover 25.5% of exact person names and 46.4% of geographic locations (ROUGE-L 0.185). Recovery reaches 100% for both patient age and gender markers (ROUGE-L 0.290) on highly structured, sensitive data (NIH Synthea dataset). On soft-deleted image embeddings, we show 100% tissue classification on histopathology patches (p=1.02e-07), and top-1 identity recovery reaches 99% on facial embeddings (p<0.01). This work introduces Epoch Key Rotation, which encrypts vectors and discards the key upon deletion. Epoch Key Rotation reduces observed PII recovery to 0% and completes in 2.5 ms for 500 deleted vectors (approximately 0.005 ms/record). Additionally, it generates an ECDSA-signed cryptographic proof as an auditable record of the deletion event.
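
The "encrypt, then discard the key on deletion" idea can be sketched as follows. This is a generic crypto-shredding toy, not the paper's Epoch Key Rotation: it uses one random key per record instead of per-epoch keys, omits the signed deletion proof, and its SHA-256 counter-mode keystream is for illustration only (a real system would use an authenticated cipher such as AES-GCM). The class and method names are hypothetical.

```python
import hashlib
import secrets
import struct

def _keystream(key, n):
    """Toy keystream: SHA-256 over key || counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + struct.pack(">Q", counter)).digest()
        counter += 1
    return out[:n]

def _xor(data, key):
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(x ^ y for x, y in zip(data, _keystream(key, len(data))))

class ShreddableStore:
    def __init__(self):
        self.blobs = {}  # record id -> ciphertext (stands in for index files on disk)
        self.keys = {}   # record id -> key (stands in for a separate key store)

    def put(self, rid, vector):
        key = secrets.token_bytes(32)  # fresh key per record, used once
        plain = struct.pack(f"<{len(vector)}f", *vector)  # float32 payload
        self.keys[rid] = key
        self.blobs[rid] = _xor(plain, key)

    def get(self, rid):
        key = self.keys[rid]  # raises KeyError once the key is shredded
        plain = _xor(self.blobs[rid], key)
        return list(struct.unpack(f"<{len(plain) // 4}f", plain))

    def delete(self, rid):
        # The ciphertext stays in place, as with a soft delete;
        # only the key is destroyed, so the bytes can no longer be decoded.
        del self.keys[rid]
```

After `delete`, the bytes left in `blobs` are still present on "disk" but cannot be turned back into an embedding without the discarded key, so an inversion model has nothing meaningful to work from.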

[AI-310] Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods NAACL2025

Link: https://arxiv.org/abs/2605.14240
Authors: Andrii Shportko, Inessa Verbitsky
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NAACL 2025

Abstract:The recent large-scale emergence of LLMs has left an open space for dealing with their consequences, such as plagiarism or the spread of false information on the Internet. Coupling this with the rise of AI detector bypassing tools, reliable machine-generated text detection is in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles using Random Forest classifiers. We discovered that Binoculars-inclusive ensembles yield the strongest results, but they also suffer the most significant losses during attacks. In this paper, we present the dichotomy of performance versus resilience in the world of AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.

[AI-311] BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset

Link: https://arxiv.org/abs/2505.10885
Authors: Istiaq Ahmed Fahad, Kamruzzaman Asif, Sifat Sikder
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 5 pages

Abstract: Deepfake audio detection is challenging for low-resource languages like Bengali due to limited datasets and subtle acoustic features. To address this, we introduce BanglaFake, a Bengali Deepfake Audio Dataset with 12,260 real and 13,260 deepfake utterances. Synthetic speech is generated using SOTA Text-to-Speech (TTS) models, ensuring high naturalness and quality. We evaluate the dataset through both qualitative and quantitative analyses. Mean Opinion Score (MOS) from 30 native speakers shows a Robust-MOS of 3.40 (naturalness) and 4.01 (intelligibility). t-SNE visualization of MFCCs highlights real vs. fake differentiation challenges. This dataset serves as a crucial resource for advancing deepfake detection in Bengali, addressing the limitations of low-resource language research.

[AI-312] Field-level weak lensing cosmology with 100 simulations using multifidelity simulation-based inference

Link: https://arxiv.org/abs/2606.23346
Authors: Alex A. Saoulis, Kiyam Lin, Niall Jeffrey, Maximilian von Wietersheim-Kramsta, Davide Piras, Alessio Spurio Mancini, Ana M. G. Ferreira, Benjamin Joachimi
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI)
Comments: 19 + 7 pages, 13 + 4 figures

Abstract: We perform a realistic KiDS-Legacy mock analysis with field-level neural compression and simulation-based inference using fewer than 100 N-body simulations. The weak lensing shear field encodes substantially more cosmological information than standard two-point summary statistics such as the power spectrum. Field-level inference can fully exploit this information, but physical realism at the field level requires very high-fidelity simulations. This poses a major challenge for simulation-based inference (SBI): accurate empirical density modelling and deep-learning-based neural compression require many training simulations, but achieving physical realism at the field level makes each simulation extremely costly. We demonstrate that multifidelity SBI can alleviate this tension by substantially reducing the number of high-fidelity simulations needed for accurate cosmological inference. We pre-train neural inference models on realistic KiDS-Legacy-like shear mocks using fast log-normal GLASS simulations and fine-tune them on a small set of high-fidelity N-body simulations. We show that between 60 and 100 high-fidelity simulations are sufficient to obtain informative and well-calibrated cosmological posteriors, enabling an order-of-magnitude reduction in simulation cost for accurate field-level inference in a realistic setting.

[AI-313] Where Is My Physics Wrong? Localized and Identifiable Discovery of Model Discrepancy

Link: https://arxiv.org/abs/2606.23215
Authors: Yifan Wang
Categories: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures

Abstract: Hybrid models combine trusted physics with data-driven correction, but a physical model is rarely wrong everywhere or in the same way. The key diagnostic question is local: where does the model fail, what missing mechanism explains the failure, and is the evidence statistically real? Existing sparse-discovery and discrepancy-learning methods usually fit one global correction, which can spread a local error into clean regimes, bias trusted physical parameters, and provide no calibrated significance for selected terms. We introduce LISDD, Localized, Identifiable Sparse Discovery of Discrepancy, a framework that localizes model error to an operating regime, identifies a sparse symbolic form for the missing mechanism, and certifies the discovery with an exact finite-sample test. LISDD fits the known physics on an automatically detected clean regime, flags discrepant regions with a calibrated residual-energy statistic, selects the local missing term by exhaustive holdout over a candidate library, and confirms significance with a sample-split F-test. A false-discovery-rate extension handles multiple discrepant regions with different missing mechanisms. In controlled experiments, LISDD keeps physical-parameter bias at 0.002 versus 0.43 for global-discrepancy and black-box baselines, raises localization F1 from 0.44 to 0.80, recovers the correct symbolic form with probability one, attains exact detection, and controls the multi-region false-discovery rate while recovering every planted mechanism. The result is a calibrated diagnostic tool for grey-box building-energy models when a fixed physical law silently breaks in one operating regime.

[AI-314] AI-Empowered UAV-Assisted Backscatter Localization and ISAC for Zero-Energy IoT: A Comprehensive Survey

Link: https://arxiv.org/abs/2606.23125
Authors: Ruhul Amin Khalil
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 33 pages, 19 figures, 7 tables. Submitted to Elsevier for Possible Publication

Abstract:Zero-energy Internet of Things (IoT) enables passive or near-passive devices to operate on harvested energy rather than batteries. Backscatter communication (BackCom) supports this vision by enabling tags to transmit data via reflection and modulation of incident RF signals, but it suffers from weak reflections, double-path loss, limited coverage, direct-link interference, and dependence on external RF sources. Unmanned aerial vehicles (UAVs) can mitigate these limitations by acting as mobile carrier emitters, data collectors, relays, aerial receivers, mobile anchors, sensing platforms, and edge-intelligence nodes. Integrated sensing and communication (ISAC) further enables the sharing of wireless resources for data transmission, localization, target sensing, and environmental awareness. This article surveys RF-based AI-empowered UAV-assisted backscatter localization and ISAC for zero-energy IoT. It reviews enabling technologies, presents a structured PRISMA-informed methodology, and develops a unified taxonomy covering network architectures, UAV roles, backscatter modes, RF sources, localization and sensing functions, AI techniques, and performance metrics. It also discusses UAV-assisted BackCom, passive localization, ISAC-enabled UAV-backscatter systems, and AI-driven optimization through comparative tables, quantitative trend analysis, coverage evaluation, and tutorial-style numerical illustrations. Finally, it identifies open challenges and future directions in realistic channel modeling, energy-neutral operation, benchmarking, reproducibility, scalable and trustworthy AI, security, privacy, hardware validation, and integration with RIS, MEC, digital twins, and 6G technologies.

[AI-315] Physics-governed executable modelling of triboelectric nanogenerators

Link: https://arxiv.org/abs/2606.23051
Authors: Hongfa Zhao, Baiqiao Wang, Tiancong Zhao, Chun Jin, Hanlin Zhou, Mingrui Shu, Minyi Xu, Liwei Lin, Wenbo Ding, Zhong Lin Wang
Categories: Applied Physics (physics.app-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)
Comments:

Abstract: Predictive modelling of triboelectric nanogenerators (TENGs) remains fragmented across analytical theories, finite-geometry solvers and disconnected simulation workflows. These disparate approaches must be unified into an executable framework to advance quantitative TENG [...]. Here we introduce a charge-defined modelling framework and implement it as TENG-CLAW, a physics-governed platform for traceable TENG simulation. The framework establishes a self-consistent electrostatic hierarchy in which triboelectric charges, pre-charging charges and compensating electrode charges serve as defining state [...]. This hierarchy connects the infinite plate analytical limit for near-uniform fields with finite-geometry numerical formulations required for edge-dominated devices. Built on this basis, TENG-CLAW converts user-defined research requests into physically admissible simulation tasks, so that generated outputs are tied to explicit charge states, boundary conditions, solver routes and reusable artifacts across spatial, temporal, field-level, comparative and reporting workflows. This work establishes a rigorous computational basis for interpreting TENG mechanisms and provides reproducible research infrastructure for simulation and physics-guided device design.

[AI-316] Domain Adaptation Under Wireless Network Constraints: When Does It Become Green?

Link: https://arxiv.org/abs/2606.23047
Authors: Illyyne Saffar, Aurélie Boisbunon, Shruti Bothe
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: The deployment of data-driven models in 6G wireless networks is increasingly challenged by frequent distribution shifts that degrade performance over time. Unsupervised Domain Adaptation (UDA) offers an alternative approach by adapting the trained model to a shifted domain without requiring labels. However, UDA pipelines are often more complex than single-task training due to additional modules and optimization procedures, raising a practical question: do the benefits of adaptation come at a higher energy cost, and how does this trade-off compare to retraining when labeling effort is also considered? In this work, we investigate the energy consumption of UDA and compare it to that of single-task training. We further propose a way to determine the minimum number of target domains for which UDA becomes more energy-efficient than retraining, taking into account the labeling cost. Our results aim to clarify when UDA should be preferred over classical train-from-scratch approaches from an energy and labeling-aware perspective.

[AI-317] Explainable AI in Speaker Recognition – Attention Map Visualisation and Evaluation

Link: https://arxiv.org/abs/2606.22901
Authors: Yanze Xu, Mark D. Plumbley, Wenwu Wang
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Work in progress

Abstract:Explaining and understanding the decision-making process of artificial intelligence (AI) systems, particularly those implemented by neural networks, falls within the field of explainable AI (XAI). Analogous to the human attention mechanism, neural networks are assumed to possess their own attention mechanisms that selectively process information during decision-making. This work proposes to study one XAI topic: analysing and visualising the attention mechanisms of neural networks. Our experiments are performed on speaker recognition neural networks that are trained to identify speaker identity from a given utterance. Previous studies have widely used class activation map (CAM)-based methods to analyse and visualise the attention mechanisms of neural networks. Each of these methods produces an attention map for each network input, highlighting which input regions are selectively processed when the speaker recognition network makes decisions. However, the evaluation of attention maps produced by these methods remains largely underexplored. This work systematically reviews an existing attention map evaluation algorithm, establishing key concepts and identifying its shortcomings. On the basis of this existing evaluation algorithm, a new version is then proposed to address the identified shortcomings, called the Modified Randomised Input Sampling for Explanation - Evaluation algorithm (Modified RISE-eval). Using Modified RISE-eval, we evaluate the attention maps produced by two representative CAM-based methods, GradCAM and LayerCAM, applied to a certain speaker recognition network. The evaluation results demonstrate that GradCAM and LayerCAM each exhibit distinct advantages when applied under different experimental conditions in the speaker recognition task. 

[AI-318] Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking ICML2026

Link: https://arxiv.org/abs/2606.22719
Authors: Mao Guan, Qian Chen
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures. Accepted at the ICML 2026 Workshop on AI Forecasting (Forecasting as a New Frontier of Intelligence). Non-archival. OpenReview: this https URL

Abstract:Forecasting benchmarks for retrieval-augmented LLMs routinely confound model capability with information leakage: features labeled with a target’s timestamp are often not observable at the system’s decision time. We study leakage-controlled equity factor ranking with a retrieval-augmented 7B open-source LLM forecaster. At each month-end from 2023-04 to 2026-03, the forecaster observes only decision-time information: lag-shifted FRED macro variables, recent macro-event summaries, and the Cleveland Fed’s archived daily CPI nowcast for unreleased current-month inflation. A macro-analog retrieval module selects historical states, a critic LLM compresses them into one tactical rule, and an actor LLM maps the current state and recent rules into scores for seven U.S. equity style factors. The full pipeline obtains a median monthly Spearman rank IC of +0.154, with positive means across three non-overlapping contiguous 12-month subwindows; the mean IC remains statistically underpowered, with a bootstrap 95% confidence interval that includes zero. Non-LLM baselines under the same decision-time constraint demonstrate that a kNN macro-analog model recovers a comparable median IC, indicating that real-time inflation information and macro-similar retrieval explain much of the median signal. The LLM pipeline retains higher mean IC and a stronger long-short allocation sanity check, suggesting that any marginal benefit is concentrated in the extreme rankings that drive long-short portfolio formation. A descriptive audit of the 36 critic rules and per-month case studies appears in the appendix.
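The headline metric above, the monthly Spearman rank information coefficient (IC), is the rank correlation between the forecaster's factor scores and the factors' realized returns in that month. A minimal pure-Python sketch follows; the function names and the numbers in the usage note are hypothetical and not taken from the paper.

```python
from statistics import mean, median


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_ic(scores, realized_returns):
    """Spearman rank IC: Pearson correlation computed on the ranks."""
    return pearson(ranks(scores), ranks(realized_returns))


def median_monthly_ic(monthly_pairs):
    """Median IC over a list of (scores, realized_returns) pairs, one per month."""
    return median(spearman_ic(s, r) for s, r in monthly_pairs)
```

For one month with seven factors, `spearman_ic([0.3, 0.1, 0.9, 0.5, 0.2, 0.7, 0.4], returns)` would give a value in [-1, 1]. With only seven items per month each monthly IC is coarse, which is consistent with the abstract's caution that the mean IC is statistically underpowered.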

[AI-319] Data Evolution by Wittgenstein's Rule Following

Link: https://arxiv.org/abs/2606.22674
Authors: Aydin Ghojogh, Benyamin Ghojogh
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: This paper introduces Wittgenstein’s Rule Following (WRF) data evolution, a framework in philomatics for evolving or generating a new dataset from a sequence of previously observed datasets. The method is inspired by Ludwig Wittgenstein’s rule-following considerations and his notion of family resemblance in Philosophical Investigations. Unlike standard synthetic data generation, where the goal is usually to sample from or augment a fixed distribution, WRF aims to continue the implicit rule expressed by a historical sequence of datasets while preserving resemblance to the previous datasets. WRF represents each dataset by structural descriptors rather than pointwise correspondences. These descriptors summarize geometric, distributional, clustering, and, in the supervised case, label-based properties of the data. The method predicts a rule-following target by extrapolating descriptor trajectories and a family-resemblance target by averaging historical descriptors. Candidate datasets are then generated from the observed history through balanced or bounded mixture recombination, scored according to these targets, and optionally refined through differentiable optimization in descriptor space. The proposed framework allows both sample size and feature dimension to vary over time and does not assume that the next dataset is a direct transformation of the last one. Simulations on synthetic and image datasets show that WRF can generate meaningful continuations of evolving datasets in both unsupervised and supervised settings.

[AI-320] An LLM-Orchestrated Agent for Directional-Coupler Design with Self-Consistent Eigenmode and FDTD Validation

Link: https://arxiv.org/abs/2606.22493
Authors: Saumya Biswas, Amrit De, Md Tauhidul Islam
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI)
Comments:

Abstract: We present a design agent, a Large Language Model (LLM) that orchestrates, but does not perform, the numerical simulations to design a silicon-on-insulator (SOI) $2\times2$ directional coupler. We choose a symmetric phase-matched coupler, for which many analytical results are available to guide the design strategy. The LLM proposes candidate gap values (a geometric dimension) and judges convergence, while all physics is owned by deterministic solvers: a frequency-domain eigenmode solver estimates the coupling coefficient $\kappa$ for the current design, and an independent Finite-Difference Time-Domain (FDTD) stage validates it. Both solvers operate on a common slab-projected two-dimensional (2D) effective-index reduction of the silicon film, so the design $\kappa$ and the FDTD response are consistent by problem design; the residual between them is shown to be a single constant phase offset $\phi$, attributable to a fixed excess coupling length $L_{\mathrm{extra}} = 2.837(11)\,\mu\mathrm{m}$ that we find invariant across a factor-of-two range in $\kappa$. Folding this offset into a closed-loop length correction, the agent delivers a 50/50 splitter whose FDTD-measured cross fraction is 0.498 (target 0.500), a residual of 0.0017. Results are self-consistent within the 2D effective-index model, and the LLM succeeds in delivering a suitable design over a number of attempts.
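The length correction described above can be illustrated with the textbook coupled-mode relation for a symmetric, phase-matched coupler, in which the cross-port power fraction is sin²(κ·L). The sketch below assumes the excess coupling length simply adds to the straight-section length; the abstract does not spell out the agent's actual correction formula, and the function names and the κ value in the usage note are hypothetical. Only the 2.837 µm figure comes from the abstract.

```python
import math

# Excess coupling length reported in the abstract, in micrometres.
L_EXTRA_UM = 2.837


def cross_fraction(kappa_per_um, length_um, l_extra_um=L_EXTRA_UM):
    """Cross-port power fraction under the assumed model sin^2(kappa * (L + L_extra))."""
    return math.sin(kappa_per_um * (length_um + l_extra_um)) ** 2


def corrected_length(kappa_per_um, target_cross=0.5, l_extra_um=L_EXTRA_UM):
    """Shortest straight-section length that hits the target cross fraction.

    Solves kappa * (L + L_extra) = asin(sqrt(target)) for L.
    """
    return math.asin(math.sqrt(target_cross)) / kappa_per_um - l_extra_um
```

For a hypothetical κ of 0.05 rad/µm, `corrected_length(0.05)` returns a length shorter than the uncorrected π/(4κ) by exactly `L_EXTRA_UM`, and feeding it back into `cross_fraction` recovers the 0.5 target.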

[AI-321] Flow Annealing Posterior Sampling for Function-Space Regression and Inverse Problems

Link: https://arxiv.org/abs/2606.22346
Authors: Yaozhong Shi, Zachary E. Ross, Yisong Yue
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Principled regression for stochastic processes is a long-standing challenge with deep connections to scientific inverse problems. We introduce Flow Annealing Posterior Sampling (FAPS), to our knowledge the first function-space posterior sampling framework that unifies stochastic-process regression and PDE inverse problems. Built on pretrained function-space flow-matching priors, FAPS enables likelihood-guided posterior inference from sparse and noisy observations, supports variable query discretizations, and avoids explicit prior-density evaluation. Its Langevin correction uses a low-rank covariance preconditioner to exploit dominant function-space correlations across discretizations. Across Gaussian and non-Gaussian stochastic-process regression benchmarks and diverse PDE inverse problems, FAPS produces coherent posterior samples with accurate uncertainty quantification, significantly outperforming existing functional regression baselines and achieving competitive or better PDE noisy inverse performance than diffusion-based posterior samplers while reducing test-time sampling cost.

[AI-322] DSSCNet: A Transfer Learning Framework for Cross-Corpus Dysarthric Speech Severity Classification

Link: https://arxiv.org/abs/2606.22178
Authors: Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Dysarthric speech severity classification is challenging due to speaker variability, class imbalance, and limited datasets. This study introduces DSSCNet, a deep learning model that employs transfer learning and multi-corpus learning to enhance speaker-independent classification. By pre-training on one dysarthric speech corpus and fine-tuning on another, DSSCNet achieves improved feature extraction and cross-corpus generalization. Experimental results demonstrate that DSSCNet outperforms state-of-the-art models for speaker-independent severity classification, achieving 75.80% accuracy on TORGO and 68.25% on UA-Speech, significantly reducing misclassification errors. The findings confirm that leveraging knowledge transfer between datasets improves model robustness, making DSSCNet well-suited for automated dysarthria assessment. This research contributes to the development of more effective assistive speech technologies for individuals with speech impairments.

[AI-323] How Well Do Self-Supervised Speech Models Encode Age and Gender in Children's Speech? A Layer-Wise Analysis Across Multiple Architectures

Link: https://arxiv.org/abs/2606.22177
Authors: Abhijit Sinha, Hemant Kumar Kathania, Mohit Joshi, Harishankar Kumar, Shrikanth Narayanan, Sudarsana Reddy Kadiri
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:Self-supervised learning (SSL) models have become a central component of modern speech processing systems, as they enable the learning of rich acoustic representations without reliance on labeled data. Despite their success on adult speech, it remains unclear how effectively these models capture speaker-related attributes such as age and gender in children’s speech, which differs substantially from adult speech due to ongoing physiological and cognitive development. Higher pitch, increased articulatory variability, and age-dependent acoustic changes make children’s speech a particularly challenging domain. In this work, we present a comprehensive analysis of how age and gender information is encoded across layers of four widely used SSL models: Wav2Vec2, HuBERT, Data2Vec, and WavLM. Layer-wise features are extracted and evaluated using a lightweight CNN on two benchmark children’s speech corpora, PFSTAR and CMU Kids. To analyze feature compactness and redundancy, PCA is applied to identify redundancy and highlight the dimensions that contribute most to classification performance. Experimental results show that age- and gender-related information is unevenly distributed across SSL layers, with early to mid-level layers encoding the strongest paralinguistic cues. HuBERT achieves the best overall performance for age classification, while Wav2Vec2 and HuBERT lead gender classification on PFSTAR and CMU Kids, respectively. Beyond single-split evaluation, we further demonstrate that these findings remain stable under speaker-wise cross-validation, layer aggregation, and cross-database evaluation, indicating robustness to data imbalance and domain mismatch. Finally, we show that reliable age and gender classification is achievable even from short speech segments of 1–3 seconds.

[AI-324] Fine-Tuning Large Language Models for Quantum Reasoning

Link: https://arxiv.org/abs/2606.21974
Authors: Katherine Ip, Casey R. Myers, Udaya Parampalli, James Quach, Peiyong Wang
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 30 pages, 9 figures

Abstract:Large language models (LLMs) exhibit abilities beyond natural language modelling and text generation. Recent advances in their reasoning capabilities have spurred interest in applying LLMs to complex scientific tasks requiring deep domain expertise and sophisticated reasoning. Quantum computing, as a highly specialised field with significant knowledge barriers and hardware constraints, could greatly benefit from such advancements. However, a key open question that first must be answered is: How can we develop fine-tuning pipelines that instil genuine quantum reasoning in LLMs, rather than task-specific pattern matching? We study this question through quantum circuit simulation as a training objective, where the model must predict the measurement probability distribution resulting from a sequence of quantum gate operations. We propose and compare two fine-tuning pipelines: (1) Supervised Fine-Tuning (SFT) on explicit gate-by-gate state-vector simulation traces, and (2) a two-stage SFT+Group Relative Policy Optimisation (GRPO) approach that sequentially applies SFT followed by GRPO with verifiable rewards. Our findings show that SFT achieves near-perfect in-distribution and gate-count extrapolation accuracy, significantly outperforming both the base model and the GPT-OSS-120B baseline. SFT+GRPO trades some in-distribution precision for better generalisation to larger qubit systems that SFT alone cannot handle. Both pipelines significantly outperform the baselines, demonstrating that targeted fine-tuning on explicit reasoning traces is an effective strategy for advancing quantum reasoning in LLMs.
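The training objective described above, predicting the measurement distribution that results from a sequence of gates, corresponds to what a state-vector simulator computes gate by gate. The following pure-Python sketch shows that computation for a tiny gate set; the circuit encoding, the function names, and the qubit-ordering convention (qubit 0 is the least significant bit) are assumptions for illustration and not the paper's trace format.

```python
import math

S = 1 / math.sqrt(2)
GATES = {
    "H": [[S, S], [S, -S]],  # Hadamard
    "X": [[0, 1], [1, 0]],   # Pauli-X (bit flip)
}


def apply_single(state, gate, target):
    """Apply a 2x2 gate to one qubit of the state vector."""
    new = list(state)
    bit = 1 << target
    for i in range(len(state)):
        if (i & bit) == 0:  # visit each amplitude pair once
            a, b = state[i], state[i | bit]
            new[i] = gate[0][0] * a + gate[0][1] * b
            new[i | bit] = gate[1][0] * a + gate[1][1] * b
    return new


def apply_cnot(state, control, target):
    """Swap the target's amplitudes wherever the control bit is 1."""
    new = list(state)
    c, t = 1 << control, 1 << target
    for i in range(len(state)):
        if (i & c) and not (i & t):
            new[i], new[i | t] = state[i | t], state[i]
    return new


def simulate(n_qubits, circuit):
    """Return {bitstring: probability}; the rightmost character is qubit 0."""
    state = [0j] * (2 ** n_qubits)
    state[0] = 1 + 0j  # start in |00...0>
    for op in circuit:
        if op[0] == "CNOT":
            state = apply_cnot(state, op[1], op[2])
        else:
            state = apply_single(state, GATES[op[0]], op[1])
    return {format(i, f"0{n_qubits}b"): abs(amp) ** 2 for i, amp in enumerate(state)}
```

Calling `simulate(2, [("H", 0), ("CNOT", 0, 1)])` prepares a Bell state, with probability split evenly between `"00"` and `"11"`. The state vector doubles in size with every added qubit, which may be part of why generalising to larger qubit systems is the harder case in the abstract.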

[AI-325] Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach INTERSPEECH2026

Link: https://arxiv.org/abs/2606.21215
Authors: Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou, Kuan-Yu Chen, Hsin-Yen Sung, Shrikanth Narayanan, Hung-yi Lee
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted by INTERSPEECH 2026

Abstract:As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework combining frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss on speech inputs via a pretrained teacher retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Our method reduces speech-NVV EER from 38.93% to 22.66% over a pretrained baseline, and improves speech EER from 13.17% to 9.24% via distillation.
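The results above are reported as equal error rate (EER), the operating point at which the false-acceptance rate on impostor trials equals the false-rejection rate on genuine trials. A minimal sketch of how EER could be estimated from verification scores follows; the function name and the example scores are hypothetical, and evaluation toolkits typically interpolate the ROC curve rather than use this simple threshold sweep.

```python
def equal_error_rate(genuine, impostor):
    """Estimate EER by sweeping a threshold over all observed scores.

    genuine:  similarity scores for same-speaker trials (higher means more similar)
    impostor: similarity scores for different-speaker trials
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With genuine scores `[0.9, 0.8, 0.7, 0.4]` and impostor scores `[0.6, 0.3, 0.2, 0.1]`, one trial on each side is misclassified at a threshold of 0.6, giving an EER of 0.25.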

[AI-326] QBioFusion-QSAR: Morgan-Anchored Quantum Multiple Kernel Learning for Small-Data Ligand Classification

Link: https://arxiv.org/abs/2606.21213
Authors: Azadeh Alavi, Fatemeh Kouchmeshki, Muhammad Usman, Jessica Holien
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures

Abstract:Small quantitative structure-activity relationship (QSAR) studies are difficult when close molecular analogues have different activity labels. This paper asks whether a quantum kernel can add similarity information to a Morgan/Tanimoto fingerprint model, and which molecules account for the change. QBioFusion-QSAR uses quantum multiple kernel learning (QMKL): a support vector machine combines a Morgan/Tanimoto kernel with a quantum fidelity kernel constructed from fold-local components derived from RDKit and Mordred descriptors and Deep-PK features. Linear and radial basis function descriptor kernels are included as classical controls. On the 54-molecule PsychLight-A benchmark, Morgan/Tanimoto was the strongest single representation. In the primary stratified five-fold evaluation, QMKL increased accuracy from 0.815 to 0.833 and Matthews correlation coefficient (MCC) from 0.613 to 0.645. Matched-regularization auditing attributed the change to N-Me-5-HT and N-Me-tryptamine changing from false-negative to true-positive predictions; activity-cliff subset MCC increased from 0.07 to 0.22. Repeating the five-fold protocol over ten random partitionings showed that learned QMKL did not exceed Morgan/Tanimoto on mean MCC; paired held-out bootstrap intervals for the matched comparison also span zero. These results support QBioFusion-QSAR as an auditable QMKL framework for identifying localized residual quantum-kernel contributions in small-data, activity-cliff-aware ligand classification.
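Two quantities recur in the abstract above: the Tanimoto similarity between fingerprints, which serves as the classical kernel, and the Matthews correlation coefficient (MCC) used for evaluation. The sketch below computes both on plain Python data. The fixed-weight `combined_kernel` is only a stand-in for the learned kernel weighting, the quantum fidelity kernel itself is not implemented, and all names and numbers are hypothetical.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def combined_kernel(k_morgan, k_quantum, weight):
    """Convex combination of two kernel values with a fixed weight in [0, 1].

    Assumption: in multiple kernel learning the weight would be learned;
    here it is supplied by the caller.
    """
    return weight * k_morgan + (1 - weight) * k_quantum


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` is 0.5, because two of the four distinct on-bits are shared. On a 54-molecule set, a change in just two predictions moves MCC visibly, which matches the abstract's attribution of the gain to two molecules.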

[AI-327] Communication Heterogeneity and Collective Consensus in Neural Cellular Automata

Link: https://arxiv.org/abs/2606.21202
Authors: Nishit Singh
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
Comments: 8 pages, 11 figures

Abstract: Reaching global agreement from purely local interactions is a defining problem of collective intelligence, and most models of it assume that all agents share a single communication protocol. We ask what happens when they do not. Using a Neural Cellular Automaton in which a population of cells must solve the density classification task, agreeing on a global majority that no individual can observe, we introduce "languages" as sub-populations that read one another's messages through a translation with a tunable "linguistic distance". We find that linguistic distance slows consensus, that it produces mild divergence between groups rather than full fragmentation, and that a collective whose shared rule was trained under diverse protocols is robust to mismatch; a homogeneously trained one is not. The findings hold on both a ring and a two-dimensional grid, and admit a natural reading as Ising relaxation, in which a foreign-language region acts as a boundary defect that leaves the system in a higher-energy, partially ordered state. These patterns are qualitatively consistent with effects reported in human group studies, suggesting that distance between communication protocols is a minimal mechanism sufficient to produce them, without anything language-specific.

[AI-328] A large-scale foundation model enables simulation-to-real adaptation for nuclear magnetic resonance-based molecular structure analysis

Link: https://arxiv.org/abs/2606.20756
Authors: Chen Yang, Zheng Fang, Hanyu Sun, Fanjie Xu, Hongxin Xiang, Hanyu Gao, Xiangxiang Zeng, Yuqiang Li, Xiaojian Wang, Jun Xia
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Nuclear Magnetic Resonance (NMR) spectroscopy is a powerful tool for molecular structure analysis, and spectral artificial intelligence offers great potential for its rapid and automated interpretation. However, the scarcity of experimental NMR datasets has constrained deep learning in this domain to narrow, task-specific applications that lack broad generalization. Here, we introduce UltraNMR, a large-scale foundation model for NMR that leverages the intrinsic properties of NMR spectra to learn generalizable spectral representations. We collected 158 million paired simulated $^{1}$H and $^{13}$C NMR spectra to train UltraNMR, employing multiple domain-specific pre-training objectives. UltraNMR captures both intra-spectral and inter-spectral dependencies, enabling seamless simulation-to-real adaptation. We demonstrate that adapting UltraNMR to a range of molecular structure analysis tasks on experimental NMR spectra consistently yields state-of-the-art performance and clearly outperforms UltraNMR variants trained directly on downstream data without simulation pre-training. We also construct a large-scale NMR spectral vector library by encoding simulated NMR spectra using UltraNMR, covering 94 million unique molecules and enabling effective structure-aware retrieval. In real-world applications, UltraNMR facilitates the structural elucidation of two previously unknown natural products from Chinese herbal medicines recorded in the Chinese Pharmacopoeia. These results suggest that large-scale simulation pre-training can effectively bridge the simulation-to-real gap, enabling robust and generalizable molecular structure analysis of real-world NMR spectra.

[AI-329] Empowering Polymeric Materials Discovery by Artificial Intelligence

Link: https://arxiv.org/abs/2606.20753
Authors: Chenyao Ma, Linda Zhang, Yuheng Chen, Wei Du, Shangwen Fang, Zihao Jiang, Chuanyu Liu, Xinyu Ma, Rui Su, Gang Wang, Muyao Yu, Dong Zhong, Jie Zhu, Weibo Gong, Huan Gu, Limin Li, Chen Shen, Rui Wu, Zhenghao Wu, Kan Xu, Min Zhou, Donglin He, Xiayun Huang, Shan Jiang, Pengfei Ou, Jiayu Peng, Yuwei Zhang, Jie Zhao, Di Zhang, Piao Ma, Zhenghao Li, Hao Li
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:Polymeric materials underpin modern technologies spanning energy storage, microelectronics, healthcare and sustainable manufacturing. Yet their rational design remains exceptionally challenging because material performance emerges from complex interactions among molecular composition, chain architecture, processing history and hierarchical structural evolution across multiple length and time scales. Consequently, polymer research has long relied on labor-intensive experimentation and fragmented modeling approaches, limiting both mechanistic understanding and innovation efficiency. Recent advances in data infrastructure, machine learning, large artificial intelligence (AI) models and laboratory automation are beginning to reshape this landscape. Rather than functioning as isolated tools, polymer databases, predictive models, AI agents and automated laboratories are increasingly converging into interconnected discovery ecosystems. As a result, the central challenge is shifting from improving predictive accuracy alone to enabling reliable decision-making, adaptive learning and seamless integration across computation, experimentation and scientific reasoning. We argue that polymer science is entering an era of autonomous discovery, in which data, simulation, reasoning and experimentation operate within self-improving feedback loops that continuously generate hypotheses, design materials, execute experiments and refine predictive models. By unifying molecular design, process optimization, experimental validation and industrial translation, such autonomous ecosystems establish a more predictive, reproducible and scalable paradigm for polymer innovation, fundamentally transforming how polymer research is conducted.

[AI-330] LLM-Guided Test-Time Discovery of Quantum-Chemical Approximation Algorithms

Link: https://arxiv.org/abs/2606.20729
Authors: Masaya Hagai, Yuta Suzuki, Tomoya Murata, Shuhei Kurita, Masaki Adachi
Subjects: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 main pages, 6 main figures, and 2 tables

Abstract:Quantum chemistry simulations underpin modern materials discovery, yet their impact is limited by steep computational cost and dependence on fixed approximation schemes. Foundation models, such as machine-learned interatomic potentials, have accelerated parts of this workflow, but their reliance on large-scale pretraining restricts adaptability at the frontier of chemical space, where methodological innovation and sparse data are the norm. Agentic AI systems can automate existing simulation pipelines, yet they remain constrained by the predefined tools and algorithms they orchestrate. In response, we introduce LADeQ, an LLM-guided workflow that discovers, implements, and benchmarks candidate approximation algorithms at test-time within existing quantum chemistry codes. Rather than selecting from a predefined repertoire, LADeQ constructs candidate approximation schemes on demand, drawing on techniques from disciplines such as spatial statistics, circuit simulation, and kernel methods that have had little prior presence in electronic-structure theory. Because it builds on an out-of-the-box language model, LADeQ requires no task-specific pretraining or curated data, and the resulting implementations are transparent and inspectable, with explicitly traceable approximation errors that enable principled control of accuracy–efficiency trade-offs. We show that LADeQ accelerates coupled cluster singles and doubles (CCSD) and configuration interaction singles and doubles (CISD) calculations while keeping correlation-energy errors within user-specified tolerances, demonstrating autonomous, objective-driven discovery of approximation algorithms inside existing electronic-structure solvers.

[AI-331] Distributed Quantum Learning over Near-term Devices: Convergence Analysis and Security Design

Link: https://arxiv.org/abs/2606.20606
Authors: Atit Pokharel, Shaba Shaon, Thomas Morris, Dinh C. Nguyen
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: This paper is to be published in the IEEE Journal on Selected Areas in Communications

Abstract:Distributed quantum learning (DQL) has emerged as a promising paradigm to scale quantum-enhanced machine learning by interconnecting multiple quantum devices. However, for efficient real-world deployment, it is essential to characterize how DQL converges under practical scenarios while simultaneously safeguarding multi-device quantum infrastructures from evolving security threats. Addressing these aspects in an integrated manner is key to ensuring both performance and resilience in large-scale DQL systems. Therefore, this paper presents a new DQL study where our innovation lies in: (i) conducting a holistic convergence analysis for DQL under practical settings, i.e., partial device participation, non-convex loss functions, and heterogeneous data distributions, (ii) developing a novel multi-layered post-quantum cryptographic architecture with a quantum neural network-powered adaptive mechanism that monitors conditions, evaluates threats, and adjusts parameters across three National Institute of Standards and Technology (NIST)-compliant levels. Our theoretical framework and empirical validation reveal two key insights: (i) the derived convergence bound uncovers a fundamental trade-off between convergence rate, measurement shots, and the size of the participating device subset; and (ii) findings from our evaluations on a physical testbed modeling quantum control architectures expose the performance limitations of static post-quantum security, while confirming that our adaptive framework effectively mitigates these overheads to preserve overall system efficiency. Specifically, the hardware experiments demonstrate that our dynamic security mechanism reduces total security execution time by approximately 49% relative to static high-security baselines, while maintaining a threat detection accuracy of over 91%. Furthermore, extensive simulations validate our theoretical analysis…

[AI-332] AI Contagion in Social Networks

Link: https://arxiv.org/abs/2606.15206
Authors: Olivier Bos, Stefano Bosi
Subjects: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 49 pages, 2 figures (coded in LaTeX)

Abstract:We study how artificial intelligence (AI) interacts with social communication networks to shape the stability of collective knowledge. Agents exchange information through a network while receiving AI-generated content, and AI systems retrain on the aggregate social information they influence. This interaction generates two feedback forces: an AI contagion channel, through which distortions diffuse across the network, and an AI social distortion multiplier, through which retraining amplifies past errors. Despite the high dimensionality of the environment, we show that the long-run behavior of the system admits a two-dimensional representation whose spectral radius determines whether AI-mediated information systems are dynamically stable or unstable. We characterize a sharp regulatory frontier identifying the minimum filtering required for stability and show how network topology shapes systemic informational risk.

[AI-333] Ratio Utility and Cost Analysis for Privacy Preserving Subspace Projection ICASSP2017

Link: https://arxiv.org/abs/1702.07976
Authors: Mert Al, Shibiao Wan, Sun-Yuan Kung
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to ICASSP 2017

Abstract:With a rapidly increasing number of devices connected to the internet, big data has been applied to various domains of human life. Nevertheless, it has also opened new venues for breaching users’ privacy. Hence it is highly required to develop techniques that enable data owners to privatize their data while keeping it useful for intended applications. Existing methods, however, do not offer enough flexibility for controlling the utility-privacy trade-off and may incur unfavorable results when privacy requirements are high. To tackle these drawbacks, we propose a compressive-privacy based method, namely RUCA (Ratio Utility and Cost Analysis), which can not only maximize performance for a privacy-insensitive classification task but also minimize the ability of any classifier to infer private information from the data. Experimental results on Census and Human Activity Recognition data sets demonstrate that RUCA significantly outperforms existing privacy preserving data projection techniques for a wide range of privacy pricings.

Machine Learning

[LG-0] AutoDex: An Automated Real-World System for Dexterous Grasping Data Collection

Link: https://arxiv.org/abs/2606.23689
Authors: Mingi Choi, Gunhee Kim, Jisoo Kim, Taeksoo Kim, Taeyun Ha, Jongbin Lim, Hanbyul Joo
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures. Includes supplementary material

Abstract:Learning robust dexterous grasping requires real-world data that records the physical outcomes of grasp attempts. Such data is hard to obtain at scale: teleoperation yields valid physical outcomes but is slow and operator-biased, while simulation-based generation is cheap and scalable but cannot certify contact validity. A natural solution is to generate candidate grasps and verify them on real hardware, but this scales only if the entire collection loop (perception, execution, labeling, and reset) runs without human intervention. We present AutoDex, an automated real-world data-collection system that closes this loop: for each candidate from a replaceable generator, it localizes the object under severe hand-object occlusion with dense 20-camera perception, executes collision-monitored robot motions, labels lift-and-hold success or failure, and actively resets the object between trials to expose additional candidates across stable poses. The result is a reusable database of physically labeled grasp trials that downstream systems can query by retrieval and feasibility filtering. Using AutoDex, we collect 3,593 grasp trials across Allegro and Inspire hands on 100 diverse objects, with synchronized multi-view observations and robot-state logs. For a matched 500-trajectory collection, AutoDex requires 10.3 h versus 49.4 h for teleoperation, yielding a 4.8x throughput improvement, and grasps retrieved from the AutoDex-validated database succeed 76% versus 34% for simulation-only validation. Code and data will be publicly released.

[LG-1] On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners

Link: https://arxiv.org/abs/2606.23668
Authors: David Mguni, Julian Ma, Jun Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) are frequently portrayed as general-purpose solvers capable of solving arbitrary tasks. We argue that this view overlooks a fundamental constraint: language is a compressed and capacity-limited interface for conveying task information. Modelling User–System interaction as a bilevel cheap-talk game, we analyse how latent tasks are encoded into prompts and reinterpreted under alignment and safety constraints. We introduce a conceptual decomposition separating task inference from execution and derive PAC-Bayes bounds that distinguish finite-sample estimation error from irreducible structural limitations. Our first main result establishes an expressivity floor: language acts as a capacity-limited communication channel, and whenever the informational complexity of a task family exceeds the capacity of that channel, distinct tasks become unavoidably indistinguishable to the Solver, inducing a strictly positive error floor that cannot be eliminated by additional data, optimisation, or model scaling alone. We then establish an objective-misalignment floor: when alignment constraints restrict the admissible output set, the User-ideal distribution may lie outside the feasible class, inducing an irreducible distortion. Together, these results yield a formal negative conclusion: prompt-conditioned LLMs are not universal problem solvers through prompting alone, as there exist task families for which correct behaviour is provably unattainable even in the infinite-data regime. More broadly, our analysis shows the limits of prompt-based generalisation arise from information-constrained communication and alignment-constrained objectives. This suggests that interfaces beyond natural language, including multimodal observations and external memory, may reduce the inherent LLM limitations by increasing the task-relevant information available to the System.

[LG-2] Dynamic estimation of slowly varying sequences

Link: https://arxiv.org/abs/2606.23655
Authors: Prashant Gokhale, Mikhail Khodak, Sandeep Silwal
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: Preprint. 14 pages, 4 figures

Abstract:We consider the problem of sequentially approximating functions of each element in a slowly-varying sequence, i.e. one where the magnitude $\alpha_i$ of the difference between the elements at positions $i$ and $i-1$ is small. Recent work on implicit trace estimation shows that when $\alpha_t$ is small, reusing queries to past sequence elements can reduce the overall cost [Dharangutte & Musco, NeurIPS 2021; Woodruff et al., NeurIPS 2022]. We introduce a framework generalizing this to a variety of linear and nonlinear functions on diverse vector spaces, obtaining novel sequential estimation results for matrix powers, spectral densities, Monte Carlo integration, and a boundary value problem from partial differential equations (PDEs). Furthermore, we develop a novel algorithm for use with this framework that locally scales the estimation budget with $\alpha_t$, obtaining sharper path-length-style variation bounds of form $\mathcal{O}(\sum_{i=1}^{m}\alpha_i)$ on the cost of estimating a sequence of length $m$. This improves upon the previous implicit trace estimation bound of $\mathcal{O}(m\cdot\max_i\alpha_i)$ [Dharangutte & Musco, NeurIPS 2021], which is achieved by fixing the query budget using the worst-case $\alpha_i$ and is thus inefficient for stable sequences with rare bursts. Lastly, while all past work assumes a known bound on $\alpha_i$, we show in certain cases how the changes can be estimated on-the-fly with (nearly) no added cost. In summary, our framework makes the sequential approximation toolkit general-purpose and adaptive while improving upon state-of-the-art guarantees for dynamic trace estimation.
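
The gap between the two bounds quoted in the abstract above can be made concrete with a little arithmetic. The sketch below compares the path-length quantity (the sum of the per-step changes) with the worst-case quantity (sequence length times the largest change) on a made-up sequence that is stable except for two bursts. The sequence values are hypothetical, and constants and logarithmic factors hidden by the big-O notation are ignored.

```python
def path_length_quantity(alphas):
    """Sum of per-step changes alpha_i (the path-length-style quantity)."""
    return sum(alphas)


def worst_case_quantity(alphas):
    """Sequence length m times the largest per-step change max_i alpha_i."""
    return len(alphas) * max(alphas)


# Hypothetical sequence: 100 steps, tiny changes everywhere except two bursts.
alphas = [0.01] * 100
alphas[30] = 1.0
alphas[70] = 1.0

adaptive = path_length_quantity(alphas)    # roughly 2.98
worst_case = worst_case_quantity(alphas)   # 100 * 1.0
print(adaptive, worst_case)
```

On this toy input the worst-case quantity is more than thirty times larger, which is the "stable sequences with rare bursts" regime the abstract describes.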

[LG-3] Muown Implicitly Performs Angular Step-size Decay

Link: https://arxiv.org/abs/2606.23637
Authors: Florian Hübler, Kai Lion, Antonio Orvieto, Niao He
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Matrix-aware optimizers such as Muon and Muown have recently shown strong empirical performance for pre-training Transformers. In particular, Muown separates each weight matrix into row magnitudes and an un-normalized direction variable, updating the former with Adam and the latter with Muon. We show that the directional update of Muown is equivalent to a Riemannian step on the normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. This explains the step-size stability of Muown and suggests making the angular step size explicit. The resulting method, AngularMuown, optimizes directly over the normalized directions and uses a schedulable angular multiplier decoupled from the radial magnitude update. AngularMuown improves over Muown and, at the time of writing, a preliminary version is leading the per-optimizer category of the modded nanoGPT speedrunning competition. Further experiments on Qwen2-0.5B, and 1.1B parameter mixture-of-experts models confirm the algorithm scales beyond small models. An implementation of the algorithm is available at this https URL

[LG-4] MORL-A2C: Multi-Objective Reinforcement Learning Reranker for Optimizing Healthiness in MOPI-HFRS KDD2026

Link: https://arxiv.org/abs/2606.23603
Authors: Aarya Vasantlal, Joshua Zolla
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the International Workshop on Resource-Efficient Learning for Knowledge Discovery (RelKD) at ACM SIGKDD 2026

Abstract:Unhealthy dietary behavior continues to be a persistent public health issue in the United States, exacerbated by recommendation systems that prioritize user preference without considering nutritional health. The Multi-Objective Personalized Interpretable Health-aware Food Recommendation System (MOPI-HFRS), from which this work extends, addresses this by jointly optimizing preference, health, and diversity through Pareto-based optimization. However, this approach relies on static, per-step tradeoff solutions that fail to capture the sequential nature of dietary decision-making. We introduce MORL-A2C, a sequential decision-making extension to MOPI-HFRS targeting the health-preference axis. Leveraging frozen GNN embeddings, MORL-A2C formulates recommendation as a K-step reranking problem using an Advantage Actor-Critic algorithm with a scalarized relevance/health reward. The policy is warm-started via behavior cloning against a dot-product ranker derived from frozen embeddings. We also identify and correct a non-trivial bug in the MOPI-HFRS evaluation pipeline that understated baseline performance; all results are reported against the corrected baseline. On the macro-nutrient benchmark, MORL-A2C achieves a modest reduction in ranking quality (Recall@20: 25.64% to 23.61%, NDCG@20: 23.52% to 20.64%) in exchange for a substantial improvement in health alignment (H-Score@20: 46.05% to 69.57%), with consistent trends on the full-nutrient benchmark. These findings validate that policy-driven sequential optimization can effectively navigate the health-preference trade-off in multi-objective food recommendation.
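
The abstract above mentions a "scalarized relevance/health reward" without giving its form. A common way to scalarize two objectives is a convex combination, sketched below. The function name, the `health_weight` parameter, and the example items are assumptions for illustration and are not taken from the paper, and the greedy pick stands in for the learned policy only to show how the weight shifts the trade-off.

```python
def scalarized_reward(relevance, health, health_weight):
    """Convex combination of two objectives (an assumed form, not the paper's).

    health_weight = 0 rewards relevance only; health_weight = 1 rewards health only.
    """
    if not 0.0 <= health_weight <= 1.0:
        raise ValueError("health_weight must lie in [0, 1]")
    return (1.0 - health_weight) * relevance + health_weight * health


# Hypothetical candidates: (name, relevance score, health score), all in [0, 1].
items = [("burger", 0.9, 0.2), ("salad", 0.6, 0.9)]


def best_item(health_weight):
    """Greedy pick under the scalarized reward."""
    return max(items, key=lambda it: scalarized_reward(it[1], it[2], health_weight))[0]


print(best_item(0.0), best_item(0.5))
```

With the weight at 0 the more relevant item wins; at 0.5 the healthier item overtakes it, mirroring the ranking-quality versus health-alignment exchange reported in the abstract.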

[LG-5] Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior

Link: https://arxiv.org/abs/2606.23591
Authors: Christopher J. Anders, Henrique Da Silva Gameiro, Nico Daheim, Mohammad Emtiyaz Khan
Subjects: Machine Learning (cs.LG)
Comments: 37 pages, 35 figures, preprint

Abstract:One way to understand LLM behavior is to trace its output back to the training data. Two types of measures are commonly used for output tracing: data-similarity and data-influence. The former is cheaper while the latter is believed to be more accurate. Even though many works have compared them for ground-truth tasks, no such comparisons exist for output tracing. Here, we fill this gap and precisely quantify the commonalities and differences between the two measures. We do this by first ranking the training documents according to each measure and then computing the overlap between the two rankings. Our main finding is that the two rankings agree significantly, but there is an asymmetry between them: The top documents of data-similarity are assigned more consistent ranks by data-influence than the other way around. This result is valid across a range of experiments involving OLMo2-1B, Qwen3-1.7B, LlaMa3.2-1B, Gemma3-1B, and GPT2. We exploit the asymmetry to obtain a favorable cost-accuracy trade-off by using the costly data-influence to refine the results of data-similarity.
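
The comparison procedure described above (rank the training documents under each measure, then compute the overlap between the rankings) can be sketched in a few lines. The scores, document ids, and the two helper statistics below are illustrative assumptions; the paper's exact overlap metric is not specified in the abstract.

```python
def ranking(scores):
    """Document ids sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of documents shared by the two top-k sets."""
    return len(set(ranking(scores_a)[:k]) & set(ranking(scores_b)[:k])) / k


def mean_rank_under(scores_src, scores_tgt, k):
    """Average 1-based rank, under the target measure, of the source's top-k."""
    tgt_rank = {doc: r for r, doc in enumerate(ranking(scores_tgt), start=1)}
    return sum(tgt_rank[doc] for doc in ranking(scores_src)[:k]) / k


# Hypothetical scores for four training documents.
similarity = {"d1": 0.9, "d2": 0.8, "d3": 0.3, "d4": 0.1}
influence = {"d1": 0.7, "d2": 0.2, "d3": 0.9, "d4": 0.1}

print(top_k_overlap(similarity, influence, 2))
print(mean_rank_under(similarity, influence, 2))
print(mean_rank_under(influence, similarity, 2))
```

Computing `mean_rank_under` in both directions is one simple way to look for the kind of asymmetry the abstract reports, since the statistic need not be the same when source and target are swapped.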

[LG-6] Its Much Easier for Neural Networks to learn Game of Life Dynamics with the Right Activation Function: Polynomial Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2606.23587
Authors: Tashin Ahmed, Q. Tyrell Davis
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Cellular Automata and Lattice Gases (nlin.CG)
Comments: To be published in Proceedings of the 2026 Artificial Life Conference

Abstract:Previous work has found a gap between the scale of neural networks that reliably learn Conway’s Game of Life, and minimal networks capable of representing the classic cellular automaton with hard-coded parameter values. Viewing neural network learning as a search process suggests a dependence on networks large enough to contain sub-networks with lucky initializations (sometimes known as ‘winning tickets’) that actually learn the task. In this work, we reorient our perspective from discovering Life rules as a search problem back to a learning problem, and reason that with fitting inductive biases, the problem should be much more amenable to minimal networks. We find that network variants with several alternative activation functions meaningfully outperform the default choice of Rectified Linear Units, and in particular, that a 2nd degree polynomial activation function consistently learns Life dynamics with or without the benefit of learning neural weights. Our results provide an informative demonstration of the benefits of matching learning to the task at hand and challenge the easy default choice of scale for all problems. In particular, we advocate for the use of cellular automata as simple test domains for developing strategies that can benefit machine learning for science, physics-based deep learning, and interpretable machine learning.
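
A small illustration, not taken from the paper, of why a degree-2 polynomial is a plausible inductive bias here: the Life update rule can be written exactly as a quadratic in a single linear feature of the neighborhood, followed by a threshold. With `s = 2 * neighbors + cell`, a cell is alive in the next step exactly when `s` is 5, 6, or 7, which is where `1.5 - (s - 6)**2` is positive. The feature `s` and the constant 1.5 are choices made for this sketch.

```python
def life_rule(cell, neighbors):
    """Standard Conway rule: birth on 3 neighbors, survival on 2 or 3."""
    return 1 if neighbors == 3 or (cell == 1 and neighbors == 2) else 0


def quadratic_life(cell, neighbors):
    """Same rule as a thresholded quadratic of one linear feature."""
    s = 2 * neighbors + cell          # linear combination of the inputs
    return 1 if 1.5 - (s - 6) ** 2 > 0 else 0


# Exhaustively compare the two forms over every (cell, neighbor-count) pair.
mismatches = [
    (cell, n)
    for cell in (0, 1)
    for n in range(9)
    if life_rule(cell, n) != quadratic_life(cell, n)
]
print(mismatches)
```

The exhaustive check covers all 18 input combinations, so an empty `mismatches` list means the two forms agree everywhere.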

[LG-7] A Spectral Theory of Normalized Corrected GNN Propagation

Link: https://arxiv.org/abs/2606.23572
Authors: Qihan Chen, Wei Li, Meng Qin, Jianfeng Hou
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We develop a spectral theory for normalized corrected GNN propagation. The object of study is the symmetric normalized adjacency with its degree-stationary component removed, matching the normalization used by standard GCN-style models while isolating the stationary direction most directly tied to oversmoothing. The central theoretical question is whether this corrected normalized operator preserves class-discriminative signal after many propagation layers. Our main result is a high-probability exact-recovery theorem for the binary Contextual Stochastic Block Model after $k=O(\log n)$ propagation steps in the dense polylogarithmic regime $p\ge C\log^{B} n/n$, for any fixed $B>4$, under explicit graph-signal and feature-SNR conditions. We also establish a multi-class partial recovery theorem showing contraction toward class centers for most nodes. Synthetic and real node-classification experiments are included as empirical checks of the theory’s predicted dependence on depth, graph signal, and feature noise.
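
One reading of "the symmetric normalized adjacency with its degree-stationary component removed" is the matrix D^{-1/2} A D^{-1/2} minus the rank-one projection onto the unit vector proportional to the square roots of the degrees. The sketch below builds that operator for a tiny graph and checks that the removed direction is mapped to zero. This is an assumed interpretation: the paper may include self-loops or use a different correction, and the helper names are hypothetical.

```python
import math


def corrected_normalized_adjacency(adj):
    """Return (operator, u) for D^{-1/2} A D^{-1/2} - u u^T, u_i = sqrt(d_i / sum d).

    Assumes an undirected graph without isolated nodes and without self-loops.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    if any(d == 0 for d in deg):
        raise ValueError("isolated nodes are not supported in this sketch")
    total = sum(deg)
    u = [math.sqrt(d / total) for d in deg]
    op = [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) - u[i] * u[j] for j in range(n)]
        for i in range(n)
    ]
    return op, u


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Path graph on four nodes.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
op, u = corrected_normalized_adjacency(adj)
residual = max(abs(x) for x in matvec(op, u))
print(residual)
```

Because the square-root-degree vector is an eigenvector of the normalized adjacency with eigenvalue 1, subtracting its projection leaves an operator that annihilates that direction, so repeated propagation no longer converges to it.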

[LG-8] Patient-Aware Contrastive Learning Preserves Per-Patient Structure in RR-Interval Representations

Link: https://arxiv.org/abs/2606.23570
Authors: Yasantha Niroshana, Weijith Wimalasiri, Chathuranga Hettiarachchi
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Contrastive representation learning struggles on physiological signals when each subject contributes a distinct baseline pattern. If class differences overlap with subject differences, class-level objectives such as supervised contrastive learning tend to merge per-subject structure into a single per-class cluster, removing the individual variation that a model needs to generalize to unseen patients. We study this problem in the setting of Paroxysmal Atrial Fibrillation (PAF) detection from RR-interval (RRI) sequences and propose a patient-aware contrastive objective that forms positive pairs only from same-patient, same-class segments, preserving each patient’s own sinus rhythm (SR) baseline while still pushing the two classes apart. Examining the learned embeddings directly, our objective achieves the most consistent per-patient SR structure (cohesion 0.850 vs. 0.800 for supervised contrastive loss (SupCon) and 0.772 for binary cross-entropy (BCE)). We also identify that BCE produces the cleanest global class separation yet the most disordered per-patient structure. This is precisely why a linear probe trained on its features breaks down on unseen patients. On the IRIDIA-AF dataset, the resulting representation reaches a patient-independent Area Under the Receiver Operating Characteristic Curve (AUROC) of $0.989 \pm 0.003$ with $2.6\times$ lower seed variance than supervised contrastive learning. These results highlight that per-subject geometric consistency, rather than global class separability, is key to robust cross-patient generalization.
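
The difference between class-level and patient-aware positives comes down to which pairs of segments are treated as "similar". The sketch below enumerates both pair sets for a toy batch. Only the pair-selection rule stated in the abstract (same patient and same class) is shown; the loss itself, the batch contents, and the function names are assumptions for illustration.

```python
def class_level_positives(labels):
    """Pairs sharing a class label, regardless of patient (class-level rule)."""
    n = len(labels)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j]]


def patient_aware_positives(patient_ids, labels):
    """Pairs sharing both patient and class label (patient-aware rule)."""
    n = len(labels)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if patient_ids[i] == patient_ids[j] and labels[i] == labels[j]
    ]


# Hypothetical batch of five RR-interval segments from two patients.
patient_ids = ["p1", "p1", "p1", "p2", "p2"]
labels = ["SR", "SR", "PAF", "SR", "PAF"]

print(class_level_positives(labels))
print(patient_aware_positives(patient_ids, labels))
```

In this batch the class-level rule pulls sinus-rhythm segments of different patients together, while the patient-aware rule keeps only the within-patient pair, which is how each patient's own baseline is preserved.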

[LG-9] Simulation-Free Estimation of Traffic Flows from Sparse Count Data

Link: https://arxiv.org/abs/2606.23536
Authors: Davide Guastella, Gianluca Bontempi
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We propose a method for estimating time-varying traffic flow patterns from sparse aggregated vehicle counts. The method partitions the study area into spatial regions, constructs a set of feasible region-to-region routes, and solves a weighted least-squares optimization problem to determine the number of vehicles to allocate on each route. A weighted contribution matrix encodes sensor coverage, steering the optimizer toward flow configurations that are directly observable by sensors. Edge-level trajectories are then derived by scoring candidate routes against the temporal and volumetric profiles of aggregated regional sensor counts. The method is evaluated on the Brussels road network using real and synthetic traffic data. Results show that the proposed approach reproduces the daily traffic profile in the input data and outperforms the baseline methods at a fraction of the computational cost.

[LG-10] Concordia: JIT-Compiled Persistent-Kernel Checkpointing for Fault-Tolerant LLM Inference

Link: https://arxiv.org/abs/2606.23521
Authors: Yuhang Gan, Yiwei Yang, Yuyi Li, Xiangyu Gao, Yichen Wang, Rain Jiang, Xiaoning Ding, Andi Quinn, Chen Qian
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Long-running LLM agents keep valuable state resident on GPUs: KV caches, request schedulers, communication state, and sometimes online adapters. Losing this state after a GPU or communicator failure can discard minutes to hours of work, yet existing recovery mechanisms either restart the whole serving stack or require application-specific checkpoint logic inside every attention and runtime component. This paper argues that fault tolerance for such workloads needs a GPU-resident execution context: checkpoint hooks must run at device synchronization points, observe binary kernels that frameworks and libraries actually execute, and recover without putting the host CPU on the critical path. We present Concordia, a runtime that uses a device-resident persistent kernel as the substrate for fault-tolerant LLM inference. Concordia interposes on GPU module loading and supports PTX- and SASS-level instrumentation, allowing checkpoint and pause hooks to be inserted below framework code and library boundaries. For each registered LLM state region, Concordia JIT-compiles a specialized delta-checkpoint handler – for example, a KV-block scanner, adapter-page scanner, or recovery applier – and hot-swaps it into the persistent kernel’s operator table. The persistent kernel consumes a lock-free ring buffer of compute, checkpoint, append-log, and recovery tasks, so the same always-on executor triggers dirty-page detection, stages deltas, and appends committed records to a CPU-visible log in CXL memory or host DRAM.

[LG-11] Collapsed Effective Operators for Higher-order Structures ICML2026

Link: https://arxiv.org/abs/2606.23517
Authors: Maximilian Krahn, Lennart Bastian, Vikas Garg, Björn Schuller, Tolga Birdal
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at ICML 2026

Abstract:Higher-order structures are powerful relational modeling tools, yet existing spectral operators decompose the topology into separate ranks, leaving practitioners to fuse the information back to vertices through ad hoc choices. We introduce Collapsed Effective Operators, which condense higher-order degrees of freedom into a single vertex-level operator via Schur complementation of a graded Laplacian. This yields a (generally dense) operator that encodes long-range interactions mediated by topology and is applicable to arbitrary higher-order constructs. We show it preserves positive semi-definiteness with a spectral upper bound relative to the rank-0 Hodge Laplacian, effectively lowering system energy under higher-order connectivity. Empirically, our operator improves spectral clustering, signal smoothing, and enables the inclusion of topological features in neural network architectures via positional encoding. The project page can be found at this http URL
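
Schur complementation, the operation named in the abstract above, eliminates one block of a symmetric block matrix and folds its effect into the remaining block: S = L00 - L01 L11^{-1} L10. The sketch below handles the simplest case, where the eliminated block is a single positive scalar, and applies it to an ordinary graph Laplacian rather than the paper's graded Laplacian. The function name and the example matrix are assumptions for illustration.

```python
def schur_complement_scalar(l00, l01, l11):
    """Eliminate one degree of freedom from [[l00, l01], [l01^T, l11]].

    l00 is an n x n symmetric block, l01 a length-n coupling vector, and
    l11 a positive scalar. Returns l00 - l01 l01^T / l11.
    """
    if l11 <= 0:
        raise ValueError("l11 must be positive to be inverted here")
    n = len(l00)
    return [[l00[i][j] - l01[i] * l01[j] / l11 for j in range(n)] for i in range(n)]


# Toy Laplacian of a path v0 - h - v1, with the middle node h eliminated.
l00 = [[1.0, 0.0], [0.0, 1.0]]   # v0 and v1 are not directly connected
l01 = [-1.0, -1.0]               # each is coupled to h with unit weight
l11 = 2.0                        # degree of h

collapsed = schur_complement_scalar(l00, l01, l11)
print(collapsed)
```

The collapsed matrix has a nonzero off-diagonal entry even though v0 and v1 share no edge, a small instance of an interaction mediated by the eliminated structure. Its diagonal is also no larger than that of `l00`, in line with the upper-bound statement in the abstract.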

[LG-12] Development and Design of FLKit: A Structured Onboarding Toolkit for Federated Learning in Health and Life Sciences

Link: https://arxiv.org/abs/2606.23500
Authors: Ashkan Pirmani, Ilse Vermeulen, Goran Vinterhalter, Lotte Geys, Axel Faes, Muhammad Quamber Ali, Nishkala Sattanathan, Geert Vandeweyer, Yves Moreau, Liesbet M. Peeters
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 18 pages, 2 figures

Abstract:Federated learning lets institutions train shared models without moving their data, which makes it a natural fit for health and life sciences research under strict privacy regulation. The methods are maturing fast, but the practical barrier now comes earlier: a team starting a federated project meets a scattered mix of frameworks, governance obligations, and unfamiliar roles, with no structured place to begin that fits its own background. FLKit closes that gap. It is an open, community-maintained onboarding toolkit that takes a multidisciplinary team through the full federated learning lifecycle and gives every contributor, clinical, legal, governance, or technical, a role-aware entry point instead of assuming fluency across all four. We modeled it on the ELIXIR Research Data Management Kit and built it with a multidisciplinary core team, a wider consortium supplying milestone reviews and roadmap direction, and external practitioners interviewed to keep the content grounded in real practice. FLKit sits on four lifecycle stages, Governance, Infrastructure, Wrangling, and Analysis, and connects them through 11 role-specific entry points, a cross-disciplinary glossary, a reusable FAIR-aligned FL Story template for planning and documenting projects, and a curated directory of tools, frameworks, and communities. Since the December 2024 demo it has grown to 39 pages across eight sections, with seven FL Stories documenting completed and ongoing projects in multiple sclerosis disability prediction, inflammatory bowel disease, genomics, and brain-computer interfaces. It is openly available at this https URL and welcomes contributions from across the life sciences.

[LG-13] TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

Link: https://arxiv.org/abs/2606.23496
Authors: Matan Ben-Tov, Mahmood Sharif
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Discrete text-trigger optimization – searching for text sequences that, when ingested by a model, steer it toward a specified objective – underpins model red-teaming (e.g., LLM jailbreaks), as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption and progress. First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants proliferate, each requiring engineering overhead to use or extend, and remaining hard to compare head-to-head. Together, these raise the bar for adopting optimizers in existing or new domains, and for advancing them via new strategies. We address these gaps with TROPT, the first open-source framework that unifies discrete optimizers’ execution and standardizes their development under a single interface. TROPT makes it easy to customize end-to-end optimization recipes by swapping any component – models, objectives, and optimizers – extending its reach across domains and new applications. TROPT currently ships with 30+ optimization recipes – covering applications such as jailbreaking and probing model internals – built from 15+ optimizers (spanning white-box to black-box access) and 15+ losses, from foundational to state-of-the-art methods. Demonstrating its utility, we leverage TROPT in several studies: (i) controlled, large-scale experiments comparing and enhancing optimization strategies for LLM jailbreaks, revealing potent-yet-underadopted techniques; and (ii) porting optimizers from one domain (e.g., LLM jailbreak) to new domains (e.g., corpus-poisoning embedding model). In all, TROPT significantly lowers the barrier to adopting and advancing discrete text optimization.

[LG-14] Do Location Encoders Capture Spatial Effects? A GeoShapley Benchmark Across Scales

Link: https://arxiv.org/abs/2606.23453
Authors: Daniel Kiv, Shaowen Wang
Categories: Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, 2 tables. Submitted to SIGSPATIAL '26 short papers

Abstract:Location encoders transform geographic coordinates into high dimensional embeddings for downstream machine learning, but it is unclear how well these representations capture interpretable spatial effects. We benchmark whether GeoShapley, a game-theoretic explainer that treats all location features as a single joint player, can recover spatially varying coefficients from models built on location-encoder embeddings. Eleven encoders from the TorchSpatial framework are evaluated against a synthetic process with known coefficients, across three scales (grid, county, global), with and without raw coordinates alongside the embedding, and under untrained and contrastively trained conditions. Measuring recovery as the correlation between estimated and true coefficients, we report how it varies with scale and encoder architecture and compare the embeddings against a raw-coordinate baseline. Recovery of the primary coefficient is consistently high across encoders, whereas recovery of a secondary coefficient is more scale-dependent, differing most at the global scale; the raw-coordinate baseline remains competitive throughout.

[LG-15] Selective Time Series Forecasting via Metalearning

Link: https://arxiv.org/abs/2606.23448
Authors: Ricardo Inácio, Vitor Cerqueira, Marília Barandas, Carlos Soares
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures

Abstract:Deep learning methods have achieved state-of-the-art in time series forecasting, yet their accuracy varies considerably across samples, as some instances remain inherently difficult to predict. Reject option mechanisms, which allow models to abstain from high-risk predictions, are well established in classification and regression but underexplored in forecasting. Existing abstention strategies typically rely on proxies, such as the width of the prediction interval or learned confidence scores derived from forecasts. However, these approaches are inherently tied to the training domain, limiting their ability to generalize. We propose a selective forecasting framework that addresses this limitation by modeling the empirical percentile of forecasting errors, that is, a scale-invariant statistic, based on structural characteristics extracted from recent lags via metalearning. By decoupling the rejection decision from the forecast itself and grounding it in domain-agnostic features, the framework enables effective abstention transfer across heterogeneous time series. Experiments in both in-domain and transfer learning settings show that rejecting samples predicted as challenging consistently improves forecasting accuracy across coverage levels.
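
The abstention idea can be illustrated with a minimal sketch. It assumes we already have per-sample absolute forecasting errors and a difficulty score per sample; in the paper that score would come from a metamodel over lag features, whereas here the true empirical percentile stands in for it. The function names `empirical_percentile` and `selective_mae` are hypothetical, not the authors' API.

```python
def empirical_percentile(value, reference):
    """Fraction of reference errors that are <= value (a scale-free score in [0, 1])."""
    return sum(r <= value for r in reference) / len(reference)


def selective_mae(abs_errors, predicted_percentiles, coverage):
    """MAE over the `coverage` fraction of samples with the lowest predicted difficulty."""
    n_keep = max(1, round(coverage * len(abs_errors)))
    # Rank samples from "predicted easy" to "predicted hard" and keep the easiest ones.
    order = sorted(range(len(abs_errors)), key=lambda i: predicted_percentiles[i])
    kept = order[:n_keep]
    return sum(abs_errors[i] for i in kept) / n_keep


# Toy absolute errors; the "predicted" scores are the true percentiles (an idealised metamodel).
abs_errors = [0.2, 1.5, 0.4, 3.0, 0.1]
predicted = [empirical_percentile(e, abs_errors) for e in abs_errors]

full_mae = selective_mae(abs_errors, predicted, coverage=1.0)
partial_mae = selective_mae(abs_errors, predicted, coverage=0.6)
```

With a perfect difficulty score, lowering coverage can only lower the error on the accepted samples; the interesting question studied in the paper is how close a learned, transferable score gets to that ideal.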

[LG-16] SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors

Link: https://arxiv.org/abs/2606.23444
Authors: Pratyaksh Rao, Wancong Zhang, Randall Balestriero, Yann LeCun, Giuseppe Loianno
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Accurate dynamics models are critical for informed decision-making in robotic systems, particularly for agile aerial vehicles operating under uncertainty. Neural network dynamics models are attractive for capturing complex nonlinear effects, but existing predictive approaches struggle with long-horizon forecasting because their autoregressive rollout mechanism amplifies errors over time. Joint Embedding Predictive Architectures (JEPAs) offer a compelling alternative by modeling dynamics in latent space, yet prior JEPA-style methods for robot navigation have been studied primarily for kinematic-level planning, with limited investigation in high-frequency control. In this work, we introduce SkyJEPA, a JEPA-style model for real-time quadrotor control. The proposed approach combines a latent dynamics model with a novel physics-inspired prober that maps frozen latents to interpretable state, enabling physically grounded long-horizon prediction. Additionally, we combine the learned model with a sampling-based optimal control solution to take advantage of its predictive capabilities for real-time control on embedded hardware. Finally, to reduce the dependence on expensive and unsafe real-world data collection, we develop a structured pipeline for automated dataset generation. Extensive open-loop and outdoor closed-loop experiments demonstrate accurate prediction, robust zero-shot sim-to-real transfer, and strong generalization across diverse operating conditions.

[LG-17] Interpretable Kolmogorov-Arnold Network with Feature-Isolated Temporal Attention Mechanism for Electricity Load Forecasting

Link: https://arxiv.org/abs/2606.23425
Authors: Jinhao Li, Hao Wang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Symbolic Computation (cs.SC)

Abstract:Accurate electricity load forecasting is a crucial prerequisite for stable power system operations. While prevalent deep learning models deliver competitive performance, they often operate as black boxes and lack interpretability. The Kolmogorov-Arnold network (KAN) has emerged as a promising alternative because of its learnable activation function design, but its direct application to time-series forecasting faces challenges in modeling complex temporal data patterns. Moreover, simple integration into existing architectures, such as using KAN as a replacement for neural modules, cannot fully leverage KAN’s interpretability strengths. To address these gaps, this study develops LoadKAN, a novel hybrid and interpretable framework for load forecasting that combines a specifically designed feature-isolated temporal attention mechanism with a KAN module. The attention stage extracts temporal dynamics from each input feature independently, such as historical load and human mobility, providing distilled feature representations to the KAN module for interpretable predictions. When evaluated on datasets from three representative U.S. electricity markets, LoadKAN remains highly competitive with extensively tuned, state-of-the-art, black-box deep learning benchmarks. More importantly, LoadKAN’s interpretability enables a granular analysis of the learned non-linear relationships between six distinct mobility patterns and electricity load. Through KAN-learned activation functions, our quantitative sensitivity analyses on mobility features reveal complex and market-specific dependencies. These findings further demonstrate the ability of LoadKAN to generate insights often obscured by opaque black-box neural forecasting models.

[LG-18] Leveraging Similarities in Multi-Armed Bandits

Link: https://arxiv.org/abs/2606.23414
Authors: Khaled Eldowa, Thibaud Rahier, Augustin Cablant, Panayotis Mertikopoulos, Pierre Gaillard
Categories: Machine Learning (cs.LG)

Abstract:In many online learning and bandit problems, the actions we consider possess inherent similarities, for instance because they share latent traits, tags, or hierarchical structure. We study online learning with a similarity-structured action set, encoded by a rooted tree whose leaves are the actions and whose levels quantify how closely two actions are related. The loss sequence is assumed tree-compatible: losses of similar actions are constrained to be close. We establish an impossibility result showing that usual one-point bandit feedback cannot, in general, leverage range or tree-induced similarity, even under very strong similarity constraints. We then provide a unified set of algorithms which adapt to a wide range of richer feedback models, from semi-bandit feedback down to multi-point bandit protocols, including the minimal two-point feedback setting. We show these algorithms exhibit best-of-both-worlds guarantees and provably exploit action similarities by replacing the number of actions $K$ by a similarity-aware effective number of actions $K_{\mathrm{eff}}$ in the regret bounds. As an application, we show that under two-point feedback, it is possible to achieve $\sqrt{T}$ regret in Lipschitz bandits when $d \leq 2$.

[LG-19] Differential Spectral Damping: Gap-Adaptive Regularization for Ill-Conditioned Kernel Methods

Link: https://arxiv.org/abs/2606.23407
Authors: Praveg Vashishtha
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 3 tables, 1 figure. Complete source code and experiments are available at this https URL

Abstract:Kernel methods requiring matrix inversion – particularly Least-Squares Twin Support Vector Machines (LSTSVM) – suffer from exponential eigenvalue decay in their system matrices, producing severely ill-conditioned problems where standard Tikhonov regularization applies uniform damping regardless of eigenvector reliability. We propose Differential Spectral Damping (DSD), a regularization formula that adapts its penalty to localized eigengap structure: preserving eigenvectors with large spectral gaps (reliable per Davis-Kahan perturbation theory) while aggressively suppressing those with small gaps (directionally corrupted beyond recovery). We motivate DSD through a principled design procedure grounded in the Davis-Kahan $\sin(\Theta)$ theorem, systematically deriving the requirements for a reliability-aware damping function and selecting the exponential form for its smoothness, differentiability, and natural saturation properties. Through rigorous paired testing with fairly optimized baselines (including gradient-optimized Tikhonov receiving equal optimization opportunity), we demonstrate that DSD improves LSTSVM classification accuracy by +4.8 percentage points on real-world GINA ($d=970$, Cohen’s $d = 4.49$, $p < 0.0001$), +10.4 percentage points at $d=200$, and +2.6 percentage points on Madelon ($d=500$) – all using only principled spectral initialization while Tikhonov receives grid search. For pre-image reconstruction on manifold data, DSD ties Tikhonov at high perturbation noise ($p=0.99$) but slightly underperforms at lower noise levels; both reduce naive inversion error by $66\times$. We characterize the precise operating regime ($d \geq 100$, condition number $> 10^3$) and document where simpler methods suffice, providing practitioners with clear deployment guidance.
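
To make the "uniform damping" criticism concrete, here is a small sketch of the Tikhonov baseline only. It assumes a symmetric positive definite system matrix with eigenvalues $\lambda_i$, so that solving $(K + \alpha I)x = b$ scales each eigen-direction by the filter factor $\lambda_i / (\lambda_i + \alpha)$ relative to the exact inverse. The DSD formula itself is not given in the abstract and is not reproduced here; the helper names and the toy spectrum are hypothetical.

```python
def tikhonov_filter_factors(eigenvalues, alpha):
    """Per-direction shrinkage of the regularized inverse relative to the exact inverse."""
    return [lam / (lam + alpha) for lam in eigenvalues]


def eigengaps(eigenvalues):
    """Distance from each eigenvalue to its nearest neighbour in the spectrum."""
    gaps = []
    for i, lam in enumerate(eigenvalues):
        gaps.append(min(abs(lam - mu) for j, mu in enumerate(eigenvalues) if j != i))
    return gaps


# Toy spectrum with exponential decay, as described for ill-conditioned kernel systems.
eigenvalues = [10.0 ** (-k) for k in range(6)]
alpha = 1e-3

factors = tikhonov_filter_factors(eigenvalues, alpha)
gaps = eigengaps(eigenvalues)  # information that plain Tikhonov never looks at

condition_number = max(eigenvalues) / min(eigenvalues)
regularized_condition = (max(eigenvalues) + alpha) / (min(eigenvalues) + alpha)
```

The factors depend only on each eigenvalue and the single global `alpha`; the eigengaps computed alongside them play no role. A gap-adaptive scheme such as the one proposed would replace that single `alpha` with a per-direction penalty.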

[LG-20] Physics-Informed Modeling for Wood Thermal Analysis and Prediction

Link: https://arxiv.org/abs/2606.23402
Authors: Jingren Xie, Alex John Buckthal, Ryan Anthony O’Connor, Isak Worre Foged, Dim P. Papadopoulos
Categories: Machine Learning (cs.LG)

Abstract:Wood materials exhibit complex, spatially varying thermal properties that challenge traditional architectural assumptions of material homogeneity. Although data-driven approaches can directly map wood RGB images to their corresponding thermal responses, they operate as uninterpretable black boxes that prioritize statistical correlation and may absorb experimental noise rather than thermodynamic plausibility. To address these limitations, we present physics-informed deep learning frameworks that integrate partial differential equations (PDEs) to predict pixel-level thermal responses of spatially heterogeneous wood materials using wood RGB images and testbed temperature maps. Specifically, we investigate two distinct approaches to enforcing a normalized 2D steady-state heat transfer equation derived from the general heat transfer equation: Physics-Informed Convolutional Neural Networks (PICNNs), which embed physics as a soft penalty term in the loss function, and Physics-Integrated Convolutional Neural Networks (PInteCNNs), which hard-code an analytical approximator-predictor-corrector solver directly into convolutional neural networks. To validate our proposed approaches, we collect three real-world multimodal datasets of Poplar, Grandis Cross-Cut (Grandis-CC), and Grandis Radial-Cut (Grandis-RC) wood samples. We further demonstrate that embedding physical inductive biases successfully balances predictive accuracy, physical interpretability, and intra-species diversity, outperforming data-driven approaches in handling complex wood material heterogeneity and enabling the extraction of interpretable physical parameters. Project: this https URL

[LG-21] FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Link: https://arxiv.org/abs/2606.23370
Authors: Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Operating Systems (cs.OS)

Abstract:Device-side Large Language Models (LLMs) have grown explosively, offering stronger privacy and higher availability than their cloud-side counterparts. During LLM inference, both the model weights and the user data are valuable, and attackers may compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead to both the secure inference and the normal applications, due to two challenges: the inflexible resource isolation and the inefficient secure resource management. To address these challenges, this paper presents FlexServe, a fast and secure LLM inference system for mobile devices. The key idea is to decouple the access permission from the management permission of secure resources, so that the normal-world OS cannot access them but can still manage them as usual. First, FlexServe introduces a Recallable Resource Isolation mechanism to construct Recallable Secure Memory (Flex-Mem) and a Recallable Secure NPU (Flex-NPU). They can only be accessed by the secure world, but can be efficiently allocated and reclaimed by the normal-world OS. Based on them, FlexServe further introduces a FlexServe Framework to run secure LLM inference in the secure world. It works together with the normal-world OS to perform cooperative secure memory management. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves average TTFT speedups of 10.05X over the strawman and 2.44X over an optimized strawman.

[LG-22] Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

Link: https://arxiv.org/abs/2606.23364
Authors: Yuqing Wang
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: arXiv admin note: text overlap with arXiv:2506.24120

Abstract:Training dynamics is central to understanding neural networks, yet its theoretical analysis remains difficult even for simple architectures and becomes substantially more challenging for general modern architectures. In this paper, we propose a convergence framework for analyzing gradient descent (GD) dynamics under a broad family of neural network architectures and datasets beyond the neural tangent kernel (NTK) regime. The framework is formulated at the level of network blocks and covers architectures including pre-normalized multi-layer transformers. More precisely, under mild assumptions, we prove that for almost all initializations, GD with regular learning rates converges to the neighbourhood of a stationary point. This is mainly proved by establishing an iterate-dependent PL-type inequality through analyticity and measure-zero arguments, and by proving Lipschitz smoothness along the GD trajectory through polynomial generalized smoothness and a local relaxed dissipative condition. We further interpret the theorem under Xavier initialization and practical architectural scaling, showing that the learning rate scale depends on the depth and effective bottleneck dimensions rather than the largest width. Finally, we derive structural nondegeneracy implications for residual connections and function composition, and provide a generic characterization of global minimizers within our framework.

[LG-23] SOAP-Bubbles: Structured Weight Uncertainty for Neural Networks

Link: https://arxiv.org/abs/2606.23357
Authors: Adrian Robert Minut, Nico Daheim, Marco Miani, Mohammad Emtiyaz Khan, Wu Lin, Thomas Möllenhoff
Categories: Machine Learning (cs.LG)

Abstract:Structured weight-uncertainty can improve many aspects of deep learning, but it remains costly to estimate and difficult to implement. Here, we show that these issues can be addressed by adapting the SOAP optimizer. Our key idea is to run IVON, an existing diagonal-covariance variational method, in the eigenspace of SOAP’s preconditioner and then use the preconditioner to transform the diagonal estimate into a non-diagonal covariance. The resulting method has costs similar to those of SOAP and requires no drastic changes to training pipelines. We call the posteriors obtained in this way SOAP-Bubbles and our new optimizer Eigenspace-VON (EVON). We show that, for logistic regression, EVON recovers the exact Gaussian covariance and that, for language model pretraining, it yields significantly better results than existing diagonal-covariance methods. Our work makes it easier to estimate more expressive posterior distributions for deep learning at scale.

[LG-24] Superhuman AI for Generals.io Using Self-Play Reinforcement Learning

Link: https://arxiv.org/abs/2606.23348
Authors: Matej Straka, Viliam Lisý, Martin Schmid
Categories: Machine Learning (cs.LG)

Abstract:We present a superhuman AI agent for Generals.io, a real-time strategy game that requires both long-horizon planning and short-term tactics under strong imperfect information. Trained for four days on 4x NVIDIA H200 GPUs, our agent reaches #1 on the public 1v1 leaderboard of over 5,000 human players, leading the second-ranked player by the same margin that separates second place from 25th, and beats the two top-ranked humans head-to-head with a combined 199-70 record across 269 ladder matches. A key enabler is a JAX-native simulator that reaches tens of millions of frames per second on a single GPU, roughly a 10,000x speedup over the prior simulator. On top of this, we train a vision transformer policy end-to-end by self-play with a policy-gradient loop and sparse win/loss reward, using top-advantage sample filtering and an exponential moving average of the policy parameters. Taken together, our findings highlight what matters, and what does not, once a fast simulator removes the data bottleneck.
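
One ingredient named above, the exponential moving average (EMA) of policy parameters, is simple enough to sketch. The snippet below is a generic EMA update over a flat list of parameters; the decay value, the flat-list representation, and the function name `ema_update` are assumptions for illustration and are not taken from the paper.

```python
def ema_update(ema_params, params, decay=0.99):
    """Blend the current parameters into the running average, element-wise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Hypothetical two-parameter "policy" whose weights jump from 0 to 1 and stay there.
ema = [0.0, 0.0]
for _ in range(3):
    params = [1.0, 1.0]  # parameters after a (pretend) gradient step
    ema = ema_update(ema, params, decay=0.5)
```

The averaged copy lags behind the raw parameters (0.5, then 0.75, then 0.875 in this toy run), which is what smooths out step-to-step noise in the trained policy.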

[LG-25] GRIMIP: A General Framework for Instance-Specific Configuration of MIP Solvers Using LLMs

Link: https://arxiv.org/abs/2606.23299
Authors: Yidong Luo, Xuemin Chen, Chenguang Wang, Fangzhou Zhu, Tao Zhong, Tianshu Yu
Categories: Machine Learning (cs.LG)

Abstract:Configuring the hyperparameters of mixed-integer programming (MIP) solvers is a high-dimensional, instance-dependent optimization problem where suboptimal settings can degrade solving time by orders of magnitude. Default configurations are often suboptimal, while traditional tuning methods either suffer from the "cold-start" problem and inefficient search or heavily rely on expert experience. This paper introduces GRIMIP (General Reasoning for Instance-specific MIP configuration), a novel hybrid intelligence framework that integrates the semantic reasoning capabilities of Large Language Models (LLMs) with the sample-efficient search of Bayesian Optimization (BO). GRIMIP enables the LLM to function as a complete probabilistic surrogate within the BO loop, significantly improving performance and reducing sampling and evaluation costs. On seven benchmarks including MIPLIB, GRIMIP achieves over 40% reduction in Primal-Dual Integral on hard instances, outperforming SMAC and other LLM-assisted BO methods. By granting LLMs sufficient autonomy, GRIMIP combines the expert-level reasoning of LLMs with the efficient search of BO, achieving state-of-the-art performance.

[LG-26] Non-asymptotic estimates of the minimal risk in statistical learning

Link: https://arxiv.org/abs/2606.23295
Authors: Liming Wu (1), Sen Yang (2) ((1) Laboratoire de Mathématiques Blaise Pascal, CNRS-UMR 6620, Université Clermont Auvergne, (2) IASM, Harbin Institute of Technology)
Categories: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 42 pages

Abstract:In this paper we prove some concentration inequalities for two types of error probabilities in the Empirical Risk Principle (ERP) in statistical learning, which provide a lower bound and an upper bound for the minimal risk (in terms of the minimal empirical risk) with non-asymptotic high confidence. The usual boundedness condition of the empirical risk function is relaxed to the Gaussian or exponential integrability condition. The confidence of the lower bound of the minimal risk is shown to be independent of the number of training parameters and the dimension of the input vectors, allowing one to detect the deficiency of a learning machine efficiently; and the confidence of the upper bound of the minimal risk is proved to be high provided that the sample size $n$ is much greater than the box dimension of the parameter set $\Theta$ in the Orlicz metric $d_{\psi_1}$ associated with the risk functions. Our work is based on Talagrand’s concentration inequalities (the sharp versions by Bousquet and Klein-Rio), transport-entropy inequalities and the recent progress in the theory of empirical processes and statistical learning.

[LG-27] Attention mechanism for scalable mesh-based neural surrogates of free-surface fluids

Link: https://arxiv.org/abs/2606.23251
Authors: Federico Lanteri, Massimiliano Cremonesi
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:High-fidelity simulations of free-surface flows using Lagrangian methods such as the Particle Finite Element Method (PFEM) are computationally demanding due to continuous domain updates and repeated solution of the governing equations. This challenge is further amplified by non-Newtonian rheologies, where material nonlinearities increase computational cost. These limitations motivate the development of efficient surrogate models to approximate PFEM dynamics at reduced cost. While data-driven deep learning approaches are promising, a key challenge is designing models that operate on arbitrary and evolving geometries. We propose a self-attention-based neural surrogate for PFEM simulations of free-surface flows. The architecture leverages attention mechanisms to model node interactions and capture complex spatial dependencies, while preserving the PFEM mesh discretization. This provides a geometric and topological framework for remeshing and node redistribution, maintaining high-quality spatial discretization during rollouts, improving long-term stability, and enabling reconstruction of derived mechanical quantities via standard finite element operators. Two attention formulations are considered: a standard self-attention mechanism and a linear variant that reduces computational cost and improves scalability. The models are evaluated on two- and three-dimensional free-surface flow benchmarks with evolving geometries, varying material parameters, and non-Newtonian fluids. Results show accurate prediction of transient dynamics and final configurations, with significantly improved scalability. The mesh-based formulation also enables direct reconstruction of quantities such as stress fields. Overall, the framework provides an accurate and scalable surrogate strategy for PFEM simulations in engineering-scale applications.

[LG-28] Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

Link: https://arxiv.org/abs/2606.23243
Authors: Ran Piao, Tsai-Ning Wang, Martijn den Dekker, Linda Moonen, Hareld Kemps, Yuan Lu, Aaqib Saeed
Categories: Machine Learning (cs.LG); Sound (cs.SD)

Abstract:Clinical audio diagnosis in low-resource settings requires models that identify conditions from minimal examples without large annotated corpora. We propose Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis across federated hospital clients. FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.

[LG-29] Deep learning-based detection of cessation of breathing in pre-term infants

Link: https://arxiv.org/abs/2606.23213
Authors: Dineo Serame, Lionel Tarassenko, Mauricio Villarroel
Categories: Machine Learning (cs.LG)
Comments: 14 pages main text, 8 figures. Submitted to IEEE Journal of Biomedical and Health Informatics (JBHI). ACM classes: I.2.6. arXiv v1 submitted Mon, 22 Jun 2026.

Abstract:Apnoea of prematurity is characterised by recurrent episodes of cessation of breathing and remains difficult to detect reliably using routinely monitored physiological signals in the Neonatal Intensive Care Unit (NICU). Existing bedside monitors rely primarily on respiratory rate and oxygen saturation thresholds, often generating high false-positive alarm rates and missing short or irregular events. Improving automated detection using routinely acquired clinical signals could enhance identification of clinically meaningful events without additional sensing hardware. We evaluated deep learning-based detection of apnoea-related Cessation Of BrEathing (COBE) events using impedance pneumography (IP), electrocardiography (ECG), and photoplethysmography (PPG) signals from approximately 430 hours of NICU recordings collected from 24 pre-term infants. Three independent reviewers annotated COBE events, producing a dataset of 346 COBE and 608 non-COBE events. We compared a shallow convolutional neural network (CNN), residual networks (ResNets), and a ConvNeXt architecture using an independent held-out test set. Across all architectures, detection performance was influenced more strongly by signal modality than by architectural complexity. Unimodal IP-based models achieved balanced accuracies of 86.8-88.0%, outperforming ECG-derived (62.6-69.7%) and PPG-derived (65.1-66.4%) respiratory surrogates. Multimodal fusion yielded modest improvements over IP alone. The best-performing model, a ConvNeXt architecture combining IP and PPG inputs, achieved 88.7% balanced accuracy and an F1 score of 0.75 on the independent test set. These findings demonstrate that deep learning models applied to routinely monitored NICU signals can reliably detect COBE events and highlight the importance of signal modality in data-constrained neonatal monitoring settings.

[LG-30] Efficient Network Inference via Hardware-Aware Architecture Search, Model Pruning, and Quantization

Link: https://arxiv.org/abs/2606.23210
Authors: Lucas Heublein, Mark Deutel, Axel Plinge, Felix Ott
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
Comments: 7 pages, 7 figures

Abstract:Embedded global navigation satellite system (GNSS) interference monitoring requires fast and memory-efficient inference to process large volumes of raw in-phase and quadrature (IQ) samples in real time. At the same time, increasingly expressive deep neural networks (DNNs) are needed for robust interference classification and characterization across diverse signal conditions. This creates a fundamental tension between predictive performance and deployability on resource-constrained hardware. In this paper, we investigate efficient network inference for GNSS interference characterization using iterative structured pruning, post-training static quantization, and hardware-aware zero-shot neural architecture search (NAS). Starting from MCUNet as a compact baseline, we analyze how model compression and automated architecture optimization affect model size, computational complexity, and memory usage while maintaining task performance. Experiments on a GNSS interference dataset, covering both classification and generalized characterization, show the benefits of combining compression and hardware-aware design for embedded deployment. Our results provide practical guidance for developing compact machine learning (ML) models for real-time GNSS interference monitoring on embedded platforms (iMXRT1062 MCU, Raspberry Pi Zero 2W, and Raspberry Pi 5).
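
For readers unfamiliar with post-training static quantization, the following sketch shows the generic affine int8 scheme: a scale and zero point are fixed once from calibration data and then reused at inference. It is a textbook illustration under that assumption, not the authors' pipeline; the helper names and the calibration values are hypothetical.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive a fixed (scale, zero_point) pair from calibration data."""
    lo = min(min(values), 0.0)  # make sure 0.0 is inside the representable range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero data
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to clamped int8 codes using the pre-computed parameters."""
    return [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


calibration = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = calibrate(calibration)
codes = quantize(calibration, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(calibration, restored))
```

"Static" here means the quantization parameters are frozen after calibration, so inference needs only integer arithmetic plus one rescale, which is what makes the approach attractive on microcontroller-class targets.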

[LG-31] Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks

Link: https://arxiv.org/abs/2606.23208
Authors: Leona Hennig, Marius Lindauer
Categories: Machine Learning (cs.LG)

Abstract:Deep Learning (DL) has advanced various fields by extracting complex patterns from large datasets. However, the computational demands of DL models pose environmental and resource challenges. Deep Shift Neural Networks (DSNNs) present a solution by leveraging shift operations to reduce computational complexity at inference. Compared to common DNNs, DSNNs are still less well understood and less well optimized. By leveraging AutoML techniques, we provide valuable insights into the potential of DSNNs and how to design them in a better way. We focus on image classification, a core task in computer vision, especially in low-resource environments. Since we consider complementary objectives such as accuracy and energy consumption, we combine state-of-the-art multi-fidelity (MF) hyperparameter optimization (HPO) with multi-objective optimization to find a set of Pareto optimal trade-offs on how to design DSNNs. Our approach led to significantly better configurations of DSNNs regarding loss and emissions compared to default DSNNs. This includes simultaneously increasing performance by about 20% and reducing emissions, in some cases by more than 60%. Investigating the behavior of quantized networks in terms of both emissions and accuracy, our experiments reveal surprising model-specific trade-offs, yielding the greatest energy savings. For example, in contrast to common expectations, quantizing smaller portions of the network with low precision can be optimal with respect to energy consumption while retaining or improving performance. We corroborated these findings across multiple backbone architectures, highlighting important nuances in quantization strategies and offering an automated approach to balancing energy efficiency and model performance.
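
The "set of Pareto optimal trade-offs" mentioned above has a compact definition that a few lines of code make explicit. The sketch assumes every configuration is summarised by a `(loss, emissions)` pair with both objectives minimised; the numbers are invented for illustration and are not results from the paper.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimised)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (validation loss, emissions) pairs for five configurations.
configs = [(0.30, 5.0), (0.25, 8.0), (0.40, 4.0), (0.35, 6.0), (0.25, 9.0)]
front = pareto_front(configs)
```

Here `(0.35, 6.0)` and `(0.25, 9.0)` drop out because another configuration is at least as good on both axes. A multi-objective HPO run returns such a front rather than a single best configuration, leaving the accuracy-versus-energy choice to the practitioner.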

[LG-32] Bridge the Gaps: Heterogeneous Attributed Graph Clustering via Quaternion Representation Learning

Link: https://arxiv.org/abs/2606.23199
Authors: Xinxi Chen, Junyang Chen, Yiqun Zhang, Chuangming Qiu, Xiang Zhang
Categories: Machine Learning (cs.LG)
Comments: Accepted by IEEE Transactions on Emerging Topics in Computational Intelligence. Author-accepted manuscript

Abstract:Attributed graph clustering partitions nodes by jointly exploiting node attributes and graph topology. It remains challenging due to attribute heterogeneity and representation degradation during graph learning. Real-world datasets often contain heterogeneous attributes, i.e., numerical and categorical attributes, complicating unified representation learning. This challenge becomes more complex in attributed graphs, where constructing a clustering-friendly graph structure from attributes and topology remains difficult. Under deep graph architectures, repeated graph propagation causes node embeddings to become overly similar, leading to the over-smoothing (OS) effect. Meanwhile, graph representation learning amplifies topological influence, making discriminative attribute information harder to exploit for clustering, an effect we refer to as over-dominating (OD). To bridge these gaps, an end-to-end framework, Any-type attributed Graph REpresentation lEarning (AGREE), is proposed. It unifies attributed graphs and any-type attributed data through multi-level alignment and similarity-based graph construction. Quaternion-based graph convolution strengthens attribute interaction to alleviate OD, while shallow graph architectures help relieve OS. The learned embeddings are jointly optimized for graph reconstruction and clustering, without requiring a predefined number of clusters during training. Experiments on diverse benchmarks show that AGREE achieves strong overall performance in accuracy, robustness, and adaptability.

[LG-33] Stage-dependent integer-binary encoding in factorization-machine black-box optimization

Link: https://arxiv.org/abs/2606.23188
Authors: Ryo Ogawa, Mayumi Nakano, Yuya Seki, Shu Tanaka
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 9 figures

Abstract:Black-box optimization (BBO) deals with problems where objective functions lack explicit analytical forms and are expensive to evaluate. Factorization machine with quadratic-optimization annealing (FMQA) constructs a surrogate model using a factorization machine (FM) and optimizes it with an Ising machine. Conventional FMQA applies a single integer-binary encoding throughout the optimization process, although the encoding best suited to surrogate learning may differ from the one best suited to Ising-machine solution search. We propose a stage-dependent FMQA framework and derive conversion formulas between one-hot and domain-wall QUBO matrices that preserve the surrogate objective over feasible integer states up to an additive constant. We evaluate the OhDw variant, which employs one-hot encoding for learning and domain-wall encoding for search, on the Rastrigin function with input dimensions N = 2 and 5 and discretization levels q = 61 and 301. Across all conditions, the dominant factor governing optimization performance is the encoding used in the learning stage, with one-hot encoding consistently yielding lower residual errors than domain-wall or binary encoding. The additional benefit of switching to domain-wall encoding for solution search is condition-dependent. For N = 5 and q = 301, OhDw achieves a lower residual error and solutions closer to the global optimum than one-hot-only FMQA, whereas for N = 5 and q = 61 the latter achieves a lower residual error. These results indicate that one-hot encoding in the learning stage is the primary performance driver and that stage-dependent encoding can provide further improvement under finer discretization.
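
To make the two integer-binary encodings concrete, here is a minimal Python sketch. It is not the authors' code. It assumes the usual conventions: one-hot uses q bits with exactly one 1, and domain-wall uses q-1 bits with x leading ones. All function names are hypothetical. The last helper shows the linear bit-level map from domain-wall to one-hot, which is the kind of relation a QUBO conversion could build on; the paper's own conversion formulas are not reproduced here.

```python
def one_hot(x, q):
    """Encode integer x in {0, ..., q-1} with q bits, exactly one of which is 1."""
    if not 0 <= x < q:
        raise ValueError("x out of range")
    return [1 if i == x else 0 for i in range(q)]


def domain_wall(x, q):
    """Encode x with q-1 bits: x ones followed by zeros (a single 'wall')."""
    if not 0 <= x < q:
        raise ValueError("x out of range")
    return [1 if i < x else 0 for i in range(q - 1)]


def decode_domain_wall(bits):
    """Decode a feasible domain-wall state (non-increasing bits) to its integer."""
    if any(a < b for a, b in zip(bits, bits[1:])):
        raise ValueError("infeasible domain-wall state")
    return sum(bits)


def domain_wall_to_one_hot(bits):
    """One-hot bit i equals d[i-1] - d[i], with virtual boundary bits d[-1]=1, d[q-1]=0."""
    padded = [1] + list(bits) + [0]
    return [padded[i] - padded[i + 1] for i in range(len(padded) - 1)]
```

For q = 4 and x = 2, `one_hot` gives `[0, 0, 1, 0]` and `domain_wall` gives `[1, 1, 0]`. The domain-wall form saves one bit per integer variable, and its feasibility constraint is a chain of neighbor relations rather than a global sum.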

[LG-34] EML Trees Are Universal Approximators

Link: https://arxiv.org/abs/2606.23179
Authors: Joe Germany, Elie Abdo, Joseph Bakarji
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC); Numerical Analysis (math.NA)

Abstract:The recently introduced EML (Exp-Minus-Log) function acts as a continuous analogue of NAND gates, providing a compositional building block capable of representing elementary functions. In this work, we study the expressive power of tree-structured compositions of EML functions. We show that such trees enjoy a universal approximation property for functions in $W^{k,\infty}$ for $k \in \mathbb{N}$, drawing on classical neural network approximation arguments while exploiting the ability to explicitly construct EML trees that mimic polynomial representations. We further propose a learning algorithm for EML-type trees equipped with fitting parameters, and demonstrate its feasibility in practical optimization problems. Our results establish EML trees as a theoretically grounded framework for function approximation.
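
The abstract does not spell out the EML function. The sketch below assumes the definition suggested by the name, EML(x, y) = exp(x) - ln(y) for y > 0, and shows how small trees of that single operation can reproduce exp and ln. The helper names are hypothetical and this is only an illustration of the compositional idea, not the paper's construction.

```python
import math


def eml(x, y):
    """Assumed definition: EML(x, y) = exp(x) - ln(y), valid for y > 0."""
    return math.exp(x) - math.log(y)


def exp_via_eml(x):
    """Depth-1 tree: exp(x) - ln(1) = exp(x)."""
    return eml(x, 1.0)


def log_via_eml(x):
    """Depth-3 tree for ln(x), x > 0.

    eml(1, x)        = e - ln(x)
    eml(e - ln x, 1) = exp(e - ln x) = e^e / x
    eml(1, e^e / x)  = e - (e - ln x) = ln(x)
    """
    return eml(1.0, eml(eml(1.0, x), 1.0))
```

Under this assumed definition, the constant e is simply `eml(1.0, 1.0)`.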

[LG-35] Position: Correct Answer Wrong Mechanism – When AI Scientists Defend General Claims Their Own Data Contradicts ICML2026

Link: https://arxiv.org/abs/2606.23175
Authors: Steven Young Eulig
Categories: Machine Learning (cs.LG)
Comments: 8 pages body plus 12 pages references and appendix, non-archival upload for ICML 2026 AI for Science workshop, selected as spotlight paper

Abstract:AI scientist systems are described as tools, coauthors, or founders, but we evaluate them as if only the final answer matters. This position paper argues that outcome-only evaluation is insufficient, and that task outcome, mechanism fidelity, and epistemic honesty must be measured separately. Our evidence comes from 28 episodes of a coding agent attempting to rediscover a known particle identification observable in a Geant4 simulation, including an 8-episode probe across two additional frontier models. In 4/20 primary-model and 3/8 cross-model episodes, agents reach right-looking results through incorrect reasoning that breaks when conditions change, which we call Correct Answer, Wrong Mechanism (CAWM). Honesty and mechanism fidelity dissociate within a single agent trajectory. When given a partially misleading prior, all five agents reject the false component on evidence, yet one defends its chosen observable with physics inconsistent with its own data. In the simulation-based discovery setting studied here, coding agents prove reliable tools but unreliable scientific co-authors for open-ended claim-making, where co-author trust requires mechanism-fidelity verification they do not reliably self-apply. The failure is detectable, and we propose a lightweight test. A one-step regime-shift check needs only the agent’s claim and flags the over-generalized cases. A companion recomputation flags the remaining cases when the correct observable is known. Together, these checks flag every CAWM case in this study.

[LG-36] Substitution-Based Analysis of Structural Novelty for Generative Models of Materials

Link: https://arxiv.org/abs/2606.23166
Authors: Masahiro Negishi, Aron Walsh
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 27 pages (20 pages of main text). See this https URL for the code

Abstract:There has been rapid progress in generative artificial intelligence (AI) models for inorganic crystal design, which can efficiently generate large numbers of candidate compounds after being trained on databases of known crystals. However, it remains unclear whether they genuinely expand the accessible materials search space beyond conventional strategies such as elemental substitution within known structure types. We address this question by developing a workflow to assess whether AI-generated crystals are duplicates of training structures, reproducible by elemental substitution, or unmatched by either criterion. Applying this workflow to representative generative models reveals that 81-92% of chemically valid and metastable generated crystals are either training duplicates or substitution-derived structures. This tendency is particularly strong in high-symmetry crystal systems, even though many possible structural prototypes remain unexplored. Further analysis of the underlying structural fingerprints shows that low-symmetry structures beyond duplication or substitution can be interpreted as interpolation in training-data-rich regions, while high-symmetry duplicates appear to result from memorisation in training-sparse regions. Our findings highlight a limitation in the current generation of models that exhibit a bias towards known structural prototypes in the high symmetry regions, but enable wider exploration of the low-symmetry structural space.

[LG-37] Neural Parameter Calibration for Finite-State Mean Field Games

Link: https://arxiv.org/abs/2606.23155
Authors: Anna C.M. Thöni, Grégoire Lambrecht, Gökçe Dayanıklı, Yonathan Efroni, Tal Kachman, Mathieu Laurière
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Mean field games efficiently approximate a very large population of strategic agents. While these games can aid the understanding of complex systems, their deployment in real-world settings is challenged by the specification of their parameters: mean field games (MFGs) often involve hidden preferences, constraints, and interactions that can rarely be theoretically derived or directly observed. To address this gap, we present a neural network-based framework for learning parametric, finite-state MFGs from observed population dynamics. To do so, we formulate the parameter calibration as an inverse problem and use implicit differentiation to backpropagate through the games’ equilibrium. The resulting approach is fully differentiable and enables us to estimate flexible trajectory-wise parameter paths, including state- and time-dependent specifications without requiring observations of the individual agents’ actions or rewards. We provide a proof for the exactness of the gradient computation in a discrete-time formulation. We validate our framework through numerical experiments across four systems of increasing complexity, ranging from synthetic linear-quadratic benchmarks to real-world urban mobility datasets.

[LG-38] Weighted Score-Oriented Losses for Temporally Localized Event Prediction

Link: https://arxiv.org/abs/2606.23145
Authors: Edoardo Legnaro, Sabrina Guastavino, Francesco Marchetti
Categories: Machine Learning (cs.LG)

Abstract:Operational event-detection systems are rarely assessed by pointwise accuracy alone. In anomaly detection, changepoint detection, and warning systems, the utility of an alarm depends on its temporal position relative to an event. This produces a score-loss mismatch. Neural networks are commonly trained with classical loss functions, such as cross-entropy, whereas deployment decisions are obtained by thresholding network predictions, merging alarms through post-processing rules, and evaluating them with event-based metrics defined by detection windows and false-alarm costs. This paper studies a temporally localized specialization of weighted score-oriented loss (wSOL) for event prediction. Starting from score-oriented losses based on expected confusion matrices and from the weighted SOL framework of Marchetti et al., we consider temporal weights that discount near-event false positives and reduce false-negative penalties when an event is preceded by an admissible alarm. The resulting objective is differentiable with respect to the network predictions, and therefore can be optimized by back-propagation. It can be instantiated with balanced accuracy, true skill statistic, F1, critical success index, and related confusion-matrix scores. We evaluate the proposed approach by comparing cross-entropy, unweighted score-oriented loss, and wSOL on three benchmark datasets for time-series event prediction and detection. The results show that wSOL can improve performance when the evaluation utility is localized in time and is not already encoded by the pointwise labels.
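
The idea of a differentiable, confusion-matrix-based loss with temporal weights can be sketched as follows. This is a simplified stand-in, not the wSOL of the paper: it treats each predicted probability as a soft positive vote instead of averaging over a threshold distribution, it only down-weights false positives near an event, and it omits the reduced false-negative penalty the abstract mentions. The function names, `window`, and `discount` are hypothetical.

```python
def near_event_fp_weights(labels, window, discount):
    """Weight `discount` for non-event steps within `window` steps of an event, else 1."""
    events = [i for i, y in enumerate(labels) if y == 1]
    weights = [1.0] * len(labels)
    for i, y in enumerate(labels):
        if y == 0 and any(abs(i - e) <= window for e in events):
            weights[i] = discount
    return weights


def soft_confusion(probs, labels, fp_weights=None):
    """Soft (expected) confusion-matrix entries from predicted probabilities."""
    if fp_weights is None:
        fp_weights = [1.0] * len(probs)
    tp = sum(p * y for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    fp = sum(w * p * (1 - y) for p, y, w in zip(probs, labels, fp_weights))
    tn = sum((1 - p) * (1 - y) for p, y in zip(probs, labels))
    return tp, fn, fp, tn


def tss_loss(probs, labels, fp_weights=None, eps=1e-12):
    """Loss = 1 - true skill statistic, computed on the soft confusion matrix."""
    tp, fn, fp, tn = soft_confusion(probs, labels, fp_weights)
    tss = tp / (tp + fn + eps) - fp / (fp + tn + eps)
    return 1.0 - tss
```

With labels `[0, 0, 1, 0, 0]` and an alarm raised one step early, the unweighted loss penalizes the early alarm as a false positive, while `near_event_fp_weights(labels, 1, 0.0)` removes that penalty. Every term is a smooth function of the probabilities, which is what lets such an objective be optimized by back-propagation.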

[LG-39] The Fractal Neural Operator: Overcoming Spectral Bias in Chaotic Attractors via Prime-Harmonic Weierstrass Encodings

Link: https://arxiv.org/abs/2606.23123
Authors: Kanishk Awadhiya
Categories: Machine Learning (cs.LG); Analysis of PDEs (math.AP)

Abstract:Deep learning models, particularly Transformers and Neural Operators, exhibit a well-documented “spectral bias,” effectively acting as low-pass filters that smooth out high-frequency information. While benign in fluid dynamics, this bias is catastrophic for Chaotic Dynamical Systems, where the underlying strange attractor is characterized by fractal geometry and infinite spectral density. We introduce the Fractal Neural Operator (FNO), a novel architecture that utilizes a non-resonant prime number basis to approximate continuous dynamical systems. Unlike geometric encodings ($2^k$), which suffer from spectral gaps and resonance, our Harmonic Weierstrass Encoder injects infinite spectral resolution into the latent space. We demonstrate that FNO extends the valid prediction horizon of the Lorenz-63 system to 347 Lyapunov times, exceeding state-of-the-art Reservoir Computing baselines by a factor of 2.3x. These results suggest that “chaos” is not inherently unpredictable to neural networks, but rather requires non-differentiable, fractal embedding manifolds.

[LG-40] Temporal-Spectral Alignment with Frequency Adaptation for Source-Free Time-Series Adaptation

Link: https://arxiv.org/abs/2606.23120
Authors: Shichang Meng, Linquan Wu, Xuan Ai, Linqi Song
Categories: Machine Learning (cs.LG)

Abstract:The goal of source-free domain adaptation (SFDA) for time-series data is to transfer knowledge from a pre-trained source model to an unlabeled target domain without requiring access to source data, while addressing feature shift and temporal drift inherent in the signals. Although existing approaches have explored temporal dynamics in unsupervised source-free adaptation, they largely overlook spectral shifts in time-series data. Towards this end, we propose a novel approach termed temporal-Spectral Alignment with Frequency Adaptation (SAFA) for source-free time-series domain adaptation. Specifically, we first model the source domain at multiple scales by jointly capturing temporal dependencies and spectral characteristics. To adapt time-series data in the target domain, we introduce a trainable frequency adaptation module that modulates the phase and amplitude of target signals in the frequency domain to align them with the source distribution. Extensive experiments on multiple benchmark datasets demonstrate the efficacy and robustness of SAFA.
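
Modulating amplitude and phase in the frequency domain can be illustrated with a small pure-Python sketch. It is not the SAFA module: the per-frequency parameters here are fixed inputs rather than trained, a naive O(N^2) DFT replaces an FFT, and the names `amp_scale` and `phase_shift` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def frequency_adapt(x, amp_scale, phase_shift):
    """Scale the amplitude and shift the phase of each frequency bin, then invert.

    The real part is returned; the output is exactly real only when
    amp_scale is symmetric and phase_shift antisymmetric across bins.
    """
    spec = dft(x)
    adapted = [a * s * cmath.exp(1j * p)
               for s, a, p in zip(spec, amp_scale, phase_shift)]
    return [v.real for v in idft(adapted)]
```

With `amp_scale` all ones and `phase_shift` all zeros the signal is returned unchanged, which is a natural initialization if such parameters were made trainable.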

[LG-41] Minimax Quantile Lower Bounds for Interactive Statistical Decision Making with Privacy

Link: https://arxiv.org/abs/2606.23096
Authors: Raghav Bongole, Amirreza Zamani, Tobias J. Oechtering, Mikael Skoglund
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:Minimax risk and regret are expectation-based criteria and do not capture rare but consequential failures. To address this concern, we develop a $\delta$-explicit minimax-quantile theory for interactive statistical decision making (ISDM). We first provide structural relations between minimax quantiles, lower minimax quantiles, and minimax risk. This includes a quantile-to-expectation conversion and an equivalence between strict and lower minimax quantiles outside a countable set of confidence levels. We then derive two converse tools for ISDM: a high-probability interactive Fano’s method and a high-probability interactive Le Cam’s method. Then, we show that mutual-information (MI) privacy can be handled in the same framework by restricting the admissible decision class. For coordinatewise Gaussian privatization, we derive a two-point template that isolates the privacy-induced variance inflation. We instantiate this template for Gaussian mean estimation, and use the same two-point strategy directly for two-armed Gaussian bandits. We then derive a minimax quantile lower bound for the $K$-armed Gaussian bandit problem, showing that the interactive Fano method captures the exploration cost over multiple possible best arms. The resulting lower bounds are explicit in the confidence level $\delta$ and in the privacy budget for the private problems. They yield $\log(1/\delta)/n$ scaling for squared-error Gaussian mean estimation, $\sqrt{T\log(1/\delta)}$ scaling for two-armed bounded-mean Gaussian bandits, and $\sqrt{KT\log(1/\delta)}$-type scaling for the $K$-armed bandits, with privacy appearing through a Gaussian variance-inflation factor for the private problems.

[LG-42] FlowTrain: Flow-Based Decoupled Training for Industrial-Grade Vision-Language Models

Link: https://arxiv.org/abs/2606.23087
Authors: Zhida Jiang, Zhaolong Xing, Yang Pei, Xiaolong Chen, Yuanhang Xiao, Chengzhi Huang, Xiyu Liu, Haopeng Liu, Qingyuan Sang, Lingfeng Zhou, Jiaxing Wang, Zicheng Zhang, Wenzhe Wang, Xinyu Liu, Yan Li, Zhen Chen, Ke Zhang
Categories: Machine Learning (cs.LG)

Abstract:Industrial-grade distributed training of vision-language models (VLMs) remains far less efficient than that of unimodal LLMs. Existing solutions either follow a monolithic design that assigns uniform parallelism to heterogeneous modules or adopt a disaggregated deployment that separates modules while executing them as a batch-synchronized pipeline. In this paper, we highlight that the above solutions are still not sufficient, and VLM training can be further decoupled. To this end, we present FlowTrain, a flow-based decoupled training framework that reformulates VLM training as a producer-consumer dataflow coordinated through a unified memory pool. The encoder and backbone can progress independently over a global virtual address space. Since this execution decoupling fundamentally changes the optimization objective of allocation and scheduling, FlowTrain further introduces a heterogeneous parallel allocator that assigns module-specific parallelism strategies by solving a throughput matching problem. The dynamic packing scheduler is used to construct balanced microbatches at runtime according to the actual LLM-side computation cost. Extensive experiments on real-world workloads show that FlowTrain achieves over 50% MFU and up to 1.7x throughput improvement, narrowing the efficiency gap to LLM-only training.

[LG-43] PeLAP-A: Adaptive Latent Pruning for Lightweight Latent Diffusion Models

Link: https://arxiv.org/abs/2606.23086
Authors: Kissa Zahra, Zaib Un Nisa
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures

Abstract:Latent diffusion models achieve strong generative performance by operating in a compressed latent space produced by a variational autoencoder (VAE). However, it remains unclear whether all latent channels contribute equally to the diffusion process, or whether significant redundancy exists. We introduce PeLAP-A (Adaptive Latent Pruning for Diffusion), a lightweight framework that augments a standard latent diffusion pipeline with a learnable channel-wise importance predictor. A two-layer MLP operating on globally pooled latent features produces a soft mask that suppresses unimportant latent channels before they enter the denoising UNet. The entire system is trained jointly on CIFAR-10 under a combined diffusion, reconstruction, and sparsity loss. Experiments reveal a striking result: under aggressive sparsity regularization (lambda = 0.01), the importance predictor drives all latent channels to near-zero yet the denoising UNet achieves lower diffusion loss (0.0236 vs. 0.0240) and lower VAE reconstruction MSE (22.59 vs. 24.67) compared to the unpruned baseline. We term this the sparsity collapse phenomenon and provide an analysis of why it occurs and what it reveals about the information requirements of latent diffusion models. These findings constitute an exploratory study of sparsity dynamics in latent diffusion training, and demonstrate that denoising UNets can remain remarkably robust to latent channel suppression even under aggressive regularization. Code is available at: this https URL.
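
The channel-wise importance predictor described above (global pooling, a two-layer MLP, a soft mask) can be sketched in plain Python. This is an illustration under assumptions, not the PeLAP-A code: the ReLU hidden activation, the sigmoid output, and the mean-of-mask form of the sparsity penalty are guesses, and all names are hypothetical. Only the weight lambda = 0.01 comes from the abstract.

```python
import math


def global_avg_pool(latent):
    """latent: list of C channels, each a flat list of values -> list of C means."""
    return [sum(ch) / len(ch) for ch in latent]


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def channel_mask(latent, w1, b1, w2, b2):
    """Two-layer MLP on pooled features -> one soft mask value in (0, 1) per channel."""
    pooled = global_avg_pool(latent)
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]  # assumed ReLU
    logits = linear(hidden, w2, b2)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]     # assumed sigmoid


def apply_mask(latent, mask):
    """Scale every value of channel c by mask[c] before it reaches the denoiser."""
    return [[m * v for v in ch] for ch, m in zip(latent, mask)]


def sparsity_penalty(mask, lam=0.01):
    """Assumed penalty: lambda times the mean mask value."""
    return lam * sum(mask) / len(mask)
```

In this form the penalty is minimized by pushing every mask value toward zero, so a collapse of all channels is the cheapest solution unless the diffusion and reconstruction terms push back.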

[LG-44] Counterfactual learning of new adaptive instructional policies using logged data

Link: https://arxiv.org/abs/2606.23015
Authors: Samuel Girard (SODA), Sein Minn (AIT), Amel Bouzeghoub (IP Paris, TSP - INF, ACMES-SAMOVAR), Jill-Jênn Vie (SODA)
Categories: Machine Learning (cs.LG)

Abstract:Optimizing instructional policies in Intelligent Tutoring Systems (ITS) typically requires costly online experimentation or student simulators that may fail to capture real-world dynamics. This paper introduces an offline contextual bandit framework that learns new adaptive policies directly from logged interaction data. By mapping student-item interactions onto a continuous latent proficiency-difficulty scale using a Rasch model, we cast the tutoring process as a continuous stochastic bandit problem. We propose a novel reward function designed to optimize “flow” by balancing task challenge with student success. Our approach includes a round-specific behavior policy estimation that serves as both a propensity model for off-policy evaluation and a diagnostic tool for ITS adaptivity. We demonstrate the efficacy of this framework across four large-scale real-world datasets, achieving consistent policy improvements over the logged behavior policy. The results show that effective instructional policies can be learned and visualized within seconds of computation, providing a scalable path for improving adaptive learning systems without further data collection.
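
The Rasch model places student proficiency and item difficulty on one scale, with the success probability given by a logistic function of their difference. The sketch below shows that mapping plus a toy item picker. The picker and its `target` success rate are hypothetical illustrations of "balancing challenge with success"; the paper's actual reward function is not given in the abstract.

```python
import math


def rasch_p_correct(theta, difficulty):
    """Rasch model: P(correct) = sigmoid(proficiency - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def pick_item(theta, difficulties, target=0.5):
    """Hypothetical policy: index of the item whose predicted success is closest to `target`."""
    return min(range(len(difficulties)),
               key=lambda i: abs(rasch_p_correct(theta, difficulties[i]) - target))
```

For a student with `theta = 0.0` and item difficulties `[-2.0, 0.0, 2.0]`, `pick_item` returns index 1, the item the student is predicted to solve half of the time.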

[LG-45] A Novel Approach to Temporal QoS Estimation via Extended Kalman Filter-Incorporated Latent Feature Analysis

Link: https://arxiv.org/abs/2606.23010
Authors: Ye Yuan, Song Wang, Hongxun Zhou, Ling Wang, Xin Luo
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, 3 tables

Abstract:Predicting temporal Quality of Service (QoS) data is critical for optimizing network services and rationalizing resource allocation in cloud computing and service-oriented systems. Existing mainstream methods have achieved promising predictive performance. However, their purely data-driven manner limits their ability to capture non-stationary temporal patterns, thereby leading to accuracy degradation when temporal QoS data exhibits fluctuations. To tackle this limitation, we propose a novel Extended Kalman Filter-Enhanced Latent Feature Analysis (EKL) model to perform efficient and accurate temporal QoS prediction from the perspective of bidirectional model-data-driven learning. Its main idea is three-fold: a) designing a model-driven feature producer to obtain the temporal latent features to capture the intricate temporal pattern following the principle of an Extended Kalman Filter; b) building a data-driven feature producer based on the alternating least squares algorithm to identify time-invariant latent features describing intrinsic user-service characteristics; c) exploiting a density-oriented parallel strategy that achieves workload balancing by sorting users in accordance with their service invocation density, which effectively elevates computational efficiency. In addition, we provide a rigorous theoretical analysis to formally prove the convergence of the proposed EKL. Experimental evaluations conducted on real-world temporal QoS datasets reveal that our proposed EKL surpasses existing state-of-the-art models with respect to both computational efficiency and prediction accuracy for missing temporal QoS data.

[LG-46] Do Sparse Autoencoders Learn Meaningful Concept Hierarchies?

Link: https://arxiv.org/abs/2606.22994
Authors: Nils Grandien, David Steinmann, Felix Friedrich, Kristian Kersting
Categories: Machine Learning (cs.LG)

Abstract:Sparse autoencoders (SAEs) have become an important tool for unsupervised concept discovery in large models. To make the resulting feature spaces more interpretable and manageable, recent approaches have begun imposing hierarchical structure, either explicitly or as an implicit effect of training constraints, yet rigorous comparison remains difficult. There are no agreed-upon requirements for what a meaningful feature hierarchy should satisfy, and evaluation has largely relied on qualitative illustrations with fragmented quantitative protocols. To address this, we derive a set of key requirements for generalization/specialization hierarchies in unsupervised concept discovery, drawing on semantic net and taxonomy research alongside recent SAE work, and use them to derive a concrete evaluation protocol. Applying this protocol to current SAE approaches trained on visual data, we find that while feature spaces generally provide a basis for sensible hierarchies, establishing good hierarchical structure remains challenging. In particular, feature absorption, both in its well-known hard form and in a continuous, soft form, systematically compromises hierarchy quality, pointing to a fundamental tension that future approaches will need to navigate.

[LG-47] TaLK: Text-attributed Graph Dataset Distillation via Coupling Language Model with Graph-Aware Kernel

Link: https://arxiv.org/abs/2606.22975
Authors: Yeongho Kim, Yeonje Choi, Kijung Shin
Categories: Machine Learning (cs.LG)

Abstract:Text-attributed graphs (TAGs) are widely used in many real-world domains, and learning on TAGs requires jointly modeling text semantics and graph structure. A standard approach for modeling TAGs is to combine a language model (LM) and a graph neural network (GNN), but joint training is computationally expensive and difficult to scale. Dataset distillation is a promising way to reduce training costs, but existing methods are not well suited to TAGs because they are typically designed for a single modality or still require repeatedly training expensive LM-GNN models on the full dataset during distillation. To address this, we propose TaLK, an effective dataset distillation method for TAGs that couples an LM with a graph-aware neural tangent kernel. This design enables efficient dataset distillation, avoiding repeated joint training on the full dataset while reflecting both textual and structural information for effective TAG this http URL on multiple TAG benchmarks show that TaLK consistently outperforms existing baselines and achieves up to 97% of full-dataset performance with only 1% synthetic data.

[LG-48] Topological Out-of-Domain Generalization in Dynamical Systems Reconstruction

Link: https://arxiv.org/abs/2606.22969
Authors: Georg Trede, Charlotte Ricarda Doll, Elias Weber, Daniel Durstewitz
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)

Abstract:Predicting the behavior of dynamical systems (DS) beyond the dynamical and parameter regimes observed in training is a pivotal and essentially unresolved problem in scientific ML. It is central to any good scientific theory, which we expect to be able to make predictions about regimes not covered by currently available data. Recent hierarchical and hyper-network guided approaches for DS reconstruction (DSR) enable training on many DS simultaneously, and revealed that extracted latent features are often related to crucial control parameters of the underlying DS that varied across the training corpus. However, true out-of-domain forecasting abilities of these models, e.g., across tipping points, remain limited, and fine-tuning, or even full model retraining, on time series from the new dynamical regime is usually required. Here, we mathematically analyze the root of these limitations in previous model formulations and identify three core shortcomings rooted in a mismatch between structural assumptions of the reconstruction model and typical properties of physical systems. We propose a combination of remedies for these shortcomings, most importantly feature splitting, and furthermore derive a closed-form bound on the reliable extrapolation range. We demonstrate empirically that our techniques allow for accurate zero-shot prediction into new dynamical regimes, outside the observed training regime, as, e.g., encountered across tipping points.

[LG-49] DT-GOL: Dual-Track Geometric Online Learning in Nonstationary Environment with Label Delay

Link: https://arxiv.org/abs/2606.22950
Authors: Yulin Wang, Yi He, Dianlong You, Di Wu
Categories: Machine Learning (cs.LG)

Abstract:Online learning is crucial for handling complex data streams in big data applications. Recent research has begun to focus on dynamic scenarios, i.e., non-stationary environments. However, a crucial yet often overlooked aspect is label latency, where new data may not receive labels in time due to the slow and expensive labeling process, thus hindering rapid adaptation to dynamic environments. To resolve this impasse, we propose Dual-Track Geometry Online Learning (DT-GOL), a novel framework that shifts from temporal compensation to spatial reasoning to bridge the supervised latency gap. By modeling the delay challenge as a semi-supervised task, we leverage real-time topological evolution of features as a reliable geometric surrogate for unobservable conceptual changes to achieve proactive supervised adaptation within the delay window. Unlike rigid self-training, we introduce a dynamic evidence calibration mechanism that distills geometric information into soft labels that perceive uncertainty, effectively mitigating the confirmation bias inherent in hard pseudo-labels. Furthermore, to resolve the stability-plasticity dilemma, we design a decoupled dual-track architecture in which a master learner serves as a stable anchor, updated strictly from delayed ground truth, while a transient branch leverages soft geometric knowledge for low-risk forward adaptation. Extensive experiments on real and synthetic datasets demonstrate that DT-GOL significantly outperforms existing state-of-the-art baseline methods, especially in scenarios with concept drift.

[LG-50] Neural Operator Processes for Probabilistic Operator Learning under Partial Observations

Link: https://arxiv.org/abs/2606.22946
Authors: Jose Miguel Lara-Rangel, Serge Guillas
Categories: Machine Learning (cs.LG)

Abstract:Neural operators learn mappings between function spaces, but are typically developed with dense input-output training fields and fully observed inputs at inference. Many scientific problems require instead predicting solution fields from sparse, irregular, or partial observations under uncertainty. We introduce Neural Operator Processes (NOPs), a framework that unifies neural-process conditioning with neural-operator decoding to predict full output fields from limited context. NOPs condition on sparse joint input-output observations and support deterministic and probabilistic prediction within a shared encoder-decoder architecture. We study two conditioning strategies, convolutional pooled summaries and query-aligned attention, and analyze how their interaction with latent stochastic variables depends on PDE geometry. Across function regression and three PDE benchmarks, we find that sparse conditional operator learning is viable and can match dense-grid behavior in several regimes, that preserving local context-query geometry is essential in non-periodic settings but less so in spectrally smooth periodic regimes, and that uncertainty-aware operator learning succeeds when latent conditioning complements rather than overwrites the local geometric pathway. These results provide a basis for probabilistic operator learning under partial observations and help bridge operator learning and probabilistic meta-learning in function space.

[LG-51] CITADEL: CSI-Based Jamming Detection and Open-Set Classification for IIoT Networks

Link: https://arxiv.org/abs/2606.22939
Authors: Aymen Bouferroum (FUN), Ildi Alla (uni.lu), Valeria Loscri (FUN), Abderrahim Benslimane (AU), Vincent Lenders (uni.lu)
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Radio frequency jamming poses a critical threat to the availability of wireless Industrial Internet of Things (IIoT) networks. Existing detection and classification techniques are poorly suited to this setting: coarse signal-strength and cross-layer features lack information richness, while raw I/Q baseband approaches require hardware and throughput that is impractical at the scale of hundred-node IIoT deployments. This paper presents CITADEL, a lightweight two-stage hierarchical pipeline that uses only Channel State Information (CSI) measurements, which are natively available on commodity IIoT devices, to detect and classify jamming attacks including previously unseen ones. While prior work has shown that jamming leaves observable CSI signatures, CITADEL is the first system to translate this insight into an end-to-end pipeline that jointly achieves closed-set classification of known attacks, open-set detection of zero-day attacks, and resistance to adversarial evasion. Evaluated across 6 known attack types and 15 zero-day scenarios, CITADEL achieves 100% known-attack detection and 97.1% zero-day detection at a 0.4% end-to-end false positive rate. Under adversarial evaluation spanning white-box and black-box threat models, gradient-based evasion remains below 2% across all tested perturbation budgets and the strongest published CSI attack generator achieves less than 5% average evasion. A systematic comparison against eight baselines confirms that no existing method achieves comparable performance on CSI data across all three axes: detection, generalization, and robustness. The full pipeline completes inference in 14.2 ms at 95.9 mJ on an edge GPU, establishing CITADEL as a practical solution for large-scale IIoT network security.

[LG-52] FORGE: Fused On-Register Gradient Elimination for Memory-Efficient LLM Training

Link: https://arxiv.org/abs/2606.22932
Authors: Dikshant Kukreja, Kritarth Prasad, Avinash Anand, Zhengkui Wang, Erik Cambria, Timothy Liu, Aik Beng Ng, Simon See, Bapi Chatterjee
Categories: Machine Learning (cs.LG)
Comments: 38 pages, 14 figures, 20 tables

Abstract:Reverse-mode differentiation computes every weight gradient, writes it to memory, and only then lets the optimizer read it back. This two-phase schedule sets the memory ceiling of modern training: at the seam between the phases, every layer’s gradient is live at once. We argue that this materialized gradient is an artifact of how differentiation is staged, not a quantity that learning requires – and we eliminate it. FORGE folds the optimizer step into the backward pass and applies it one tile at a time, entirely in registers, so each gradient tile is consumed the instant it is produced and never becomes a tensor. The fusion changes only when the update happens, not what it computes: in full precision the fused step is provably exact – the identical optimizer update, for every element-wise rule – and that exactness survives tensor- and sequence-parallel sharding; in the bf16 and 8-bit regimes used in practice it is faithful rather than bit-identical, its deviation bounded and, for the weight store, rendered unbiased by stochastic rounding. Because each gradient tile is born and consumed in the same registers, it is never converted down to bf16 to be stored and read back; FORGE thus preserves the full-precision fidelity that both bf16 and 8-bit optimizers lose to that conversion. Nor is the method tied to one architecture or one optimizer: linear layers are ubiquitous, and FORGE reclaims the gradient memory of any of them under any element-wise rule. Empirically FORGE more than halves the memory of an optimizer step and, at the small batch sizes typical of fine-tuning and continued pretraining, runs about 1.5x faster; integrated into tensor-parallel Megatron-LM it fits 8B training at four times the micro-batch a standard optimizer allows on the same GPUs.
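
The exactness claim for element-wise rules can be illustrated with a toy sketch. It assumes plain SGD on a single linear layer `y = W x` written with Python lists; the tile size and the helper names (`grad_entry`, `two_phase_step`, `fused_step`) are hypothetical and not taken from the paper, and nothing here models registers, bf16, or sharding. It only shows that consuming each gradient entry as soon as it is produced gives the same weights as materializing the whole gradient first.

```python
# Sketch only: plain SGD on one linear layer y = W x, using Python lists.
# Tile size and helper names are illustrative, not the paper's kernels.

def grad_entry(dy, x, i, j):
    """dL/dW[i][j] for y = W x, summed over the batch in a fixed order."""
    return sum(dy_b[i] * x_b[j] for dy_b, x_b in zip(dy, x))

def two_phase_step(W, dy, x, lr):
    """Baseline: materialize the full gradient, then run the optimizer."""
    rows, cols = len(W), len(W[0])
    G = [[grad_entry(dy, x, i, j) for j in range(cols)] for i in range(rows)]
    return [[W[i][j] - lr * G[i][j] for j in range(cols)] for i in range(rows)]

def fused_step(W, dy, x, lr, tile=2):
    """Fused: walk the weight matrix tile by tile and update in place."""
    rows, cols = len(W), len(W[0])
    out = [row[:] for row in W]
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    g = grad_entry(dy, x, i, j)  # held only in a local variable
                    out[i][j] -= lr * g          # element-wise SGD update
    return out

W = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6], [0.7, -0.8, 0.9]]
x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]     # batch of two inputs
dy = [[0.1, -0.2, 0.3], [-0.4, 0.5, 0.6]]   # upstream gradients dL/dy
same = fused_step(W, dy, x, lr=0.01) == two_phase_step(W, dy, x, lr=0.01)
```

Because both paths perform the same floating-point operations per element in the same order, `same` is exact equality rather than an approximate match. The fused path never builds the `G` matrix, which is the memory the abstract describes reclaiming.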

[LG-53] EEG Benchmarking Needs a Task Specification Layer: NeuroDoc for Rulebook-Guided Executable Benchmark Construction

Link: https://arxiv.org/abs/2606.22925
Authors: Chengxuan Qin, Zhige Chen, Shu Peng, Rui Yang, Jiping Cui, Yikai Dong, Jun Li, Liu Peng, Zhida Shang, Mingze Tang, Kay Chen Tan, Jibin Wu
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Electroencephalography (EEG) foundation models increasingly rely on multi-dataset training and evaluation, yet public EEG datasets still lack a shared task specification layer that can turn heterogeneous recordings into reusable benchmark units. Existing standards organize files, metadata, and provenance, but they do not specify EEG tasks under a common language and rulebook, leaving critical task semantics scattered across papers, code, and manual interpretation. We investigate whether heterogeneous public EEG datasets can be standardized through a structured task specification language paired with a shared rulebook. Our methodology represents each benchmark entry as a task document synchronized with an executable task kernel, with the rulebook defining task fields, evidence requirements, document-kernel alignment, review states, and machine-checkable constraints. Using this methodology, we release a community-reviewed EEG benchmark corpus centered on 53 completed and reviewed entries with 245 task definitions spanning diverse paradigms, and we introduce NeuroDoc and NeuroAudit as the operational support layer for rulebook-guided drafting, upgrading, review, amendment, and release management. We further examine whether the resulting benchmark units can be instantiated in a shared downstream setting across four EEG foundation model backbones, providing execution-based evidence for reusable, auditable, and executable EEG benchmarking infrastructure.

[LG-54] GRAIN: Group Aggregation via Min-Norm Objective

Link: https://arxiv.org/abs/2606.22917
Authors: Nghia Bui, Jiarui Yao, Lijing Wang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Learning instability is a long-standing problem across machine learning, but it is especially acute in the overparameterized regime that defines modern deep learning: large models fine-tuned or trained on limited data traverse flat loss landscapes with many nearly-equivalent minima, and stochastic factors (initialization, data order, dropout, hardware non-determinism) can route optimization to very different solutions. The rise of large pretrained models (LPMs) makes the problem more urgent: training cost is high, downstream data is often small, and repeated runs for variance reduction are prohibitive. We introduce GRAIN (Group Aggregation via min-norm objective), a lightweight training algorithm that replaces the mean aggregation used in mini-batch optimization (both across mini-batches and within a mini-batch) with a min-norm convex combination of group-wise gradients. GRAIN guarantees a non-negative inner product between the aggregated update and every group gradient, resolving intra- and inter-batch gradient conflict, and retains an $\mathcal{O}(1/T)$ convergence rate comparable to SGD. Under mild smoothness and absolute-continuity assumptions, the min-norm solution differs almost surely from the arithmetic mean, which yields a uniform-stability bound for GRAIN strictly tighter than the standard bound for SGD. Empirically across generation, classification, and regression at LPM scale, GRAIN delivers consistent improvements in mean performance and reductions in run-to-run variance over a broad suite of tasks, with no extra training-time or storage cost beyond a single backward pass.
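
The non-negative inner product property is easy to check for two groups, where the min-norm convex combination has a standard closed form. The sketch below is an illustration under that two-group assumption; the abstract does not say how the paper solves the general problem, and the function name `min_norm_pair` is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_pair(g1, g2):
    """Min-norm point of the segment {a*g1 + (1-a)*g2 : 0 <= a <= 1}.

    Returns the aggregated direction d and the weight a placed on g1.
    """
    diff = [p - q for p, q in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:                       # identical gradients
        return list(g1), 0.5
    # Unconstrained minimizer of ||g2 + a*(g1 - g2)||^2, then clip to [0, 1].
    a = dot([-d for d in diff], g2) / denom
    a = min(1.0, max(0.0, a))
    d = [a * p + (1.0 - a) * q for p, q in zip(g1, g2)]
    return d, a

# Two conflicting group gradients: their mean points against g1.
g1 = [1.0, 0.0]
g2 = [-3.0, 1.0]
mean = [(p + q) / 2 for p, q in zip(g1, g2)]
d, a = min_norm_pair(g1, g2)
```

Here `dot(mean, g1)` is negative, so the averaged update would increase the loss of the first group to first order, while `d` has a non-negative inner product with both `g1` and `g2`.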

[LG-55] PromptDyG: Test-Time Prompt Adaptation on Dynamic Graphs ICML2026

Link: https://arxiv.org/abs/2606.22914
Authors: Guoguo Ai, Chaoxi Niu, Hui Yan, Joey Tianyi Zhou, Yew-Soon Ong, Guansong Pang
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Abstract:Activities in numerous evolving systems can be represented as dynamic graphs in snapshot form at different time intervals, i.e., discrete-time dynamic graphs (DTDGs). Existing methods show impressive advances in capturing historical temporal evolution patterns in DTDGs, but they focus on an offline learning setting, where models are trained once on historical snapshots and then evaluated on all subsequent graph snapshots without further updating. This fails to capture 1) the evolving complexity across graph snapshots and 2) the distribution shift in the testing graph snapshots. To address these problems, we propose PromptDyG, a novel framework that leverages unsupervised test-time Prompt adaptation for Dynamic Graph learning under a live-update online setting. The key insight is that an expressive dynamic graph prompt can be learned on a frozen backbone via minimization of feature-wise, label-free entropy to efficiently and continuously model the evolving patterns. We show theoretically that this unsupervised prompt adaptation can guarantee a larger similarity margin between positive and negative pairs, facilitating more accurate dynamic predictions. This is further confirmed by our extensive empirical results on six benchmark datasets, which show consistent and significant improvements of PromptDyG over state-of-the-art baselines.

[LG-56] Learning Graphs through Continuous Information Entropy Fields

Link: https://arxiv.org/abs/2606.22895
Authors: Hui Cong, Bo Sun, Ziheng Jiao, Yisheng An
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 5 figures

Abstract:Graph theory is inherently descriptive, capturing what relationships exist but not why they arise, because it treats edges as primitive constructs. This paper proposes a new explanatory framework for graph learning, where relationships emerge from latent continuous information entropy fields, and a graph becomes a discrete instantiation of an underlying field. To formalize this field, we introduce the Field-informed Graph Network (FGN). It learns a scalar field from node features and leverages it to modulate message passing. The information-theoretic objective balances structural fidelity with field smoothness, forming a self-reinforcing loop. In this loop, the field modulates information diffusion through field-modulated weighting, and the updated node representations iteratively refine the field. As a result, FGN learns by simulating its own co-evolution. Extensive experiments on node classification and graph classification benchmarks demonstrate superior performance, robustness to perturbations, and structurally coherent field representations.

[LG-57] Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study

Link: https://arxiv.org/abs/2606.22889
Authors: Khalil Ahammad, Derek Abbott, Mohsen Dorraki
Categories: Machine Learning (cs.LG)

Abstract:Multimodal large language models (LLMs) are increasingly adopted to interpret 12-lead ECG images, though the interpretations often lack validation. However, ECG image understanding significantly differs from general images as it depends on precise waveform morphology, lead relationships and accurate interval measurements. This study investigated whether zero-shot multimodal LLMs can reliably distinguish normal and abnormal ECG images and, in parallel, evaluated CNN-based models for clinically grounded references. Standard 12-lead ECG recordings were rendered as single-page images for a binary normal-abnormal classification task. Three prominent LLMs (GPT-5.2, GPT-4.1, and Gemini-2.5 Pro) were tested using a fixed zero-shot prompt across multiple runs. In parallel, a physiology-aware CNN-based model was developed with the capability to aggregate features from the predefined anatomical lead groups. The model was compared with ResNet18, DenseNet121, VGG16 baselines, and all the models were evaluated on an internal test set and external PTB-XL dataset. Across seeds, CNN-based models demonstrated stable discrimination, with average internal ROC-AUC of 0.92-0.94, and external ROC-AUC of 0.85-0.86. The proposed LeadGroupECG model significantly improved over its backbone internally without compromising external generalization. It remained competitive with other baselines, while consistently highlighting anatomical lead-group contributions. In contrast, zero-shot LLM discrimination remained near-chance (ROC-AUC around 0.5). The PR-AUC improved slightly when ECGs used a grid-based calibration background compared with the grid-free ECGs. Although multimodal LLMs can generate reasonable ECG narratives, their zero-shot diagnostic discrimination remains limited. Therefore, clinically framed, domain-specific architectures remain essential for AI-based ECG interpretation.

[LG-58] When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents SIGIR2026

Link: https://arxiv.org/abs/2606.22864
Authors: Yanhang Li, Zhichao Fan, Zexin Zhuang
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 3 figures. Camera-ready version for EvalMG '26, The 2nd Workshop on Evaluation for Multimodal Generation, co-located with SIGIR 2026

Abstract:Hidden-state probing – a linear classifier on a frozen vision-language model’s internal activations – has emerged as an attractive evaluation tool for flagging indirect prompt injection (IPI) in multimodal computer-use agents before the agent emits a corrupted action. We argue, on a single-backbone cautionary case study (Qwen2.5-VL-7B on Mind2Web, teacher-forced replay), that a high probing AUC on a clean-vs-attack split is not, on its own, evidence of malicious-content detection. Two post-hoc diagnostics – a paired-construction scalar baseline on text-side injections, and same-step nuisance-matched visual controls on the overlay surface – do not license an unqualified malicious-content interpretation of the headline while leaving room for partly-semantic readings. We package the diagnostics as a candidate control set with reporting heuristics for what a high clean-vs-attack AUC does and does not license. Labels are injection-surface-present, not attack success; generalisation beyond this backbone and benchmark is a conjecture.

[LG-59] RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving

Link: https://arxiv.org/abs/2606.22840
Authors: Haifeng Wu, Srinivasan Manoharan, Fangbo Tu, Junhua Zhao, Jian Wan
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 1 figure, 9 tables

Abstract:We present RLM-Cascade, a proxy-layer system that applies speculative decoding at the response level to reduce LLM API costs without requiring model architecture access or a shared vocabulary. A fast, inexpensive draft model generates a candidate response; a capable verify model accepts, enhances, or is bypassed entirely depending on a lightweight complexity router. On a real-world agentic coding workload (Claude Code), RLM-Cascade achieves a draft-use rate of 88.8% across 125 production requests, reducing API cost by 45.8% relative to a direct Opus baseline. Counter-intuitively, the proxy also reduces end-to-end latency: median response time is 2,026 ms versus 3,698 ms for Native Opus – a 1.83X speedup at p50 – because the SKIPPED path (DeepSeek only, no Opus call) dominates the workload distribution. Quality matches or exceeds the Opus baseline: 100% pass rate on a 20-task Code/Math/Instruct benchmark versus 95% for Native Opus. We further describe a rule-based complexity router that selects the SKIPPED path for simple agentic turns and a hybrid tool-call strategy that bypasses the speculative pipeline for schema-critical tool-selection turns. RLM-Cascade is deployed in production as an enterprise AI infrastructure component and published as open source with a live metrics dashboard and Prometheus endpoint.

[LG-60] Learning-Augmented Algorithms for Online Vertex Cover

Link: https://arxiv.org/abs/2606.22831
Authors: Tianhang Lu, Runtian Ren, Shengcai Liu
Categories: Computational Complexity (cs.CC); Machine Learning (cs.LG)

Abstract:This paper studies learning-augmented online weighted vertex cover with advice and a parameter $\lambda \in (0,1)$. We consider two graph cases: bipartite graphs and general graphs. In both settings, the online algorithm must maintain a feasible vertex cover under irrevocable decisions. We show that these problems admit the same robustness–consistency tradeoffs as learning-augmented ski rental. For the bipartite graph model, we give a randomized algorithm that is $\frac{1}{1-e^{-\lambda}}$-robust and $\frac{\lambda}{1-e^{-\lambda}}$-consistent. For the general graph model, we give a deterministic algorithm that is $(1+\frac{1}{\lambda})$-robust and $(1+\lambda)$-consistent. We prove that the tradeoffs above are optimal in both settings. We also validate the proposed algorithms through experiments on synthetic and real-world datasets.
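
To get a feel for the tradeoff, the stated bounds can be evaluated numerically. The sketch below only computes the robustness and consistency ratios given in the abstract for a chosen value of the parameter; the function names are hypothetical, and no part of the algorithms themselves is reproduced.

```python
import math

def randomized_bipartite_bounds(lam):
    """(robustness, consistency) of the randomized bipartite algorithm."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    denom = 1.0 - math.exp(-lam)
    return 1.0 / denom, lam / denom

def deterministic_general_bounds(lam):
    """(robustness, consistency) of the deterministic general-graph algorithm."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    return 1.0 + 1.0 / lam, 1.0 + lam

rand_robust, rand_consistent = randomized_bipartite_bounds(0.5)
det_robust, det_consistent = deterministic_general_bounds(0.5)
```

At a parameter value of 0.5 the randomized bounds come out to roughly 2.54 (robustness) and 1.27 (consistency), against 3 and 1.5 for the deterministic algorithm. Smaller parameter values place more trust in the advice: consistency approaches 1 while the robustness bound grows.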

[LG-61] BranchShine: Compact Raw-Audio-to-IPA Transcription with a RoPE E-Branchformer Encoder

Link: https://arxiv.org/abs/2606.22824
Authors: Nikhil Navas, Sergio Chevtchenko, Talisson Damiao, Saeed Afshar
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 6 figures and 6 tables

Abstract:Speech-to-IPA transcription is useful when the desired output is pronunciation rather than orthographic text, but competitive multilingual systems are often large and evaluation is sensitive to normalization choices. This paper presents BranchShine, a 33M-parameter raw-audio CTC recognizer with a lightweight convolutional front end and a 19-block RoPE E-Branchformer encoder. We find that BranchShine provides a compact and competitive operating point for IPA transcription under matched normalization and scoring. On a 16,660-utterance multilingual test set covering 41 language labels, BranchShine obtains 9.19% whitespace-insensitive IPA character error rate, compared with 9.78% for the 575.00M-parameter PhoneticXEUS baseline. A secondary child speech reading analysis shows a complementary operating profile: BranchShine is more conservative on incorrect readings, while Whisper-Medium is stronger on exact acceptance of correct readings. Overall, the results indicate that a compact raw-audio-to-IPA model can approach much larger baselines on character-level IPA transcription.

[LG-62] Retrieval-Augmented Multimodal Learning for Enzyme-Substrate Interaction Prediction Under Low-Homology Shift

Link: https://arxiv.org/abs/2606.22823
Authors: Chen Liu, Bingxin Zhou, Xinyuan Wang, Ming Li, Guisheng Fan, Liang Hong
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 13 pages, 6 figures

Abstract:Enzyme substrate interaction (ESI) prediction is a fundamental computational task for biocatalyst discovery and reaction screening in large biochemical spaces. In practical settings, ESI prediction is challenged by sparse positive supervision and low-homology distribution shift, where test enzymes share limited sequence identity with those observed during training. To address these challenges, we propose RAMMESI, a retrieval-augmented multimodal framework for robust ESI prediction. RAMMESI learns explicit pairwise enzyme-substrate representations through directional cross-modal interaction modeling and adaptive fusion. To enhance robustness, RAMMESI retrieves neighboring enzymes at inference time, recombines them with the query substrate, and aggregates the resulting pairwise predictions as contextual evidence. To improve learning under sparse positive supervision, we further adopt an imbalance-aware weighted-BCE objective. Experiments on two ESI benchmarks under sequence-identity-aware splits demonstrate that RAMMESI achieves consistently strong performance, with particular advantages in more challenging low-identity regimes. In addition, the retrieval module improves multiple ESI backbones in a plug-and-play manner, suggesting that retrieval provides a general mechanism for improving robustness under homology shift.

[LG-63] Towards Robust Personalized Federated Learning: Vulnerability Assessment and Defense Co-Design

Link: https://arxiv.org/abs/2606.22782
Authors: Mingyuan Fan, Cen Chen
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:The proliferation of IoT devices has led distributed edge systems to collect vast amounts of sensitive data, creating fertile ground for on-device machine learning applications. While federated learning (FL) mitigates privacy concerns by exchanging model parameters instead of raw data, we identify a critical blind spot in current research. We examine the most commonly used personalized federated learning (PFL) methods, which allow clients to maintain private, personalized models to address data heterogeneity across clients. Through systematic analysis, we reveal that PFL methods exhibit heightened vulnerability to transfer-based adversarial attacks compared to centralized learning paradigms. Specifically, malicious clients can exploit local model knowledge to craft adversarial examples that compromise peer clients’ personalized models. We establish this vulnerability through both theoretical analysis and empirical evaluation across multiple benchmark datasets, demonstrating significant accuracy drops across various PFL methods. To address this challenge, we propose a defense framework combining stochastic input noise, input-scaled trace regularization, and parameter sensitivity maximization to improve FL’s robustness. Our findings establish the first systematic study of adversarial threats in PFL systems, providing both diagnostic tools and practical countermeasures.

[LG-64] Statistical Matching via Schrödinger Bridge beyond Conditional Independence

Link: https://arxiv.org/abs/2606.22770
Authors: Eunho Koo, Tongseok Lim, Jinwon Sohn
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Statistical matching combines partially overlapping datasets that share covariates X but observe the target Y and auxiliary variables Z separately. Classical approaches typically invoke the conditional independence assumption (CIA), which makes the problem identifiable but fundamentally implies that the imported auxiliary variable provides no additional predictive power for Y once X is known. To capture this latent Y – Z dependence, we propose a novel dependency-aware Schrödinger bridge for predictive statistical matching. Our approach couples the two separated databases by tilting the conservative CIA baseline with a transportation-based compatibility cost, recovering an informative joint distribution. The resulting statistical learning framework yields full probabilistic posterior rules for bidirectional imputation. Theoretically, we establish a sufficient condition under which the learned bridge strictly improves over the CIA baseline, alongside an exact joint recovery guarantee in the Gaussian setting under an appropriate cost. Across synthetic benchmarks and real-world datasets (CelebA and Adult), we demonstrate that our dependency-aware completion consistently improves downstream predictive utility, proving especially beneficial in settings like data recoding where the underlying population exhibits strong Y – Z dependence.

[LG-65] Factored Gossip DiLoCo: Reducing Blocking Communication in DiLoCo ICML2026

Link: https://arxiv.org/abs/2606.22768
Authors: Chamin Hewa Koneputugodage, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Hadi Mohaghegh Dolatabadi, Shamane Siriwardhana, Gil Avraham, Violetta Shevchenko, Karol Pajak, James Snewin, Alexander Long
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted at ICML 2026. 29 pages, 7 figures

Abstract:To make large-scale distributed training practical outside high-bandwidth datacenters, we must reduce blocking, high-volume synchronization. While DiLoCo communicates infrequently, its outer synchronization remains bandwidth-heavy and brittle to stragglers and transient failures. We relax exact synchronization to approximate synchronization via mixing/gossip, which degrades gracefully under delays and communication failures. This allows us to factorize DiLoCo synchronization into a non-blocking mixing step that overlaps computation with no staleness, and a blocking mixing step that tightens worker agreement, yielding a tunable trade-off between compute utilization and optimization stability. On up to billion-parameter language models in low-bandwidth settings, our framework substantially improves compute utilization compared to DiLoCo, with training progress ranging from comparable to closely matching it, and is more robust to failures.

[LG-66] One-Step Flow Matching for Generative Modeling of Path-Dependent Physical Fields

Link: https://arxiv.org/abs/2606.22752
Authors: Yijing Zhou, Jasmin Jelovica
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
Comments: 25 pages

Abstract:Physical simulations of intricate geometries with path-dependent constitutive models are difficult because of their enormous computational cost. The recent emergence of generative AI models, which have succeeded in image and video synthesis tasks, offers a promising route to improving such simulations. Although U-Net-based denoising diffusion probabilistic models (DDPMs) have been adopted for elastic stress field generation, they typically require hundreds of sampling steps, and applications of generative models to path-dependent (e.g., plastic) stress fields remain very limited. In this work, we propose a novel flow matching (FM) model based on a transformer backbone for high-resolution path-dependent stress field generation with stochastic loading-unloading paths and geometry. The proposed model operates within the latent space of a variational autoencoder (VAE) and formulates the simulation of plastic fields as a video synthesis task, directly generating the stress fields across all time steps. Meanwhile, we design a non-Gaussian source distribution for flow matching, such that crossings among conditional transport paths are reduced during training. This enables our model to generate satisfactory samples in one step without relying on distillation. In addition, we introduce token-level loading embeddings and two auxiliary networks to further enhance the model performance in path-dependent simulation. The results demonstrate that, even with a limited training dataset, our model can accurately generate high-resolution path-dependent fields. It is much more computationally efficient than finite element analysis, providing a speedup of 6 to 7 times over FEM on CPUs and approximately two orders of magnitude speedup on consumer-grade GPUs.

[LG-67] Error Highways: Scaling Predictive Coding to Very Deep Networks

Link: https://arxiv.org/abs/2606.22744
Authors: Amirhossein Mohammadi, Alexander G. Ororbia
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages, 6 figures, 4 tables

Abstract:Predictive coding networks (PCNs) offer a biologically-plausible, local-learning alternative to back-propagation of errors (backprop). Nevertheless, they have remained largely confined to shallow architectures and evaluated on simple machine intelligence benchmarks. A central obstacle to scaling PCNs is that the learning signal decays rapidly as it propagates away from the clamped boundaries, leaving interior layers effectively unchanged. To directly counter this problem, we propose highway error propagation (HEP), a scheme that augments the free energy function underlying predictive coding (PC) by altering its neural structure with feedback matrices $V_{L\to i}$ that couple selected hidden states directly to the clamped output error. Since this coupling is linear in the hidden state, the highway pathway delivers a correction at every inference step whose magnitude is independent of depth, in contrast to vanilla PC where the output error reaches the $i$-th hidden layer with attenuation that decays exponentially in depth. This bypasses the Jacobian chain while preserving the local PC synaptic update rule. On MNIST and Fashion-MNIST, we show that HEP effectively trains MLPs of up to 128 layers with accuracy that is robust with respect to depth.

[LG-68] GRADE: Graph Representation of LLM Agent Dependency and Execution

Link: https://arxiv.org/abs/2606.22741
Authors: Yue Zhao
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 5 figures, 8 tables. Code: this https URL

Abstract:Can one graph represent every kind of LLM agent’s run? A trace records what each step did, never what it relied on, the state it read, and the results it reused. GRADE recovers that missing layer: it models any run as one graph over its step nodes with two edge layers, execution edges (what ran in what order) read from the trace for free, and dependency edges (what each step relied on) rarely logged, so each is graded by how it is known, observed, declared, or inferred. One representation, and each layer earns its place. Across six corpora of LLM agents spanning tool use, coding, and the web, the dependency layer can predict failure where run size is weak and, under leave-one-corpus-out transfer, stays above chance on every held-out class while run size fails. Meanwhile, the execution layer localizes the faulting step in a failed multi-agent run. This work also provides a more in-depth analysis of why generic graph neural networks may misread the dependency layer, unlike our feature-based alternative. The same graph representation opens further uses, carrying from failure diagnosis in a single run to efficiency and robustness optimization at scale.

[LG-69] Clipping the Price of Adaptivity at the Tail

Link: https://arxiv.org/abs/2606.22669
Authors: Itai Kreisler, Yair Carmon, Oliver Hinder
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Adaptive stochastic convex optimization (SCO) methods face a fundamental "price of adaptivity" barrier: under the standard set of assumptions, they cannot efficiently adapt to large uncertainty in both the initial distance to optimality and the Lipschitz constant. We circumvent this barrier by requiring a small amount of additional structure common to many learning problems. Specifically, we assume that the objective decomposes into a model and a loss function, enabling us to intervene by modifying the model’s output before it passes to the loss function. Under this assumption, we design a method that clips the learned model output in tail events where it deviates too much from the output of a fixed reference model. Our method matches the optimal bounds for known-parameter SCO up to logarithmic factors in the uncertainty in the distance and Lipschitz parameters, thus efficiently adapting to large uncertainty in both.
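
The intervention described here, clipping the model output when it strays too far from a reference model, can be sketched in a few lines. This is a minimal illustration assuming a scalar model output and a fixed clipping radius; the name `clip_to_reference` and the `radius` parameter are hypothetical, and the abstract does not say how the paper chooses the threshold.

```python
def clip_to_reference(model_out, ref_out, radius):
    """Keep model_out inside [ref_out - radius, ref_out + radius].

    Outputs already within the band pass through unchanged; only tail
    events, where the learned model deviates by more than radius from
    the reference model, are modified before reaching the loss.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    low, high = ref_out - radius, ref_out + radius
    return min(high, max(low, model_out))

inside = clip_to_reference(0.3, ref_out=0.0, radius=1.0)       # unchanged
tail_high = clip_to_reference(5.0, ref_out=0.0, radius=1.0)    # pulled down to 1.0
tail_low = clip_to_reference(-4.0, ref_out=0.5, radius=1.0)    # pulled up to -0.5
```

The loss would then be evaluated on the clipped value rather than on the raw model output, which is the decomposition into model and loss that the abstract relies on.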

[LG-70] From Complaint Narratives to Monetary Relief: A Hybrid Machine Learning Framework for CFPB Consumer Complaints

Link: https://arxiv.org/abs/2606.22664
Authors: Zhuoer Wang, Sizhen Zhu, Xiongyu Chen
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Consumer financial complaints provide a valuable source of information for identifying service failures, dispute frictions, and operational deficiencies in consumer-facing financial institutions. This paper proposes a hybrid machine learning framework for predicting monetary relief outcomes using Consumer Financial Protection Bureau complaint data. We formulate the task as an imbalanced binary classification problem, where complaints closed with monetary relief are treated as compensable outcomes. The proposed framework integrates multiple sources of predictive information, including complaint narrative text, LDA-based topic representations, interpretable text-engineered features, and structured categorical attributes such as company and state. An XGBoost classifier is trained using a temporal train-test split, with earlier complaints used for model development and more recent complaints reserved for out-of-sample evaluation. Compared with a TF-IDF baseline, the proposed framework substantially improves predictive performance, increasing AUC-ROC from 0.69 to 0.78 and improving PR-AUC under class imbalance. Feature importance analysis shows that textual signals, latent complaint topics, and company identity all contribute meaningful predictive information. In particular, company-level effects reveal systematic variation in complaint resolution patterns across financial institutions. These findings suggest that consumer complaint narratives can serve as alternative data for monitoring consumer harm, identifying firm-level operational weaknesses, and supporting early-stage risk surveillance in consumer finance.

[LG-71] LSTM Variants for Chaotic Dynamical Systems: An Empirical Study on the Lorenz Attractor

Link: https://arxiv.org/abs/2606.22662
Authors: Ruslan Gokhman
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Forecasting chaotic dynamical systems such as the Lorenz attractor is notoriously difficult: small numerical errors are amplified exponentially over long autoregressive rollouts. We study seven recurrent and convolutional architectures for the AI-DEEDS 2026 Chaotic Systems Challenge: a vanilla LSTM, an LSTM with additive attention, a Bidirectional LSTM (BiLSTM), a BiLSTM trained with the Huber loss, a Temporal Convolutional Network (TCN), a CNN front-end followed by an LSTM, and a CNN front-end followed by a BiLSTM. All models share the same pre-processing, sequence length, and rollout procedure, isolating the contribution of each design choice. The challenge scores predictions on a 0-100 scale where higher is better. We obtain leaderboard scores between 45.72 and 58.81, with the BiLSTM trained with Huber loss being the strongest configuration. Two findings stand out: (i) adding additive attention to the unidirectional baseline degraded performance by over ten points, and (ii) prepending a CNN front-end to either an LSTM or a BiLSTM did not help and slightly hurt the score. Per-pair RMSE measurements confirm that the BiLSTM family generalizes better in the harder pairs (6-7), while the LSTM + Attention model collapses there (RMSE up to 8.94 on pair 6). We discuss why bidirectional context and a robust loss help in chaotic regimes while attention and CNN front-ends fail in this setting.
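
The error amplification mentioned at the start of the abstract can be reproduced directly on the Lorenz system. The sketch below assumes the classical parameter values (sigma = 10, rho = 28, beta = 8/3), a fourth-order Runge-Kutta integrator, and a step size of 0.01; none of these settings are stated in the abstract, so they should be read as illustrative choices rather than the challenge's configuration.

```python
import math

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (classical parameters assumed)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(a + h * b for a, b in zip(s, k))
    k1 = lorenz(state)
    k2 = lorenz(shifted(state, k1, dt / 2))
    k3 = lorenz(shifted(state, k2, dt / 2))
    k4 = lorenz(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def separation_after(steps, eps=1e-8, dt=0.01):
    """Distance between two trajectories whose initial x differs by eps."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + eps, 1.0, 1.0)
    for _ in range(steps):
        a, b = rk4_step(a, dt), rk4_step(b, dt)
    return math.dist(a, b)

start = separation_after(0)
later = separation_after(2500)   # 25 time units
```

Comparing `later` with `start` shows how a perturbation far below any realistic prediction error grows over a rollout, which is why long autoregressive forecasts of this system degrade regardless of architecture.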

[LG-72] A Markov Chain Approach to Preference Alignment

Link: https://arxiv.org/abs/2606.22652
Authors: Takuya Koriyama, Tengyuan Liang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 48 pages, 7 figures

Abstract:We propose Markov Chain from Human Feedback (MCHF), an elementary approach for aligning generative models from pairwise human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), which reduces comparisons to a scalar reward, and Nash Learning from Human Feedback (NLHF), which preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility $U(x,y)$, which quantifies human preference for $y$ over $x$, and a reference probability distribution $\mu_{\mathsf{ref}}$, we define a Markov kernel $\mathsf{P}(x, dy)\propto \exp(U(x,y))\,\mu_{\mathsf{ref}}(dy)$, and take the Markov chain starting from $\mu_{\mathsf{ref}}$ as an iterative alignment procedure. We show that MCHF converges geometrically fast to the stationary distribution, with a convergence rate governed by the seminorm $\|U\|_\oplus=\inf_{g,f\in L^\infty(\mu_{\mathsf{ref}})}\|U-g\oplus f\|_\infty$, which quantifies the non-transitive structure of the pairwise utility. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, through a perturbation analysis, we prove that when $\|U\|_\oplus$ is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches to alignment. In particular, for two natural algorithms that converge to the MCHF/NLHF equilibria, we show that the first step of MCHF and NLHF recovers the RLHF solution based on the column-sum reward $\hat f(y)=\int \mu_{\mathsf{ref}}(dx)\, U(x, y)$, and starting from the second iteration, both algorithms incorporate the same linear functional of the residual $U-(-\hat f)\oplus \hat f$, which captures the non-transitive structure of the pairwise utility $U$.
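The kernel definition can be made concrete on a finite output set. The following is a minimal sketch under that assumption: the utility becomes a matrix `U[x, y]`, the reference distribution a strictly positive vector `mu_ref`, and the function names are hypothetical. The example uses a purely transitive utility `U[x, y] = f[y] - f[x]`, for which every kernel row equals the exponentially tilted reference distribution, so a single step already lands on it.

```python
import numpy as np

def mchf_kernel(U, mu_ref):
    """Row-stochastic kernel with P[x, y] proportional to exp(U[x, y]) * mu_ref[y]."""
    logits = U + np.log(mu_ref)[None, :]
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def mchf_iterate(U, mu_ref, steps):
    """Push the reference distribution through the kernel `steps` times."""
    P = mchf_kernel(U, mu_ref)
    mu = mu_ref.copy()
    for _ in range(steps):
        mu = mu @ P
    return mu

# Transitive toy utility: U[x, y] = f[y] - f[x] for an illustrative reward f.
f = np.array([0.0, 1.0, 2.0])
U = f[None, :] - f[:, None]
mu_ref = np.array([0.5, 0.3, 0.2])

one_step = mchf_iterate(U, mu_ref, steps=1)
tilted = mu_ref * np.exp(f)
tilted /= tilted.sum()  # reward-tilted reference distribution
print(one_step, tilted)
```

For a utility with a non-transitive component the rows of `P` differ, and later iterations move the distribution away from this one-step tilt.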

[LG-73] RAVEN: Agentic RAG for Automated Vulnerability Repair

Link: https://arxiv.org/abs/2606.22647
Authors: Varun Gadey, Zijie Liu, Alexandra Dmitrienko
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 17 pages, 4 figures. Under review

Abstract:Automated vulnerability repair has emerged as a promising direction to mitigate the growing number of software vulnerabilities. Recent advances in Large Language Models (LLMs) have further accelerated research in automated repair. However, existing frameworks remain largely restricted to memory-related vulnerabilities and locally repairable vulnerability settings, leaving generalization to unseen vulnerability types underexplored. Their evaluations are often limited to a single programming language, and largely rely on proprietary models. In this paper, we propose RAVEN, a scalable, efficient and autonomous framework that integrates an agentic retrieval-augmented generation (RAG) pipeline with controlled iterative repair in a unified framework. The framework utilizes open-source LLMs in a fully locally deployable setting with limited GPU requirements, while building a multi-faceted retrieval pipeline to retrieve historically relevant vulnerability fixes and guide the patch generation. In addition, RAVEN introduces a dedicated Curator Agent that retrieves cross-file dependencies from the target repository, to fix complex vulnerabilities that cannot be addressed using local vulnerable code alone. We evaluate RAVEN on 160 real-world CVE vulnerabilities across diverse vulnerability types, two programming languages, unseen CWE categories, and out-of-distribution settings. RAVEN achieves an overall repair success rate of 83.13%, outperforming all existing state-of-the-art repair frameworks, while also demonstrating strong generalization capabilities and maintaining the repair cost negligible.

[LG-74] Scalable Maximum Entropy Reinforcement Learning for Diffusion Policies via Adjoint Matching

Link: https://arxiv.org/abs/2606.22630
Authors: Serge Thilges, Onur Celik, Denis Blessing, Emiliyan Gospodinov, Gerhard Neumann
Categories: Machine Learning (cs.LG)

Abstract:Diffusion policies have recently emerged as a powerful paradigm for representing complex action distributions in reinforcement learning (RL). However, their application to online RL remains limited by the challenge of scalable training in the absence of ground-truth data, where standard optimization techniques such as score matching are not directly applicable. In this work, we introduce a highly efficient algorithm for optimizing diffusion policies by leveraging recent advances in stochastic optimal control. Our approach is based on adjoint matching, which enables simulation-free training and circumvents the need for explicit likelihood estimation or costly backpropagation through the diffusion process. Furthermore, we propose several extensions that improve the robustness and stability of the method in practical settings. Empirical results demonstrate that our approach achieves competitive performance while significantly reducing computational overhead, making diffusion policies more viable for online RL scenarios.

[LG-75] Training-free Task Classification for Multi-Task Model Merging

Link: https://arxiv.org/abs/2606.22589
Authors: Jungyong Son, Jinwook Jung, Sungyong Baik
Categories: Machine Learning (cs.LG)

Abstract:Ever since the advent of foundation models and the pre-training-finetuning paradigm, there have been numerous efforts to merge multiple task-specific experts into a single multi-task model. Prior work largely focuses on finding a single merged model, but it often underperforms individual experts due to parameter interference. To resolve this, dynamic model merging employs routing to activate task-relevant parameters per input. However, existing routers typically require either additional training with abundant labeled datasets or assume the access to task IDs of each input at inference time. In this work, we aim to close the gap to expert performance without additional training or task-ID-access assumption. To this end, we formulate routing as training-free task classification for each test input. Using singular value decomposition (SVD)-based low-rank manifold approximations for each task, SiM scores tasks by the projection residual of the test input feature onto each task manifold and routes accordingly. The task manifolds are pre-computable offline from a pretrained backbone using a small per-task support set (e.g., 32 examples per task) prior to merging process, requiring no router training and no data during the merging process. Moreover, SiM integrates seamlessly with subspace-/mask-based merging that represents task-expert via lightweight compressed task vectors, avoiding the need to store full expert parameters. Experiments across computer vision and natural language processing benchmarks under task-unknown inference demonstrate that SiM substantially improves merged-model performance and consistently narrows the gap to individual task experts.
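The routing rule described in the abstract (score each task by the projection residual of the test feature onto a low-rank task subspace) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: features are taken as plain vectors, the subspace is spanned by the top right singular vectors of the uncentered support matrix, and the rank, dimensions, and function names are hypothetical.

```python
import numpy as np

def task_basis(support_feats, rank):
    """Top-`rank` right singular vectors of an (n_examples, dim) feature matrix."""
    _, _, vt = np.linalg.svd(support_feats, full_matrices=False)
    return vt[:rank].T  # shape (dim, rank), orthonormal columns

def projection_residual(x, basis):
    """Norm of the part of x that lies outside the span of `basis`."""
    return float(np.linalg.norm(x - basis @ (basis.T @ x)))

def route(x, bases):
    """Pick the task whose subspace leaves the smallest residual."""
    residuals = [projection_residual(x, b) for b in bases]
    return int(np.argmin(residuals)), residuals

# Synthetic check: three tasks, each with features in its own 4-dim subspace.
rng = np.random.default_rng(0)
dim, rank, n_support = 64, 4, 32
true_subspaces = [np.linalg.qr(rng.normal(size=(dim, rank)))[0] for _ in range(3)]
bases = [
    task_basis(rng.normal(size=(n_support, rank)) @ q.T, rank)
    for q in true_subspaces
]

# A test feature drawn from task 1's subspace plus a little noise.
x = true_subspaces[1] @ rng.normal(size=rank) + 0.01 * rng.normal(size=dim)
task_id, residuals = route(x, bases)
print(task_id, [round(r, 3) for r in residuals])
```

Because the bases depend only on the support features, they can be computed once offline, which matches the training-free setting the abstract describes.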

[LG-76] Stationary Robust Mean-Field Games under Model Mismatches UAI2026

Link: https://arxiv.org/abs/2606.22579
Authors: Yue Wang
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: Accepted by UAI 2026

Abstract:Deploying multi-agent reinforcement learning (MARL) in the real world is often limited by model mismatches between the training simulators and the true environment, which could be further amplified through strategic interactions and result in severe performance degradation upon deployment. Distributional robustness offers a principled response by optimizing policies against worst-case transition models drawn from an uncertainty set, but standard robust MARL frameworks become increasingly intractable as the number of agents grows. This paper develops an infinite-horizon, stationary mean-field game framework that incorporates distributional model uncertainty directly into the population-coupled dynamics. We establish a robust dynamic programming principle with a contractive Bellman operator and prove the existence of a stationary robust mean-field equilibrium via a fixed-point argument. We further develop the first concrete algorithm with convergence guarantees. We then connect the mean-field solution to a finite-population robust game whose ambiguity sets depend on the empirical distribution, showing that the mean-field equilibrium policy induces approximate equilibrium behavior as the population size increases. Under a contractive robust-dynamics regime, we further obtain explicit non-asymptotic error bounds. Numerical experiments further illustrate the qualitative and quantitative impact of robustness under multiple uncertainty models, validating our theoretical findings.

[LG-77] Deep material network for homogenization of piezoelectric composites

Link: https://arxiv.org/abs/2606.22566
Authors: Ting-Ju Wei, Yen-Ming Lu, Chuin-Shan Chen
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Piezoelectric composites are widely used in sensors, actuators, transducers, and energy-harvesting devices because their effective electromechanical performance can be tailored by combining constituent phases and microstructural architecture. However, conventional computational homogenization based on direct numerical simulation (DNS) is computationally expensive, particularly for multiscale simulations and material design tasks that require repeated homogenization analyses. To address this limitation, this work proposes a piezoelectric deep material network (PDMN) to efficiently homogenize two-phase piezoelectric composites. The proposed framework embeds the governing electromechanical homogenization relations directly into the network architecture, yielding a physics-informed, semi-analytical surrogate that explicitly captures the two-way coupling between the mechanical and electrical fields across constituent phases. The network is trained offline on linear electroelastic datasets and, through a fully coupled Newton–Raphson solution with a consistent electromechanical tangent, subsequently used for efficient online prediction under broader constitutive settings, including nonlinear electroelasticity and history-dependent responses. The framework is validated on two-phase composites of polyvinylidene fluoride (PVDF) and lithium niobate (LiNbO$_3$) with reversed phase arrangements under nonlinear electroelastic loading, and on a viscoelastic–piezoelectric composite exhibiting coupled stress relaxation. Numerical examples show that the proposed PDMN achieves high predictive accuracy while reducing the computational cost by more than three orders of magnitude compared with DNS. The proposed framework, therefore, provides an efficient and reliable surrogate for the multiscale analysis and design of piezoelectric composites.

[LG-78] Detecting and Understanding Vulnerabilities in Fully Homomorphic Encryption Frameworks

Link: https://arxiv.org/abs/2606.22519
Authors: Yiteng Peng, Dongwei Xiao, Zhibo Liu, Zhenlan JI, Shuai Wang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:Fully homomorphic encryption (FHE) allows computations to be performed directly on encrypted data without decryption, offering strong privacy guarantees for sensitive data analysis. This capability is important for privacy-sensitive applications like secure cloud computing, finance, and healthcare. The complexity of FHE schemes, however, has hindered their practical adoption. To make FHE accessible to a broader range of developers, a new generation of specialized frameworks has emerged to translate high-level FHE programs into complex FHE operations, introducing a new programming paradigm. However, the inherent complexity of FHE frameworks makes them prone to incorrect implementation logic. Unlike mere crashes, logic bugs in these frameworks can silently corrupt encrypted computation, potentially leading to severe financial losses and security vulnerabilities in FHE-enhanced applications. In this work, we introduce HERTA, the first automated testing tool tailored for FHE frameworks. HERTA leverages metamorphic testing to uncover deep-seated implementation bugs and vulnerabilities across the multi-layered FHE software stack. To that end, we design a set of novel metamorphic relations (MRs) derived specifically from FHE semantics. These MRs stress the most challenging aspects of the pipeline, enabling automated correctness testing without the need for a manual ground truth. Our evaluation of HERTA on 3 leading industry frameworks discovered 21 previously unknown bugs, several of which have already been confirmed and fixed by developers. Furthermore, our hazard analysis reveals the critical security impact these bugs pose to the integrity and availability of FHE-based services. 

[LG-79] Federated learning with heavy-tailed gradient noise and communication noise: a variance-reduction based algorithm

Link: https://arxiv.org/abs/2606.22466
Authors: Shengchao Zhao, Yongchao Liu
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Federated learning (FL) is an emerging distributed machine learning paradigm that enables local devices to jointly train a global model while keeping data decentralized and private. We propose a variance-reduction based algorithm, VRA-FedSGD, for FL in the presence of heavy-tailed gradient noise and communication noise, where these noises are prevalent in large-scale machine learning over wireless networks and Internet of Things deployments. VRA-FedSGD employs a momentum variance reduction technique together with a nonlinear mapping to mitigate heavy-tailed gradient noise, and uses a variance-reduced aggregation mechanism to suppress heavy-tailed communication noise. In the mean sense, VRA-FedSGD achieves a convergence rate of $\mathcal{O}\left(K^{-(p-1)/(2p-1)}\right)$ for nonconvex objective functions, where $p$ is the tail index of heavy-tailed noise. In the almost sure sense, VRA-FedSGD achieves a convergence rate of $\tilde{\mathcal{O}}\left(K^{-(1-1/(p-\epsilon))}\right)$ for strongly convex objective functions, where $\epsilon$ is an arbitrarily small constant. Simulated experiments on a logistic regression problem with real-world data verify the effectiveness of VRA-FedSGD.

[LG-80] Adaptive Recurrent Message Passing for Test Time Computing on Graphs ICML2026

Link: https://arxiv.org/abs/2606.22462
Authors: Junshu Sun, Wanxing Chang, Qingming Huang, Shuhui Wang
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Abstract:Pre-trained foundation models have demonstrated remarkable success in many domains, enabling a unified backbone to generalize across diverse downstream tasks. However, extending this paradigm to graph learning remains challenging due to the intrinsic mismatch between graph data and fixed architectural designs. In this work, we show that this limitation can be overcome via recurrent graph models. To achieve this, we conduct a systematic theoretical analysis, rigorously deriving step dependence as a necessary and sufficient condition for an adaptively convergent recurrent process. Building on this foundation, we propose AdaR, an Adaptive Recurrent graph model, empowering flexible test-time computing on various downstream tasks without changing model parameters. To enable adaptive inference, AdaR explicitly encodes normalized step information and representation-target relations into the recurrent updates. To ensure convergence of the recurrent process, AdaR employs gradient-based supervision signals that guide representation updates throughout the recurrence. Empirical results demonstrate that AdaR consistently outperforms strong baselines in both inductive and transductive settings.

[LG-81] Distribution-Aware Robust Bilevel Optimization: Quantile-Guided Huber Updates in Two-Timescale Stochastic Approximation

Link: https://arxiv.org/abs/2606.22436
Authors: Zhiyu Li, Xi Xuan, Davide Carbone
Categories: Machine Learning (cs.LG)

Abstract:Bilevel optimization (BLO) is fundamental to hierarchical decision-making but suffers from critical instability under heavy-tailed stochastic noise. Existing variance-reduction techniques typically rely on myopic magnitude checks, which fail to distinguish informative geometric signals from impulsive outliers. To resolve this, we propose RQ-TTSA (Robust Quantile-guided TTSA), a distribution-aware framework that leverages historical gradient buffers to estimate rolling quantiles for adaptive Huber-style clipping, effectively preserving local optimization geometry while strictly bounding effective variance. Theoretically, we provide a convergence analysis for quantile-guided TTSA under nonconvex-strongly convex assumptions with infinite-variance noise ($p \in (1,2]$), deriving a rate of $\mathcal{O}\left(T^{-\frac{p-1}{3p-2}}\right)$ that recovers optimal dependence on the heavy-tailed parameter. Empirically, across six diverse tasks, spanning heterogeneous vision benchmarks, dynamic games under momentum poisoning, and offline reinforcement learning, RQ-TTSA consistently outperforms state-of-the-art baselines by eliminating divergence spikes and ensuring stable convergence. Our method demonstrates significant robustness to hyperparameter variations and incurs negligible computational overhead ($\approx 2.7\%$ increase), validating distribution-aware gradient control as a practical and necessary component for reliable bilevel learning.
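The idea of clipping against a rolling quantile of past gradient norms can be sketched in a few lines. This is a generic illustration, not the RQ-TTSA algorithm: the class name, window length, and quantile level are hypothetical, and the clipping map `g * min(1, tau / ||g||)` is used as the vector analogue of the Huber loss derivative.

```python
from collections import deque

import numpy as np

class QuantileClipper:
    """Clip gradients to a rolling quantile of previously seen gradient norms."""

    def __init__(self, window=100, q=0.9):
        self.norms = deque(maxlen=window)  # history buffer of gradient norms
        self.q = q

    def __call__(self, grad):
        grad = np.asarray(grad, dtype=float)
        norm = float(np.linalg.norm(grad))
        # Threshold comes from history only; the first call is never clipped.
        tau = float(np.quantile(list(self.norms), self.q)) if self.norms else norm
        self.norms.append(norm)
        if norm > tau:
            return grad * (tau / norm)  # rescale to norm tau, keep direction
        return grad

clip = QuantileClipper(window=50, q=0.9)
for _ in range(50):
    clip(np.array([1.0, 0.0]))            # typical gradients of norm 1
outlier = clip(np.array([0.0, 100.0]))    # impulsive outlier
typical = clip(np.array([0.3, 0.4]))      # below the threshold, left untouched
print(np.linalg.norm(outlier), typical)
```

Because the threshold adapts to the recent norm distribution rather than a fixed constant, a single outlier changes the step direction but not its scale.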

[LG-82] Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

Link: https://arxiv.org/abs/2606.22433
Authors: Zhiyu Li, Xi Xuan, Davide Carbone
Categories: Machine Learning (cs.LG)

Abstract:Many central machine learning tasks, from entropy tuning in reinforcement learning to equilibrating generative adversarial networks, are fundamentally stochastic root-finding problems rather than loss minimization. Yet, they are frequently forced into a minimization framework via squared residuals, introducing a critical flaw we identify as the Variance Trap. Standard bilevel minimization algorithms require estimating hypergradients involving implicit Jacobians; in stochastic settings, these terms act as noise amplifiers, destabilizing convergence. We formalize Root-Finding Bilevel Optimization (RF-BO) as a distinct problem class that bypasses this pathology. We propose a Jacobian-free solution using Two-Time-Scale Stochastic Approximation (TTSA) that updates directly along the root error, structurally avoiding variance amplification. We provide the first non-asymptotic convergence guarantees for TTSA in this setting under Markovian noise. Extensive experiments demonstrate the decisive advantage of this paradigm: compared to squared-residual and implicit-gradient baselines, our framework achieves a 2.6% top-1 accuracy gain in SimCLR, $17\times$ faster convergence in non-linear ODE control where baselines fail, significantly improved entropy stability in reinforcement learning, and an 11.1% quality improvement in generative modeling.
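A two-time-scale update that moves directly along a noisy root error, with no Jacobian or squared residual, can be sketched on a toy problem. Everything here is an illustrative assumption rather than the paper's setup: the inner fixed point is `y*(x) = x`, the outer root condition is `h(x, y) = y - 1 = 0`, the noise is i.i.d. Gaussian (the paper treats Markovian noise), and the step-size schedules are hand-picked.

```python
import numpy as np

def ttsa_root_finding(steps=5000, noise_std=0.1, seed=0):
    """Toy two-time-scale stochastic approximation for a nested root problem."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    for t in range(steps):
        beta = 1.0 / (t + 10) ** 0.6  # fast step size for the inner variable
        alpha = 1.0 / (t + 10)        # slow step size for the outer variable
        # Inner update: follow the noisy residual whose root is y*(x) = x.
        y += beta * ((x - y) + noise_std * rng.normal())
        # Outer update: step along the noisy root error h(x, y) = y - 1.
        x -= alpha * ((y - 1.0) + noise_std * rng.normal())
    return x, y

x_final, y_final = ttsa_root_finding()
print(x_final, y_final)
```

Both variables should settle near 1. The outer step uses only the sampled value of `h`, so the noise enters additively instead of being multiplied by a Jacobian estimate.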

[LG-83] Enhancing LLMs for Graph Tasks via Graph-aware LoRA Generation ICML2026

Link: https://arxiv.org/abs/2606.22429
Authors: Junshu Sun, Wanxing Chang, Qingming Huang, Shuhui Wang
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Abstract:Graph neural networks (GNNs) tightly couple their input-output parameters to dataset-specific feature spaces and target sets, exhibiting limited transferability across different datasets. In contrast, language models (LMs) generalize flexibly via a unified input-output interface, motivating recent attempts to adapt LMs to graph tasks. However, existing methods struggle to encode whole-graph information, leading to potential information loss and suboptimal graph understanding. In this work, we propose a novel weight-level information injection paradigm for adapting LMs to graph tasks. This paradigm injects whole-graph information by generating task-specific weight updates that interact directly with hidden representations. Instantiating this paradigm following low-rank adaptation (LoRA), we introduce GaRA, a Graph-aware LoRA generation model. GaRA constructs low-rank weight updates conditioned on the original graph structures and constrains the norm of the generated updates, thus injecting whole-graph information and avoiding the optimization bias in the weight generation. Empirical studies demonstrate that GaRA consistently outperforms baselines on zero-shot graph learning tasks.

[LG-84] QeHDC: Hyperdimensional Computing based on Quantum-enhanced binding and SuperClass Construction

Link: https://arxiv.org/abs/2606.22421
Authors: Yangjie Xu, Hui Huang, Li Ning, Radu State
Categories: Machine Learning (cs.LG)

Abstract:Hyperdimensional Computing (HDC) is a robust computational framework inspired by human cognition characterized by simple and efficient operations within high-dimensional vector spaces. Quantum-enhanced Hyperdimensional Computing (QeHDC) extends classical HDC by leveraging quantum mechanical properties to enhance computational efficiency. In this paper, we propose a novel Quantum HDC framework featuring a one-pass training method, leveraging sinusoidal and quantum encoding to project classical data into quantum amplitude states efficiently. Our framework introduces an innovative reference-state-based quantum binding operation realized via quantum circuits. Furthermore, we propose a density-matrix-based superclass generation strategy employing eigenvalue decomposition to extract critical quantum state features effectively, enabling a more accurate and robust class representation. Experimental evaluations conducted on standard benchmark datasets demonstrate our approach’s superior performance, robustness to noise, and computational feasibility compared to traditional classical and existing quantum-enhanced approaches. The results highlight the practical benefits and potential of Quantum HDC for quantum-enhanced classification tasks and pave the way for future advancements in quantum-inspired computational paradigms.

[LG-85] Asymptotic Signal Subspace Recovery in Softmax Attention Models

Link: https://arxiv.org/abs/2606.22406
Authors: Lan V. Truong
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 27 pages, 3 figures

Abstract:Attention mechanisms have demonstrated remarkable empirical success in identifying relevant information from large collections of tokens, yet the theoretical principles underlying this behavior remain poorly understood. We study a stylized softmax-attention model in which a query vector is learned by stochastic gradient ascent from a collection of informative and nuisance tokens. Exploiting the symmetry of the model, we derive a population objective and characterize the limiting ordinary differential equation governing the learning dynamics. Using tools from stochastic approximation and dynamical systems theory, we establish a rigorous connection between the stochastic learning algorithm and its deterministic limit. Our main result shows that, under suitable high-dimensional scaling assumptions and standard step-size conditions, the learned query converges almost surely to the one-dimensional signal subspace spanned by the latent informative direction. Equivalently, the query asymptotically recovers the latent signal up to the intrinsic sign ambiguity. These results provide a rigorous theoretical foundation for understanding attention mechanisms as signal extraction procedures in high-dimensional noisy environments and offer a dynamical-systems perspective on how attention discovers relevant information in the presence of substantial noise.

[LG-86] Bypassing Minimization Bias: A Shift-Invariant Variance Estimator for Off-Equilibrium Local Learning Coefficients

Link: https://arxiv.org/abs/2606.22389
Authors: Yingjia Cai
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 1 figure, 6 tables. Code is available at this https URL

Abstract:Singular Learning Theory leverages the Local Learning Coefficient (LLC) to quantify the geometry of neural network loss landscapes. However, mean-energy LLC estimators depend explicitly on an additive loss baseline, typically an estimate of the local minimum. During transient, off-equilibrium training phases, this minimum is unknown; substituting it with the lowest noisy mini-batch loss induces a systematic minimization bias that distorts the geometric measurement. In this paper, we propose the Shift-Invariant Variance Estimator (SIVE), a variance-based local LLC probe that structurally eliminates the unknown additive baseline through the variance operator. Combining this shift-invariant observable with an explicit correction derived from the Law of Total Variance, SIVE separates geometric loss fluctuations from mini-batch evaluation noise. Controlled experiments on analytically tractable toy models show that SIVE recovers the expected finite-temperature geometric signal in regimes where anchored mean estimators fail. Applied to deep neural networks, SIVE provides a robust, localized online diagnostic for tracking structural phase transitions throughout training.
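The two ingredients named in the abstract, shift invariance of the variance and the Law of Total Variance, can be written out explicitly. The notation is assumed for illustration and is not taken from the paper: $w$ denotes sampled parameters, $L(w)$ the full-data loss, $\hat L_B(w)$ an unbiased mini-batch estimate with $\mathbb{E}[\hat L_B(w)\mid w]=L(w)$, and $c$ the unknown additive baseline.

```latex
\begin{align}
\operatorname{Var}_w\bigl(L(w) - c\bigr)
  &= \operatorname{Var}_w\bigl(L(w)\bigr), \\
\operatorname{Var}\bigl(\hat L_B(w)\bigr)
  &= \operatorname{Var}_w\bigl(\mathbb{E}[\hat L_B(w)\mid w]\bigr)
   + \mathbb{E}_w\bigl[\operatorname{Var}(\hat L_B(w)\mid w)\bigr] \\
  &= \operatorname{Var}_w\bigl(L(w)\bigr)
   + \mathbb{E}_w\bigl[\operatorname{Var}(\hat L_B(w)\mid w)\bigr].
\end{align}
```

Under these assumptions, the geometric fluctuation $\operatorname{Var}_w(L(w))$ is the observed variance minus the expected mini-batch noise, and it needs no estimate of $c$. How SIVE converts this quantity into an LLC estimate is not stated in the abstract.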

[LG-87] Multigrid Training for Molecular Generation using Graph Neural Networks

Link: https://arxiv.org/abs/2606.22377
Authors: Zixuan Ling, Paula Mercurio, Di Liu
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 29 pages, 12 figures

Abstract:Deep learning has demonstrated significant success for modeling biochemical molecular systems, where inputs are commonly represented as graphs or 3D grids. A major challenge is that computational cost scales with resolution, making full graph/grid computation of molecular densities expensive and often unstable. We introduce a multigrid training strategy that leverages low-resolution optimization to accelerate learning at higher resolution through parameter transfer across discretizations. For graph molecular representations, we progressively transfer parameters learned from a coarse graph to a sequence of increasingly finer graphs via biased random walk upsampling. For 3D molecular generation, we voxelize the molecular structures at multiple resolutions, pretrain a coarse-resolution conditional Variational Autoencoder (CVAE), and initialize a fine-resolution CVAE by transferring shape compatible convolutional parameters from the coarse model. Numerical experiments on receptor-conditioned 3D Ligand generation show that multigrid training accelerates convergence and improves generalization compared to training from scratch.

[LG-88] Kiwano: A Cutting-Edge Open-Source Toolkit for Speaker Verification

Link: https://arxiv.org/abs/2606.22369
Authors: Mickael Rouvier, Pierre Michel Bousquet
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:In this paper, we present Kiwano, an open-source toolkit designed to advance research and evaluation for speaker verification. Kiwano provides a lightweight yet extensible framework built on PyTorch, offering standardized recipes, pretrained models, and integration of several widely used speaker verification architectures. The toolkit emphasizes reproducibility, by delivering transparent training pipelines, unified evaluation protocols and ready-to-use baselines across multiple corpora. Beyond conventional training and inference, Kiwano includes tools for benchmarking, experiment tracking and rapid prototyping of new architectures. To foster community adoption, the toolkit is distributed under the Apache 2.0 license, accompanied by comprehensive documentation and reproducible experiments. By lowering entry barriers and standardizing evaluation practices, Kiwano contributes a valuable resource for both academic research and applied development in speaker verification. The toolkit is publicly available at: this https URL

[LG-89] Encoder-Decoder Manifold Alignment for Idempotent Generation

Link: https://arxiv.org/abs/2606.22304
Authors: Dareen Alharthi, Abdul Waheed, Bhiksha Raj
Categories: Machine Learning (cs.LG)

Abstract:Recently, several learning paradigms have been introduced to enforce idempotency in generative models. The goal is to ensure that repeated application of a model leaves samples unchanged once they lie on the target data manifold. In practice, however, many of these approaches fail to achieve exact fixed points, leading to instability and drift under repeated application. In this work, we argue that a key reason for this failure is a geometric mismatch between the manifolds learned by the encoder and decoder. The encoder projects inputs onto one latent manifold, while the decoder implicitly learns to reconstruct data from a different manifold. This discrepancy prevents the model from learning truly idempotent mappings. To address this issue, we propose a new training framework that explicitly closes this gap by forcing the encoder and decoder to learn consistent representations of the same underlying data manifold. By aligning the geometry of these components, our method encourages stable projections. Empirically, we show that our approach achieves significantly lower idempotency error and consistently regenerates identical outputs under repeated application, compared to existing methods. We demonstrate the effectiveness of the proposed framework on both image generation and image editing tasks. Finally, we show that enforcing idempotency in this manner improves identity preservation and information stability, leading to more realistic and controllable generative editing models.

[LG-90] Any-Body Guard: Universal Safeguarding for Manipulation Policies via Action Masking

Link: https://arxiv.org/abs/2606.22278
Authors: Alex Beaudin, Hanna Krasowski, Kartik Nagpal, Sanjit A. Seshia, Murat Arcak, Negar Mehr
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Ensuring safety of learning-enabled robotic manipulation across diverse embodiments and tasks still requires significant manual engineering. Existing approaches typically rely on heuristically designed fallback controllers or complex forward invariance assessments. These methods are often too conservative for task success, too computationally expensive for real-time execution, too heuristic to provide useful safety guarantees, or too engineering-heavy to transfer between setups. In this paper, we propose a universal safeguarding approach, X-Safe, which reasons directly in the robot’s configuration space to provide formal probabilistic guarantees for collision avoidance. By operating in the configuration space, our method transfers across embodiments while relying solely on an object-based, quasi-static scene representation and a forward kinematics model of the robotic manipulator. Thus, X-Safe provides useful formal safety guarantees without requiring additional data, or engineering effort for different embodiments or scenes. We demonstrate X-Safe for diverse embodiments and policies, both in simulation and on hardware. We observe less degradation in task performance compared to state-of-the-art safeguarding, no collisions on hardware experiments, and empirically corroborate our formal guarantees.

[LG-91] Learning a Normal World Model for Few-Shot Boundary-Calibrated Abnormality Detection

Link: https://arxiv.org/abs/2606.22261
Authors: Weizhi Nie, Weichao Liu, Weijie Wang, Yuting Su
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: 23 pages, 8 figures, 10 tables

Abstract:Abnormality detection in complex systems faces two practical barriers: abnormal labels are scarce, and binary labels do not quantify how far an event has departed from normal behavior. We study a normal-world modeling formulation for this setting. Instead of learning a large and incomplete space of abnormal classes, the model learns the normal world from abundant normal events and uses a few abnormal examples only to calibrate the boundary of normality. We instantiate this idea as a Hypergraph Entropic Normal-World Model. The model represents multivariate sensor windows as context-conditioned hypergraphs, where hyperedges capture high-order relations among groups of variables. It then defines abnormality by an entropy-aware normal-world energy that combines temporal prediction surprise, hypergraph consistency surprise, and latent normal-manifold departure. On the NASA C-MAPSS turbofan degradation benchmark, the proposed full energy achieves strong zero-shot and few-shot performance across all four subsets and reaches AUROC 0.9983 on FD004, the most complex setting with multiple operating conditions and fault modes. Beyond standard detection metrics, we introduce mechanistic validation tests to probe whether the energy encodes normal-world structure rather than a superficial input-output mapping. The learned energy accepts unseen healthy engines, increases along degradation trajectories, and sharply penalizes context-mismatched cross-variable coupling breaks. These results suggest that normal-world energy can serve as an anomaly score, a graded risk measure, and a testable representation of normal system behavior under severe abnormal-label scarcity.

[LG-92] Evolving Spatial Weights for Cartographic Synthesis

Link: https://arxiv.org/abs/2606.22252
Authors: Gesiel R. Lopes, Roberto F. da Silva, Mellina Yamamura, Sergio H. V. L. de Mattos, Antonio M. Saraiva, Alexandre C. B. Delbem, Eric K. Tokuda
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Methodology (stat.ME)

Abstract:The integration of multiple thematic data layers into a single composite map, known as the cartographic synthesis problem, is typically addressed through expert-driven weighting schemes. This study presents a multi-objective formulation of cartographic synthesis grounded in spatial autocorrelation structure. We develop a bi-objective evolutionary framework, GIS-moGA, that estimates layer weights by simultaneously maximizing global spatial structure, measured by Global Moran’s I, and minimizing local spatial heterogeneity, measured by the variance of Local Indicators of Spatial Association (LISA). Because naive evaluation of spatial relationships requires O(N^2) operations, direct computation becomes impractical for larger datasets. We address this challenge by exploiting the 97.7% sparsity of queen contiguity matrices, reducing effective complexity to O(N k) and enabling scalable municipal-level analysis. The framework is evaluated on a high-dimensional spatial epidemiology dataset with N = 523 units from Araraquara, Brazil. A 64-scenario experimental design is used to examine evolutionary behavior across parameter settings. Results show that higher mutation rates are important for maintaining population diversity and preventing premature convergence in spatially autocorrelated fitness landscapes, where crossover operators can disrupt geographically coherent structures. Compared with expert-derived Analytic Hierarchy Process baselines, the resulting Pareto fronts show substantial hypervolume gains and significant improvements in spatial coherence (p < 0.001, Cliff’s delta = 0.87). These findings provide a systematic and scalable framework for data-driven geographic multi-criteria decision analysis.
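
The two objectives named in the abstract can be illustrated with a minimal sketch. It assumes binary contiguity weights stored as neighbor lists (so only nonzero entries are visited, giving O(N k) work for average degree k) and the common textbook definitions of Global and Local Moran's I. The function names `composite` and `moran_objectives` are hypothetical and are not taken from GIS-moGA.

```python
from statistics import pvariance

def composite(layers, weights):
    """Weighted sum of thematic layers; layers[k][i] is layer k at unit i."""
    n = len(layers[0])
    return [sum(w * layer[i] for w, layer in zip(weights, layers)) for i in range(n)]

def moran_objectives(values, neighbors):
    """Return (Global Moran's I, variance of Local Moran's I).

    neighbors[i] lists the units contiguous to unit i (binary weights),
    so each loop touches only the nonzero entries of the weight matrix.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # centered values
    m2 = sum(d * d for d in z) / n                      # second moment
    total_w = sum(len(nb) for nb in neighbors)          # sum of all weights
    lag = [sum(z[j] for j in nb) for nb in neighbors]   # spatial lag per unit
    global_i = (n / total_w) * sum(zi * li for zi, li in zip(z, lag)) / (n * m2)
    local_i = [zi * li / m2 for zi, li in zip(z, lag)]  # LISA values
    return global_i, pvariance(local_i)
```

A candidate weight vector would be scored by calling `moran_objectives(composite(layers, weights), neighbors)`; an evolutionary search would then maximize the first value and minimize the second.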

[LG-93] Bayesian Adaptation Gym: A Benchmark for the Bayesian Low-Rank Adaptation of Multi-Modal Language Models UAI2026

Link: https://arxiv.org/abs/2606.22188
Authors: Colin Samplawski, Ramneet Kaur, Manoj Acharya, Anirban Roy, Adam D. Cobb
Categories: Machine Learning (cs.LG)
Comments: Oral Paper at UAI 2026

Abstract:Large multi-modal language models are increasingly deployed in high-stakes domains, making well-calibrated uncertainty essential. Traditional Bayesian methods approximate posteriors over all model weights, which becomes intractable for modern large models. For this reason, recent work instead considers Bayesian low-rank adaptation to enable tractable posterior approximation. Due to a lack of a standardized benchmark to evaluate these approaches, it remains unclear where these methods provide meaningful benefits. To fill this gap, we introduce Bayesian Adaptation Gym (BAG), a benchmark for the Bayesian adaptation of multi-modal language models. BAG provides reference implementations of classic Bayesian baselines and state-of-the-art adaptation methods, along with a multi-modal dataset and task suite designed to probe calibration, robustness under distribution shift, and decision-making under uncertainty via active learning. Using BAG, we conduct and report extensive experiments across model sizes, datasets, and tasks to highlight the successes and failures of current Bayesian adaptation approaches. To enable further research, BAG is fully open source: this https URL.

[LG-94] Residue-Level Attributions in Protein Language Models Do Not Recover Allergen Epitopes ICML2026

Link: https://arxiv.org/abs/2606.22181
Authors: Jianzhou Yao (1 and 2), Anxiong Song (1 and 2), Katja Baerenfaller (1 and 3), Damir Zhakparov (1 and 3) ((1) Swiss Institute of Allergy and Asthma Research, Davos, Switzerland, (2) ETH Zurich, Zurich, Switzerland, (3) Swiss Institute of Bioinformatics, Lausanne, Switzerland)
Categories: Machine Learning (cs.LG)
Comments: Accepted at the ICML 2026 Mechanistic Interpretability Workshop (peer-reviewed)

Abstract:Deep allergenicity classifiers are increasingly used in safety screening of novel foods, and recent protein language models have substantially improved protein-level allergenicity prediction. However, whether their explanations capture biologically meaningful information remains unclear. We introduce an epitope-grounded residue-level benchmark for quantitatively evaluating attribution faithfulness in protein allergenicity models. Across frozen ESM-2, multi-task ESM-2, and DeepPlantAllergy, protein-level classification was robust, yet classification-head explanation signals did not significantly exceed random in their residue-level alignment with annotated epitopes across AUROC, AUPRC, and Precision@k. Integrated Gradients identified residues that were functionally important to the model, but not overlapping annotated epitopes. Saturation mutagenesis further suggested classifiers may rely on physicochemical and compositional sequence features rather than epitope-specific mechanisms. Residue-level importance signals should therefore not be interpreted as immunological explanations for safety screening or hypoallergen design without quantitative validation. Code available: this https URL

[LG-95] FeLoG: Scalable and Efficient Distributed Graph Embedding with Feedback Loop Mechanism

Link: https://arxiv.org/abs/2606.22180
Authors: Peng Fang, Arijit Khan, Ziqiang Wu, Zhenli Li, Yibo Zhou, Fang Wang, Dan Feng
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Graph embedding maps graph nodes into low-dimensional vectors to support applications such as recommendation, fraud detection, and graph-based retrieval-augmented generation (GraphRAG). As graphs scale to billions of edges, scalable and efficient graph embedding has become increasingly important. Existing frameworks commonly adopt a sampling-training paradigm, in which mini-batches are constructed by sampling nodes and their neighbors. However, sampling is typically decoupled from evolving embedding quality, causing redundant exploration of well-trained regions while under-sampling undertrained nodes. At the system level, such decoupling further leads to excessive communication, serialized execution, and low resource utilization in distributed environments. We present FeLoG, a feedback loop-driven system for scalable distributed graph embedding. (1) FeLoG introduces feedback-coupled sampling and training, dynamically prioritizing undertrained nodes according to real-time embedding-quality feedback, thereby reducing redundant computation and accelerating convergence. (2) It employs activity-aware communication that compresses frequently occurring node sequences to reduce intra-machine PCIe traffic and selectively synchronizes frequently updated embeddings to reduce inter-machine communication. (3) It adopts a round-interleaved pipeline that overlaps next-round sampling with current-round training to improve CPU-GPU utilization. Experiments against six state-of-the-art baselines on large-scale graphs show that FeLoG achieves an average speedup of 27.9x, reduces communication cost by more than 53.1%, and sustains over 80% CPU-GPU utilization.

[LG-96] Beyond Time Series: Spatial Reasoning for Epidemic Forecasting via Multimodal Learning KDD2026 KDD

Link: https://arxiv.org/abs/2606.22171
Authors: Diana Guadalupe Gomez, Chenwei Wu, Zhiyi Wang, Liyue Shen, Alexander Rodríguez
Categories: Machine Learning (cs.LG)
Comments: To appear in the Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026), AI for Science Track

Abstract:Epidemic forecasting models typically rely on surveillance data reported over administrative regions, treating them as atomic units, thereby obscuring sub-regional spatial structure that shapes disease dynamics. We introduce a spatially structured multimodal epidemic forecasting setting that integrates region-level temporal surveillance data with spatially localized auxiliary signals that are misaligned in resolution and structure, reflecting realistic public health reporting constraints. Building on this formulation, we propose M-SPICE (Multimodal SPatIal Context for Epidemic Forecasting), a structure-aware spatiotemporal forecasting framework that performs joint reasoning over temporal disease dynamics and spatial context via attention-based multimodal fusion, allowing spatial signals to selectively condition temporal representations across forecast horizons. We evaluate our approach on real-world COVID-19, influenza, and influenza-like illness (ILI) forecasting tasks under realistic real-time evaluation protocols. Across all forecasting settings, our method consistently outperforms state-of-the-art multivariate time-series, multimodal, and epidemiological forecasting baselines while maintaining strong probabilistic forecasting performance. Finally, interpretability analyses reveal when, where, and how spatial signals are leveraged, highlighting settings in which purely temporal, region-aggregated models are most likely to fail.

[LG-97] Early-Exit Graph Neural Networks for Link Prediction

Link: https://arxiv.org/abs/2606.22167
Authors: Roman Knyazhitskiy, Andrea Giuseppe Di Francesco
Categories: Machine Learning (cs.LG)
Comments: Accepted at LoG@Pisa 2026

Abstract:Graph Neural Networks are effective for link prediction in various network-like structures; however, their speed/quality tradeoff has barely been studied. While inference time matters little in practice for small benchmarks, latency does limit applicability in large-scale domains. In this work, we explore early-exiting strategies that can be applied to Graph Neural Networks to perform link prediction faster. We use no auxiliary losses to enforce early exiting, allowing it to emerge as an implicit property of the architecture. We show that our method enables early exiting in several setups, moving the Pareto frontier on the HeaRT benchmark for GCN and SAS-GNN backbones. Our findings show that the inference speed of GNNs on many link-prediction problems can be improved while losing little, or even gaining, in prediction quality. The code is available in our repository: this https URL.

[LG-98] Drowning in Routine: Signal Dilution in Multi-Turn Agent Training ICML2026

Link: https://arxiv.org/abs/2606.22164
Authors: Yann Pernot (1 and 2), Vi Retault (2) ((1) Mila - Québec AI Institute, (2) Polytechnique Montréal)
Categories: Machine Learning (cs.LG)
Comments: Accepted at the FAGEN Workshop at ICML 2026, Seoul, South Korea. 14 pages, 9 figures

Abstract:Multi-turn agents interleave consequential decisions with routine execution: some actions change the downstream return distribution, while others are necessary but reward-equivalent. The cost of trajectory-level credit assignment, often attributed to long horizons, is in fact governed by decision density $\rho$: the fraction of turns whose actions affect the return. When decision density is low, routine turns create signal dilution: they add gradient variance to trajectory-level estimators such as GRPO without adding expected signal. Under explicit assumptions, the resulting turn-level to trajectory-level signal-to-noise ratio scales as $\rho^{-1/2}$, provided critic error remains controlled. The same analysis identifies the complementary regime: at high decision density, trajectory-level methods can remain competitive while avoiding the cost of a critic. In a controlled environment where $\rho$ is exactly tunable, the predicted scaling is recovered with $R^2 = 0.999$, and the training-step gap widens significantly as $\rho \to 0$.
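
One way to see where a $\rho^{-1/2}$ ratio can come from is the toy calculation below. It is an illustrative sketch, not the paper's derivation. Every modeling choice here is an assumption made for the illustration: $T$ turns per trajectory, $K = \rho T$ decision turns, each decision turn contributing expected signal $\mu$, each included turn contributing independent noise of variance $\sigma^2$, and a perfect critic that zeroes out routine turns.

```latex
% Toy model (assumed): T turns, K = \rho T decision turns,
% per-turn signal \mu on decision turns, i.i.d. per-turn noise variance \sigma^2.
\begin{align*}
\text{Trajectory-level: }\quad
  \mathrm{SNR}_{\mathrm{traj}} &= \frac{K\mu}{\sigma\sqrt{T}}
  && \text{(all $T$ turns add noise, only $K$ add signal)} \\
\text{Turn-level (perfect critic): }\quad
  \mathrm{SNR}_{\mathrm{turn}} &= \frac{K\mu}{\sigma\sqrt{K}}
  && \text{(routine turns receive zero advantage)} \\
\frac{\mathrm{SNR}_{\mathrm{turn}}}{\mathrm{SNR}_{\mathrm{traj}}}
  &= \sqrt{\frac{T}{K}} = \rho^{-1/2}.
\end{align*}
```

In this toy setting the ratio tends to 1 as $\rho \to 1$, which matches the abstract's remark that trajectory-level methods can stay competitive at high decision density.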

[LG-99] Parameterized Representations via Implicit Stochastic Modulation for High-Dimensional and High-Order Neural PDE Solvers

Link: https://arxiv.org/abs/2606.22150
Authors: Zhangyong Liang, Huanhuan Gao
Categories: Machine Learning (cs.LG)

Abstract:Solving high-dimensional and high-order PDEs is challenged by the coupled growth of spatial dimensionality and derivative order. Recent stochastic derivative estimators reduce this cost by replacing full derivative tensors with randomized dimension or Taylor estimators, but they are mostly designed for fixed physical parameters and require retraining for each new parameter. We show that direct conditional parameterization of such solvers entangles physical parameters with the high-order automatic differentiation graph, causing extra memory growth and parameter-induced variance amplification. We propose Parameterized Representations via Implicit Stochastic Modulation (PRISM), a plug-and-play framework for parameterized high-dimensional and high-order stochastic neural PDE solvers. PRISM uses a hyper-generator to map physical parameters to affine modulators that scale and shift a purely spatial latent manifold, while keeping parameter branches value-connected but spatial-tangent-disconnected. This design preserves unbiased stochastic dimension and Taylor estimators, removes the parameter encoder from high-order spatial AD, and provides a variance-aware Lipschitz envelope over the parameter space. We prove parameterized unbiasedness, estimation-error bounds, and convergence under bounded stochastic variance. Experiments with PRISM-STDE and PRISM-SDGD on nonlinear parameterized PDEs show stable zero-shot generalization, reduced memory usage, and scalability up to 100,000 dimensions on a single GPU, with efficient low-rank SVD adaptation for unseen parameters.

[LG-100] Meta-Reinforcement Learning via Evolution for Multi-Objective Combinatorial Supply Chain Optimisation

Link: https://arxiv.org/abs/2606.22146
Authors: Rifny Rachman, Bahrul Ilmi Nasution, Josh Tingey, Richard Allmendinger, Pradyumn Shukla, Wei Pan
Categories: Machine Learning (cs.LG)

Abstract:Meta-reinforcement learning is a promising approach to multi-objective optimisation because it enables rapid policy adaptation across changing environments and preference settings. However, conventional few-shot methods usually fine-tune from a single shared meta-policy, which can reduce solution diversity and limit exploration of the Pareto front, especially in high-dimensional combinatorial problems such as supply chain optimisation. We propose a population-based Meta-reinforcement learning framework that combines decomposition with evolutionary search in scalarisation weight space. The framework maintains a population of weight vectors, each associated with a distinct meta-policy trained through gradient-based meta-learning, and iteratively refines this population through elitist selection, crossover, and mutation guided by hypervolume and entropy contributions. We evaluate the method in a multi-objective supply chain setting with conflicting economic, environmental, and social goals, and further test its generality on standard reinforcement learning problems. The results show that the proposed approach yields more diverse, better distributed Pareto front approximations, improves cross-task adaptation, increases hypervolume by up to 32% over Meta-multi-objective reinforcement learning in the complex case, and attains the lowest average Hausdorff distance among all compared methods.

[LG-101] Physics-Informed Eikonal Caging for Whole-Arm Manipulation Planning

Link: https://arxiv.org/abs/2606.22143
Authors: Yan Zhang, Yiming Li, Yifei Dong, Florian T. Pokorny, Sylvain Calinon
Categories: Robotics (cs.RO); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: Under review

Abstract:Planning contact-rich whole-arm manipulation is challenging because interactions that involve extended robot geometry give rise to complex contact dynamics that are difficult to model accurately. This creates a need for planning principles that do not rely heavily on precise contact models. Caging offers one such geometric notion of robustness to modeling inaccuracy by restricting object escape through geometrically enclosing the object. However, existing caging formulations are difficult to incorporate into continuous optimization-based manipulation planning. We reformulate caging as a minimum-time escape problem in which the object seeks to leave an enclosing robot geometry in the shortest time. This yields a continuous escape-time field that measures the robot’s enclosure quality and we show it satisfies an eikonal equation. We therefore can approximate this field using a physics-informed neural network, producing a smooth differentiable representation that can be embedded directly into manipulation planning. The resulting objective supports whole-arm manipulation planning to favor robot configurations resisting object escape. This improves the manipulation robustness to contact model mismatch, thus enabling planning with simplified contact models, including quasi-dynamic approximations and simplified object geometry. Across simulation and real-world experiments, we show improved robustness to disturbances and contact-model mismatch relative to baselines. These results suggest that geometric enclosure can serve as a practical robustness primitive for whole-arm manipulation. A supplementary video, which includes an intuitive overview of our method and experiment video results, is available on our project webpage.

[LG-102] Reinforcement Learning-Based Traffic Signal Control for IoT-Enabled Intersections

Link: https://arxiv.org/abs/2606.22108
Authors: Yousef AlSaqabi
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures, submitted to IEEE Open Journal of Intelligent Transportation Systems

Abstract:Urban traffic congestion remains a persistent challenge in car-dependent cities, imposing significant economic and societal costs. Traffic signal systems are increasingly deployed as networked cyber-physical components within smart-city infrastructures, where distributed sensing and edge intelligence enable adaptive traffic management. This paper investigates reinforcement learning (RL) as an edge-intelligent approach for adaptive traffic signal operation at a signalized urban intersection in Kuwait. A Proximal Policy Optimization (PPO)-based controller is developed to dynamically allocate green-phase durations using locally observed traffic states, without relying on future demand information or centralized coordination. The controller is evaluated in a realistic simulation environment informed by real-world hourly traffic volume data from Kuwait, and is compared against both conventional fixed-time control and a vehicle-actuated controller representing the current state of practice, using average vehicle delay, queue length, and emissions as performance metrics. Under nominal conditions, the proposed controller reduces average vehicle delay by 46% relative to fixed-time control and 34% relative to actuated control, while also lowering per-vehicle CO2 emissions by approximately 23%. These performance gains persist under demand perturbations of +/-15%, generalize from weekday to weekend traffic patterns, and are corroborated by a reward function ablation; low variance across five random seeds confirms their statistical reliability. These findings demonstrate the practicality of learning-based edge traffic signal control as a building block for IoT-enabled smart-city transportation systems, and as a deployable precursor toward fully connected, Internet of Vehicles (IoV)-based urban mobility.

[LG-103] Frequency-Domain Neural ODEs for Modeling Non-Linear Dynamical Systems

Link: https://arxiv.org/abs/2606.22075
Authors: Mohammed Ashraf, Ayman A. El-Badawy
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 11 pages. Figures, tables, and results are at the bottom of the file

Abstract:Standard continuous-depth models, such as Neural Ordinary Differential Equations (NODEs), offer significant advantages in modeling physical systems by learning continuous vector fields rather than discrete temporal steps. However, when applied to complex dynamical systems, standard NODEs frequently struggle with highly nonlinear dynamics. This paper investigates the Frequency-domain Neural ODE (FNODE), an architecture that projects continuous temporal dynamics into the frequency domain using the Fast Fourier Transform (FFT). By operating in the frequency domain, the model provides better generalization to the dynamical system. The architecture is empirically evaluated against discrete models, specifically Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTMs), and other continuous-depth variants, including Augmented Neural ODE (ANODE), across four distinct dynamical systems: the Lotka-Volterra model, the forced Duffing oscillator, the Van der Pol oscillator, and the Lorenz system. To rigorously assess generalization and robustness, curriculum and ensemble learning are used to evaluate the model’s convergence by estimating confidence intervals across different ensemble models. The empirical results demonstrate that the FNODE architecture achieves better generalization while exhibiting remarkable convergence stability.

[LG-104] How Should a Simulation-to-Reality Transfer Budget Be Spent?

Link: https://arxiv.org/abs/2606.22062
Authors: Syed Hamzah Rizvi, Yash Vardhan Tomar
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Both authors contributed equally and share first authorship

Abstract:Simulation-to-reality transfer, often called sim-to-real transfer, is a central challenge in robot learning. Yet, the tradeoff between measuring a system more accurately and training over a broader range of simulated dynamics is still poorly understood. In this work, we focused on the allocation of real-robot measurement time between system identification and domain randomization. We studied this tradeoff in a controlled sim-to-sim pendulum setting, where a hidden-parameter model stands in for the physical robot, and the experiment sweeps identification rollouts against the width of the randomization distribution. Across the reality gaps and noise levels we tested, the measurement budget did most of the work. A small number of identification rollouts closed most of the transfer gap, and once any real data was available, policies performed best when trained at the estimated parameters rather than over a widened randomization band. Broad randomization that contained the true system still did not substitute for measurement. These results hold in a benign regime where the dynamics are identifiable and only two parameters are unknown, so structural model mismatch remains the setting where randomization breadth may become more valuable. Overall, our results suggest that sim-to-real pipelines should first measure the parameters they can and reserve randomization for the uncertainty that remains.

[LG-105] Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning ICML2026

Link: https://arxiv.org/abs/2606.22056
Authors: Tian Xu, Zexuan Chen, Zhilong Zhang, Yi-Chen Li, Chenyang Wang, Lei Yuan, Yang Yu
Categories: Machine Learning (cs.LG)
Comments: Paper accepted by ICML 2026

Abstract:Adversarial imitation learning (AIL) achieves high-quality imitation compared to behavioral cloning (BC), but demands substantial online environment interaction. Recent empirical work has explored initializing AIL algorithms with BC pretrained policies to address this limitation, yet a rigorous theoretical understanding of pretraining’s role in AIL remains elusive. This paper provides a systematic theoretical analysis and introduces principled pretraining algorithms for accelerating AIL. We begin by analyzing AIL with policy pretraining alone, identifying reward error as the dominant source of suboptimality. This reveals a critical and previously overlooked gap: the absence of reward pretraining. Motivated by this finding, we develop a principled policy-reward co-pretraining approach grounded in a reward shaping analysis. Our analysis uncovers a fundamental connection between expert policies and shaping rewards, which naturally gives rise to CoPT-AIL, an approach that jointly pretrains both policy and reward through a single BC procedure. We prove that CoPT-AIL achieves an improved imitation gap bound over standard AIL, establishing the first theoretical guarantee for the benefits of pretraining in AIL. Experimental results confirm CoPT-AIL’s superior performance over existing AIL methods.

[LG-106] What Do Neural Networks Learn for TDOA Estimation? A Cross-Architecture Probing Study INTERSPEECH2026

Link: https://arxiv.org/abs/2606.22020
Authors: Yaozhong Kang, Jiang Wang, Runwu Shi, Takeshi Ashizawa, Benjamin Yen, Kazuhiro Nakadai
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures, 2 tables. Accepted to Interspeech 2026. Code: this https URL

Abstract:Neural networks outperform classical GCC-PHAT for Time-Difference-of-Arrival (TDOA) estimation in noise and reverberation, yet their internal strategy remains unexplored. To uncover it, we turn GCC-PHAT’s mathematical steps into diagnostic targets, probing hidden layers of three architectures (MLP, CNN, Transformer) and complementing with gradient attribution and causal frequency masking. We find that cross-power computation consistently emerges across all architectures and conditions, while PHAT whitening, the defining step of GCC-PHAT, fails to emerge. Instead, networks learn a magnitude-aware frequency weighting that preserves per-frequency reliability information discarded by PHAT. This makes PHAT an information bottleneck: removing it from both classical and neural GCC pipelines improves performance under additive noise. On real-world reverberant data, PHAT remains the best classical weighting, but end-to-end networks achieve lower error by learning data-adaptive weighting.
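
For readers unfamiliar with the classical baseline, the sketch below shows the GCC pipeline the abstract refers to: cross-power spectrum, optional PHAT whitening, inverse transform, peak picking. It is a minimal pure-Python illustration that uses a naive O(n^2) DFT and circular delays. It is not the authors' code, and the names `dft` and `gcc_tdoa` are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def gcc_tdoa(x1, x2, phat=True, eps=1e-12):
    """Estimate the delay (in samples) of x2 relative to x1.

    phat=True applies PHAT whitening (keeps phase, discards magnitude);
    phat=False keeps the magnitude, i.e. plain cross-correlation.
    """
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    cross = [b * a.conjugate() for a, b in zip(X1, X2)]  # cross-power spectrum
    if phat:
        cross = [c / (abs(c) + eps) for c in cross]      # whitening step
    cc = [v.real for v in dft(cross, inverse=True)]      # correlation over lags
    peak = max(range(n), key=lambda i: cc[i])
    return peak if peak <= n // 2 else peak - n          # map to a signed lag
```

Setting `phat=False` skips the whitening step and keeps the per-frequency magnitude, which is the information the abstract says PHAT discards.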

[LG-107] Load Testing for Machine Learning Model Serving Systems at Scale

Link: https://arxiv.org/abs/2606.22013
Authors: Amr S. Abdelfattah, Nakul Tirumalai, Indu Mohanan, Xiao Li, Pengchao Wang, Dinakar Dhurjati, Eric Sung
Categories: Machine Learning (cs.LG); Performance (cs.PF)
Comments: The paper is accepted at ECSA 2026

Abstract:Machine learning (ML) model serving has become a dominant consumer of GPU infrastructure, yet capacity planning in these systems remains largely ad hoc. Under-provisioning leads to service-level objective (SLO) violations and production incidents, while over-provisioning results in substantial resource waste. This paper presents \sys, an industrial load testing framework for ML serving systems that systematically estimates serving capacity through an adaptive, feedback-driven search strategy. The approach leverages real-time performance signals, incorporating dampening, spike tolerance, and convergence detection to efficiently identify maximum sustainable throughput under SLO constraints. We evaluate \sys through a longitudinal analysis of 14 industrial case studies spanning four ML architecture classes: recommendation, ranking, vision, and NLP. This study demonstrates that systematic load testing leads to substantial improvements in GPU resource efficiency and operational reliability. Prior to adopting \sys, a significant fraction of model launches were under-provisioned, resulting in recurring incidents; these issues were substantially reduced after deployment. Our results show that ML-specific design decisions are critical to accurate capacity estimation: workload calibration using recorded traffic reduces estimation error from approximately 30% to 2–6%, while proper warmup handling yields a 22.2% improvement in accuracy. Further analysis reveals key factors influencing prediction error, including model size and co-location effects. This paper distills six lessons and derives architectural guidelines for ML load testing, offering actionable insights for building reliable and efficient ML serving systems.
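
A feedback-driven capacity search of the kind described could look roughly like the sketch below. It is a generic ramp-then-bisect loop written for illustration and is not the framework's actual algorithm. Everything in it is an assumption: the function name `find_max_qps`, the `measure_latency` callback, and all parameters. Here a bounded multiplicative ramp stands in for dampening, the median over repeated probes stands in for spike tolerance, and a relative bracket width stands in for convergence detection.

```python
from statistics import median

def find_max_qps(measure_latency, slo_ms, start_qps=100.0, growth=1.5,
                 probes=3, rel_tol=0.02, max_steps=50):
    """Search for the highest load whose median probe latency meets the SLO."""
    def meets_slo(qps):
        # The median over several probes tolerates an isolated latency spike.
        return median(measure_latency(qps) for _ in range(probes)) <= slo_ms

    lo, hi = 0.0, None           # lo: highest passing load, hi: lowest failing load
    qps = start_qps
    for _ in range(max_steps):
        if meets_slo(qps):
            lo = qps
        else:
            hi = qps
        if hi is None:
            qps = lo * growth    # bounded ramp-up while no failure has been seen
        elif hi - lo <= rel_tol * hi:
            break                # bracket is tight enough: converged
        else:
            qps = (lo + hi) / 2  # bisect between passing and failing loads
    return lo
```

In practice `measure_latency` would drive real or replayed traffic at the requested rate and return a tail-latency figure; the abstract's points about workload calibration and warmup handling would apply inside that callback.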

[LG-108] Prefix-Guided On-Policy Distillation: Mining Golden Trajectories from Rollouts

Link: https://arxiv.org/abs/2606.21994
Authors: Qingfei Zhao, Huan Song, Shuyu Tian, Jiawei Shao, Xuelong Li
Categories: Machine Learning (cs.LG)

Abstract:On-policy distillation (OPD) improves reasoning models by applying dense teacher supervision on student-sampled trajectories. However, scaling OPD to long-horizon mathematical reasoning exposes a reliability and efficiency problem: standard OPD assigns every sampled candidate the same long rollout budget, even though some trajectories may quickly become weakly aligned with the teacher and provide less useful supervision. Prior analyses suggest that successful OPD depends on local teacher-student compatibility, which can be measured by top-k overlap on student-visited prefixes. When this overlap is low, continuing to generate or train on long suffixes may waste computation and introduce noisy learning signal. To address this, we introduce Prefix-Guided On-Policy Distillation (PG-OPD), a simple rollout-allocation framework that uses fixed-length prefixes to estimate trajectory value before expensive long-horizon generation. PG-OPD first decodes every sampled candidate to the same prefix length, computes teacher-student top-k overlap within an early probe window of that prefix, and selectively continues high-overlap candidates to a fixed long length. Low-overlap candidates stop at the fixed prefix, avoiding unnecessary suffix generation. Across diverse teacher-student combinations on AMC, AIME, and HMMT benchmarks, PG-OPD improves average accuracy by up to 4.80 points while reducing training time by up to 2.46x. These results suggest that prefix-level compatibility provides a practical signal for directing OPD computation toward trajectories that remain learnable from the teacher.
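
The prefix-probing step can be sketched as follows. This assumes per-position logits over a shared vocabulary are available for both models on the student-sampled prefix. The fixed-threshold selection rule and the names `topk_ids`, `prefix_overlap`, and `select_for_continuation` are hypothetical; the abstract does not say how candidates are ranked or thresholded.

```python
def topk_ids(logits, k):
    """Indices of the k largest logits at one position."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])

def prefix_overlap(student_logits, teacher_logits, k, probe_len):
    """Mean teacher-student top-k overlap over the first probe_len positions."""
    positions = range(min(probe_len, len(student_logits)))
    scores = [
        len(topk_ids(student_logits[t], k) & topk_ids(teacher_logits[t], k)) / k
        for t in positions
    ]
    return sum(scores) / len(scores)

def select_for_continuation(candidates, k=2, probe_len=2, threshold=0.5):
    """Keep candidates whose probe-window overlap reaches the (assumed) threshold.

    candidates: iterable of (name, student_logits, teacher_logits) tuples.
    """
    return [
        name
        for name, s_logits, t_logits in candidates
        if prefix_overlap(s_logits, t_logits, k, probe_len) >= threshold
    ]
```

Candidates returned by `select_for_continuation` would be decoded to the long length, while the others stop at the fixed prefix.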

[LG-109] VegSim: A Geospatial World Model for Scenario-Conditioned Vegetation Simulation

Link: https://arxiv.org/abs/2606.21961
Authors: Irene Iele, Elena Mulero Ayllón, Paolo Soda, Matteo Tortora
Categories: Machine Learning (cs.LG)

Abstract:Vegetation monitoring under climate stress requires answering not only how it will evolve given the expected weather, but how it would respond to alternative meteorological conditions. Forecasting models return the expected vegetation state for the observed weather and cannot answer these scenario-conditioned questions, because future weather is fixed to the recorded trajectory. We present VegSim, a geospatial world model for scenario-conditioned vegetation simulation. VegSim infers a latent vegetation state from sparse satellite-derived NDVI histories, past meteorological covariates, and static spatial context, propagates it forward under future weather forcing through recurrent latent dynamics, and decodes predictive NDVI quantiles at each lead time. Because future forcing enters as a controllable input, the same trained model supports probabilistic forecasting under observed weather and conditional simulation under user-defined meteorological forcing, without supervision on scenario responses. We evaluate VegSim on GreenEarthNet across in-distribution data and spatial, temporal, and joint spatial-temporal shift, where it achieves strong point and probabilistic accuracy against time series and Earth observation forecasting baselines while using a compact architecture. We then simulate vegetation responses across Europe under four meteorological scenarios, and in a France summer 2022 case study, obtaining spatially coherent patterns consistent with known sensitivity to temperature and precipitation. The code is available at this https URL.

[LG-110] A Standard Processing Pipeline for High-accuracy Measurement of Few-shot Regression on Laser Induced Breakdown Spectroscopy

Link: https://arxiv.org/abs/2606.21960
Authors: Hao Li
Categories: Machine Learning (cs.LG)

Abstract:Laser-induced breakdown spectroscopy (LIBS) faces challenges in high-accuracy quantitative measurement under few-shot scenarios due to spectral noise and data scarcity. Traditional preprocessing methods often fail to preserve subtle spectral features or capture nonlinear correlations. This work proposes a standardized processing pipeline integrating diffusion-based denoising, attention-based autoencoder for dimensionality reduction, group shuffling data augmentation, and ordinary least squares regression. The diffusion module employs a 3D UNet architecture to remove spectral noise while preserving essential emission features. The attention-autoencoder captures nonlinear spectral correlations, effectively reducing high-dimensional spectral data to compact latent representations. Group shuffling data augmentation enhances model robustness by creating synthetic samples through feature group permutation. Experimental results on multiple elemental concentrations demonstrate that our Diffusion-DA-AE pipeline achieves superior performance with a mean RMAE of 0.2847, representing 37.7% and 37.6% improvements over baseline autoencoder and traditional PCA-PLS regression, respectively. The framework’s effectiveness validates its generalizability and establishes a new benchmark for few-shot LIBS regression.

[LG-111] Learning by Shifting: Temporal View Construction for Time Series Contrastive Learning ECML KDD

Link: https://arxiv.org/abs/2606.21957
Authors: Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor
Subjects: Machine Learning (cs.LG)
Comments: Published in the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)

Abstract:Supervised learning demands large quantities of labeled data, a bottleneck that is expensive and reliant on domain-specific expertise. Self-supervised learning, particularly contrastive learning, has emerged as a compelling alternative, enabling rich representation learning directly from unlabeled data. Yet its success hinges critically on the design of positive and negative sample pairs. Existing approaches for time series rely on hand-crafted augmentations and masking heuristics that embed strong domain assumptions, often limiting generalization across diverse temporal patterns and potentially introducing spurious correlations. In this work, we challenge this paradigm by demonstrating that explicitly encoding temporal shift invariance through a simple, deterministic view construction is sufficient to learn strong representations for time series classification. By exploiting temporal structure, our method, Shift Invariant Feature Training (ShiFT), achieves state-of-the-art performance on six diverse real-world time series benchmark datasets, as well as the UCR and UEA archives, while reducing training time. Beyond empirical performance, we present a systematic analysis of contrastive learning dynamics in time series settings, examining the effects of batch size and the number of negatives on downstream performance. Our findings provide practical insights for designing efficient contrastive learning frameworks for time series representation learning. The source code is publicly available at this https URL.

[LG-112] On the Curse of Dimensionality in Private Sparse Covariance Estimation and PCA

Link: https://arxiv.org/abs/2606.21951
Authors: Syamantak Kumar, Shourya Pandey, Purnamrita Sarkar, Kevin Tian
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST)

Abstract: We study high-dimensional differentially private (DP) covariance estimation in the operator norm, and principal component analysis (PCA), under $k$-row-column sparsity ($k$-RCS) of the covariance matrix. In the non-private setting, it is known that $\mathsf{poly}(k, \log d)$ samples suffice to solve both of these problems. However, the only comparable result known under DP (Wang et al. 2021) requires $\Omega(d)$ samples under standard parameterizations of the problem. We investigate when this curse of dimensionality is inherent for sparse covariance estimation tasks under DP. On the upper bound front, we show that a $\mathsf{poly}(k, \log d)$ sample complexity for PCA is possible under DP, if we also posit sparsity of the leading eigenvector. We complement this result with $\mathsf{poly}(d)$ lower bounds under DP for both sparse covariance estimation and PCA, establishing an exponential gap between the private and non-private variants of these problems when $k = \mathsf{polylog}(d)$. To our knowledge, no such separation has previously been demonstrated for any sparse estimation problems in private high-dimensional statistics. Our techniques are flexible enough that they imply stronger lower bounds even for the well-studied problem of standard DP PCA, without sparsity assumptions.

[LG-113] DevoTG: Temporal Graph Neural Networks for Modeling C. elegans Developmental Connectomics

Link: https://arxiv.org/abs/2606.21940
Authors: Jayadratha Gayen, Bradly Alicea
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 13 pages, 5 figures. Code available at this https URL

Abstract:Understanding how a nervous system wires itself from birth to adulthood is a fundamental challenge in developmental neuroscience. We present DevoTG, a temporal graph framework that applies Temporal Graph Neural Networks (TGNs) to two complementary representations of C. elegans neural development: a Continuous-Time Dynamic Graph (CTDG) of cell division events derived from cell lineage data, and a Discrete-Time Dynamic Graph (DTDG) of the developing synaptic connectome spanning eight reconstructed electron-microscopy datasets. On the lineage prediction task, our TGN achieves a mean test AUC of 0.839 +/- 0.007 (5 seeds; validation AUC 0.937 +/- 0.001), outperforming a static GNN with the identical architecture by 26 AUC points (0.577 +/- 0.080), demonstrating that temporal memory is the decisive factor. Applied to the connectome DTDG, DevoTG identifies three connection stability classes (stable, developmental, and variable) across 225 neurons and 858 to 2,496 connections over development (L1 birth to adult), providing a temporal-graph-theoretic complement to the individual-variability classification of Witvliet et al. Analysis of hub command interneurons AVA, AVB, and AVE reveals their persistent centrality and how their integration roles are progressively reinforced across larval stages. Accompanying interactive visualizations (3D animated networks, centrality heatmaps, and a spatiotemporal lineage graph) make developmental dynamics accessible for biological hypothesis generation. DevoTG is open-source and designed for extension to other developing nervous systems. Code is publicly available at this https URL.

[LG-114] Selective Ensemble Based on Preference-Directed Multi-Objective Bandits

Link: https://arxiv.org/abs/2606.21929
Authors: Lanjihong Ma, Zhen-Yu Zhang, Masashi Sugiyama, Zhi-Hua Zhou
Subjects: Machine Learning (cs.LG)

Abstract:Selective ensemble for modern machine learning systems requires choosing promising model candidates under limited evaluation budgets, while downstream tasks often specify only partial preferences over capabilities such as accuracy, robustness, and reasoning. This setting naturally gives rise to a sequential decision problem under partially specified linear preferences. We formalize it as preference-directed multi-objective bandits (PDMOB), where admissible trade-offs are represented by a polyhedral preference cone. Based on this formulation, we introduce Pareto C -optimality, which recovers standard Pareto optimality and single-weight scalarization as special cases. We then propose the preference-directed upper confidence bound (PrefUCB) algorithm, which maintains directional confidence intervals to guide exploration. We analyze both indicator-based and gap-weighted regret, and establish instance-dependent logarithmic bounds for both criteria, recovering the optimal logarithmic dependence on the horizon T in classical special cases. Experiments on large pre-trained model selective ensemble tasks and online asset allocation under institutional mandates validate the efficacy of our method.

[LG-115] Data Pruning: Redundant Problematic and Interdependent Samples

Link: https://arxiv.org/abs/2606.21916
Authors: Leon Freese, Marthinus W. Theunissen
Subjects: Machine Learning (cs.LG)
Comments: This work is a preprint of a published paper by the same name

Abstract:The performance of deep learning models is affected by not only data quantity but also data quality. Data pruning is a process by which practitioners can reduce the size of a dataset by only keeping the most important training data points, thereby achieving similar test set performance. We empirically investigate two popular data pruning methods under noisy and noiseless conditions and show that these methods fail in the presence of significant label noise. We highlight that the success of data pruning is distinctly affected by three factors: redundancy in the dataset, the presence of problematic samples, and interdependence between samples. We perform a detailed investigation on commonly used benchmark classification datasets and neural network architectures. We find that our observations are consistent across data distributions and training protocols.

[LG-116] Continuous Behavioral Authentication via Multi-Expert BERT Log Analysis for Secure Data Sharing

Link: https://arxiv.org/abs/2606.21900
Authors: Stergios Lantzos, Ilias Syrigos, Apostolos Apostolaras, Thanasis Korakis
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Continuous authentication for mobile and zero-trust systems requires nonintrusive evidence confirming the enrolled user-device context remains valid after initial login. This paper presents a BERT log analysis framework for continuous behavioral authentication using Android system logs. The proposed pipeline parses logcat streams into event templates and dynamic variables, pre-trains a domain-adapted BERT encoder on Android log syntax, and fine-tunes three expert models for network/device identity, battery-transition timing, and Wi-Fi topology. The expert confidence scores are fused through a log-space transformation and a 5-nearest-neighbor distance classifier to generate a normality score that is provided to a Policy Decision Point (PDP) for risk-aware access control. Experiments on normal traces, controlled anomaly injections, and benign Wi-Fi perturbations indicate that multi-expert BERT log analysis can detect semantic, battery-timing, and topology deviations in the evaluated setting while maintaining sub-1% False Positive Rate (FPR). The results suggest that Android system logs are a practical sensor-free signal for continuous authentication and user-device context assurance.

[LG-117] Ranking-and-Selection with Multiple Correct Answers and Non-Answerable Estimates

Link: https://arxiv.org/abs/2606.21889
Authors: Qiaoqiao Wang, Wei You
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted for oral presentation at the 2026 Winter Simulation Conference

Abstract:We study fixed-precision ranking-and-selection in structured settings where the answer may be non-unique and where noisy estimates may temporarily admit no valid answer at all. This phenomenon arises naturally in problems such as multi-fidelity ranking-and-selection and identifying a Condorcet winner from pairwise comparisons. To address this, we propose a unified framework based on answer-wise acceptance sets, restricted generalized likelihood ratio stopping, and an answer-pitfall decomposition that yields a max-max-min characteristic value and a common sampling principle. We introduce ENDS, a general procedure that combines estimation, nomination, pitfall detection, and cost-aware information-directed selection. We instantiate ENDS for various problems by deriving explicit formulas. Extensive numerical experiments show that this unified recipe performs well across a broad range of pure-exploration problems and offers a practical framework and proof-of-concept algorithmic recipe.
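The "no valid answer" situation mentioned in this abstract can be illustrated with the Condorcet-winner example. The sketch below is not the ENDS procedure. It only shows, under assumed toy preference matrices (`true_pref` and `noisy_pref` are hypothetical), how a noisy estimate of pairwise preferences can contain a cycle and therefore admit no Condorcet winner even though the true matrix has one.

```python
from typing import Optional, Sequence


def condorcet_winner(pref: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index that beats every other alternative, or None.

    pref[i][j] is the (estimated) probability that i is preferred to j.
    """
    n = len(pref)
    for i in range(n):
        if all(pref[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


# Hypothetical ground truth: alternative 0 beats both 1 and 2.
true_pref = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]

# Hypothetical noisy estimate with a cycle 0 > 1 > 2 > 0: no winner exists.
noisy_pref = [
    [0.5, 0.6, 0.45],
    [0.4, 0.5, 0.6],
    [0.55, 0.4, 0.5],
]

print(condorcet_winner(true_pref))
print(condorcet_winner(noisy_pref))
```

A stopping rule that only looks at the current estimate would have nothing to return in the second case, which is the kind of temporarily non-answerable estimate the abstract refers to.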

[LG-118] WiSP: A Working-Set View of Mixture-of-Experts Serving on Extremely Low-Resource Hardware

Link: https://arxiv.org/abs/2606.21868
Authors: Jiamu Zhang, Liang Wu, Mayank Darbari, Liangjie Hong
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 6 figures, 9 tables. Preprint, work in progress

Abstract:Modern Mixture-of-Experts (MoE) models place most of their parameters in expert layers, yet only a small fraction of those experts are used for any token. The unused weights must still be stored where the GPU can reach them. On commodity GPUs the common fix is layer-level CPU offloading, which keeps memory low but streams all of a layer’s experts across PCIe on every forward pass, losing much of MoE’s sparsity benefit. We cast low-resource MoE serving as a working-set management problem on the GPU: routed expert weights and the key-value (KV) cache are two streams of memory demand competing for limited VRAM. We realize this in WiSP (Working-Set Paging), a routing-aware expert pager that plugs into an unmodified serving engine with byte-identical outputs. Keeping resident only the experts a workload reuses, WiSP reaches up to 1.95x the decode throughput of static offload at the same memory budget when the model does not fit. We also find that prefetching experts from predicted routing helps little in single-stream decode: the bottleneck is PCIe bandwidth, not prediction accuracy. This shifts the question from prefetching to allocation: how should VRAM be split between experts and the KV cache? We answer with MV-WSA (Marginal-Value Working-Set Allocation), which equalizes marginal latency benefit per byte subject to a KV admission floor. MV-WSA runs either as an offline configurator or as an online controller that resizes both pools while serving. In real serving the offline configurator is the only policy we test that does well on both prefill and decode; in trace-driven simulation it stays within a few percent of a per-workflow oracle while fixed splits are about 20% worse. The online controller adds a further 1.20x without changing model outputs.
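The allocation idea described here, equalizing marginal latency benefit per byte subject to a KV floor, can be sketched as a greedy loop. This is not the paper's MV-WSA implementation. The function name `allocate_vram`, the chunk size, and both benefit curves are hypothetical, and the curves are assumed to be concave, which is what makes a greedy split approximately equalize marginal benefit.

```python
import math
from typing import Callable, Tuple


def allocate_vram(
    total_mb: int,
    kv_floor_mb: int,
    expert_benefit: Callable[[float], float],
    kv_benefit: Callable[[float], float],
    step_mb: int = 64,
) -> Tuple[int, int]:
    """Split VRAM between an expert pool and a KV-cache pool.

    Each benefit function maps a pool size in MB to an estimated latency
    saving. The KV pool first receives its admission floor; remaining
    memory is handed out in fixed chunks to whichever pool gains more
    from the next chunk.
    """
    if kv_floor_mb > total_mb:
        raise ValueError("KV floor exceeds the total budget")
    experts_mb, kv_mb = 0, kv_floor_mb
    remaining = total_mb - kv_floor_mb
    while remaining >= step_mb:
        gain_experts = expert_benefit(experts_mb + step_mb) - expert_benefit(experts_mb)
        gain_kv = kv_benefit(kv_mb + step_mb) - kv_benefit(kv_mb)
        if gain_experts >= gain_kv:
            experts_mb += step_mb
        else:
            kv_mb += step_mb
        remaining -= step_mb
    return experts_mb, kv_mb


# Hypothetical concave benefit curves, for illustration only.
experts_mb, kv_mb = allocate_vram(
    total_mb=4096,
    kv_floor_mb=512,
    expert_benefit=lambda mb: 10.0 * math.log1p(mb / 512),
    kv_benefit=lambda mb: 4.0 * math.log1p(mb / 256),
)
print(experts_mb, kv_mb)
```

In a real system the benefit curves would have to be measured per workload rather than assumed.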

[LG-119] Mat-Pref: Verifiable-Reward Training Improves Compositional Reasoning in Inorganic Materials ICML

Link: https://arxiv.org/abs/2606.21830
Authors: Sarrah R. Mikhail Leung, Taehan Kim, Jeongbin Park
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, Accepted at ICML AI4Physics 2026 Workshop

Abstract: Reinforcement learning from verifiable rewards (RLVR) has driven rapid progress in mathematical and code reasoning, but when extended to science, existing benchmarks do not decompose what generalizes: do gains reflect structural transfer, property transfer, or memorization? We introduce Mat-Pref, a benchmark of 10,837 ionic-substitution questions across 11 inorganic structure families, grounded in density functional theory calculations from the Materials Project, with three evaluation splits that isolate in-distribution performance, generalization to entirely held-out structure families, and cross-property transfer: applying band-gap reasoning to hosts seen during training only through formation-energy supervision. Four zero-shot frontier models (70-671B parameters) remain in the 33-54% range on every split, confirming that scale alone does not resolve the compositional chemical reasoning this task demands. A two-stage pipeline of supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) lifts Qwen3-8B to 65.2% in-distribution and 71.6% on held-out families, exceeding zero-shot Qwen3-235B by over 20 percentage points on both structural-generalization splits. Self-consistency sampling shows that the SFT policy can already produce correct answers but cannot reliably surface them as the modal response; GRPO reshapes the distribution so that correct answers become modal rather than merely reachable, and this sharper commitment is visible mechanistically: logit lens analysis reveals a $\sim$20pp advantage in answer crystallization at the critical decision layer. We formalize this observation as a distractor-permutation consistency metric under which GRPO narrows the gap between lenient scoring (at least one permutation correct) and strict scoring (all permutations correct) from 24.0 to 14.3 percentage points.
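The lenient and strict scores defined at the end of this abstract can be computed as follows. This is a minimal sketch based only on the parenthetical definitions given there. The function name and the toy `results` data are hypothetical, and the benchmark's actual evaluation code may differ.

```python
from typing import List, Sequence, Tuple


def permutation_scores(results: Sequence[Sequence[bool]]) -> Tuple[float, float, float]:
    """Compute lenient accuracy, strict accuracy, and their gap.

    results[q][p] is True when the model answered question q correctly
    under distractor permutation p.
    """
    n = len(results)
    lenient = sum(any(r) for r in results) / n  # at least one permutation correct
    strict = sum(all(r) for r in results) / n   # every permutation correct
    return lenient, strict, lenient - strict


# Hypothetical outcomes for four questions, three permutations each.
results: List[List[bool]] = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, True],
]
lenient, strict, gap = permutation_scores(results)
print(lenient, strict, gap)  # 0.75 0.5 0.25
```

A smaller gap means the model's answer depends less on the order in which the distractors are presented.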

[LG-120] Spectrally Safe Neural Operator Warm-Starts for Large-Scale Newton Solvers

Link: https://arxiv.org/abs/2606.21828
Authors: Jaemin Oh, Youngkyu Lee, Jerome Darbon, George Em Karniadakis
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures, 7 tables

Abstract: Neural operators are increasingly used to warm-start Newton solvers for nonlinear PDEs, on the premise that a low test error places the initial guess inside the basin of attraction. We show that this premise is unreliable. An operator trained to the relative $L^2$ error $O(10^{-3})$ can still produce an initial state in which the discrete Jacobian is indefinite, because the mean-squared training controls error on average while leaving localized pointwise violations of the underlying physics. For a nearly incompressible hyperelasticity problem, we trace this to the predicted volume change: the operator disperses $\mathrm{det}\,F$ well away from one, and the resulting Jacobian acquires negative eigenvalues even when the predicted field is visually indistinguishable from the reference. At a small scale, this is a nuisance; at a multi-million degree-of-freedom scale, it is disqualifying, since the conjugate gradient and other Krylov solvers needed for memory-feasible Newton steps assume a definite spectrum. We then show that a short, label-free fine-tuning phase – penalizing the operator against the discrete energy, with no additional solution data – shifts the Jacobian spectrum back to positive definite. Combined with an inexact outer loop, this gives a warm-started Newton method that converges across the full loading range where the unregularized operator fails, reaching up to $5.4\times$ wall-clock speedup over incremental continuation on a 3D problem with 6.4 million degrees of freedom.

[LG-121] Causal Gaussian Processes for Robust Treatment Effect Evaluation with Unobserved Confounding

Link: https://arxiv.org/abs/2606.21809
Authors: Junzhe Zhang, Jingyuan Chen, Elias Bareinboim
Subjects: Machine Learning (cs.LG)

Abstract:The presence of confounding bias poses a key challenge in policy evaluation, as the target causal effects of actions are not identifiable (i.e., underdetermined) from observational data. On the other hand, existing confounding-robust evaluation strategies require detailed prior knowledge about the environment or apply only to discrete treatments and outcomes. This paper investigates causal effect evaluation over the continuous domain from confounded observations, while requiring only basic temporal ordering between the treatment and the outcome. We introduce a universal discretization of the exogenous domains that approximates the observational and interventional distributions of any causal model with arbitrary accuracy using a finite number of latent states. Building on this newfound universal approximation property, we develop a novel family of Causal Gaussian process (CGP) models that effectively approximate the observational and interventional distributions of any causal model with confounded observations.

[LG-122] Causal Variational Deep Embedding: A Family of Interventional Generators for Confounded Images

Link: https://arxiv.org/abs/2606.21806
Authors: Jingyuan Chen, Kangrui Ruan, Junzhe Zhang
Subjects: Machine Learning (cs.LG)

Abstract:Deep generative models reproduce the observational distribution of their training data, inheriting any spurious associations it contains. A common source is an unobserved confounder that shapes both an attribute the user wants to control at sampling time and an attribute expected to vary in response. Existing causal generative approaches resolve the resulting ambiguity by imposing structural assumptions strong enough to single out one interventional distribution; in image domains, such assumptions are rarely warranted, and the data is generally consistent with a set of distinct causal mechanisms – a feasible region of interventional distributions. We propose CauVaDE (Causal Variational Deep Embedding), built on a canonical augmented SCM in which the unobserved confounder collapses, without loss of generality, into a discrete latent cluster of bounded support while continuous variation is absorbed into independent noises. We prove that this canonical class is dense, in both observational and interventional Wasserstein distance, in the class of augmented SCMs compatible with a given causal diagram, and instantiate it as a mixture variational autoencoder whose cluster variable plays the role of the canonical confounder. An entropy regularizer with weight \gamma on the cluster posterior then traces a family of candidate causal effects that fit the observational data to comparable likelihood while spanning the feasible region. Experiments on image data benchmarks show that CauVaDE produces diverse interventional samples and improves FID against an unconfounded reference.

[LG-123] Discretizing Reward Models

Link: https://arxiv.org/abs/2606.21795
Authors: Vijay Viswanathan, Shiqi Wang, Devamanyu Hazarika, Chirag Nagpal, Tongshuang Wu, Graham Neubig, Yuning Mao
Subjects: Machine Learning (cs.LG)

Abstract:Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike “verifiable rewards” which typically produce binary scores, reward models typically produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of “reward model accuracy,” we propose evaluating reward models using distinct measures of “discriminative ability” and “specificity” (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.

[LG-124] What Do Lorentz-Equivariant Jet Taggers Learn? ICML2026

Link: https://arxiv.org/abs/2606.21790
Authors: Jay Agarwal, Siddharth Khare, Dhruv Kumar
Subjects: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Accepted at the AI4Physics Workshop, ICML 2026. 21 pages, 15 figures

Abstract:We study what Lorentz-equivariant jet taggers learn internally, using equivariance tests, linear probes and grade ablations across five models including L-GATr, L-GATr-slim and LLoCa-T. Linear probes show that equivariant models suppress frame-dependent pseudorapidity to zero while encoding jet mass and N-subjettiness strongly. Grade ablations on L-GATr reveal that bivector channels are negligible for top-quark tagging while vector-like channels are dominant but seed variable, consistent with the network exploiting multiple representational pathways. These results characterize which physical features and algebraic grade structures carry discriminative information in equivariant taggers and may inform future development of such models.

[LG-125] RocketPFN: Accurate Time Series Classification via In-Context Learning

Link: https://arxiv.org/abs/2606.21786
Authors: Franco Martino O’Rourke, Ana Trisovic, Dimitris Bertsimas
Subjects: Machine Learning (cs.LG)
Comments: 10 pages main text, 4 figures; 15 pages including references and appendix

Abstract: We introduce RocketPFN, a training-free pipeline for time series classification that combines random convolutional feature extraction (Rocket) with in-context classification via a pretrained tabular foundation model (TabPFN v2.5). On 92 UCR datasets (30-resample protocol), RocketPFN matches HC2, the strongest published method on the archive, in mean accuracy (both 0.900, Wilcoxon $p=0.50$), with no training on the target data and a median inference time of 30 seconds per fold. It also significantly outperforms every individual classifier in the HC2 ensemble. On UEA (20 datasets) the difference is likewise not statistically significant. A separate comparison concerns TSC foundation models: when paired with the same downstream classifier, MOMENT, Mantis, and MantisV2 are all significantly outperformed by RocketPFN using fewer extracted features and no learned parameters ($p<0.001$ in each case). This holds even when the encoders were pretrained on corpora that include the UCR training samples. We propose this two-stage pipeline as a reference point for evaluating zero-shot TSC foundation models.

[LG-126] A Causal DAG Prior for Synthetic Time-Series Classification Datasets ICML2026

Link: https://arxiv.org/abs/2606.21776
Authors: Franco Martino O’Rourke, Ana Trisovic, Dimitris Bertsimas
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the 2nd ICML 2026 Workshop on Foundation Models for Structured Data (FMSD). 4 pages (main text), 2 figures, plus references and appendix

Abstract: A Prior-data fitted Network learns the posterior predictive induced by its training prior; bringing this paradigm to multivariate time-series classification therefore calls for a synthetic generator that produces complete labelled datasets with temporal structure. We introduce a causal prior that synthesizes each dataset from a randomly sampled DAG over typed nodes across two modalities (tabular attributes and time series), natively producing multivariate, multi-class TSC datasets with cross-modal causal structure across channels, timesteps and labels, a regime not addressed by existing synthetic priors. To validate the prior, we finetune TabPFN v2.5 with minimal adaptations and evaluate on 75 UCR/UEA datasets within TabPFN’s operating regime. Finetuning on our generator significantly outperforms both the unmodified upstream model and a tabular-only ablation of the same prior (Wilcoxon signed-rank $p=3.0\times 10^{-8}$ on ROC-AUC), isolating the contribution of the cross-modal temporal structure.

[LG-127] Decision-Focused Learning: When and Why Traditional Prediction Models Fail

Link: https://arxiv.org/abs/2606.21773
Authors: Mo Liu
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract: Plugging predictions of unknown parameters into downstream optimization problems, often referred to as the "predict-then-optimize" paradigm, has long been a standard approach in decision-making under uncertainty. However, improved predictive accuracy does not, in general, translate into improved decision quality. This disconnect has motivated growing interest in decision-focused learning (DFL) within the operations research community. This tutorial reviews recent developments in DFL and highlights key methodological insights, with a particular focus on stochastic linear programming as the downstream decision-making problem. We discuss why several widely used tools in traditional statistical learning are not directly suited to decision-focused settings and must be rethought, including (i) data collection strategies driven purely by predictive uncertainty and (ii) distributional distance measures such as the Wasserstein distance. We summarize properties of DFL that distinguish it from conventional predictive modeling and provide insights into the development of new decision-focused tools.
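The claim that better predictive accuracy need not give better decisions can be checked on a two-option toy problem. The sketch below is an illustration with made-up numbers and is not taken from the tutorial. The decision is to pick the option with the lowest predicted cost, and decision quality is measured as regret against the true costs.

```python
from typing import Sequence


def mse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared prediction error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def decide(costs: Sequence[float]) -> int:
    """Downstream optimization: choose the option with the lowest cost."""
    return min(range(len(costs)), key=costs.__getitem__)


def regret(pred: Sequence[float], true: Sequence[float]) -> float:
    """True cost of the decision made from pred, minus the best true cost."""
    return true[decide(pred)] - min(true)


# Hypothetical numbers chosen for illustration.
true_costs = [1.0, 1.2]
pred_accurate = [1.15, 1.05]  # small errors, but the ordering is flipped
pred_crude = [0.5, 2.0]       # large errors, but the ordering is right

print(mse(pred_accurate, true_costs), regret(pred_accurate, true_costs))
print(mse(pred_crude, true_costs), regret(pred_crude, true_costs))
```

The first prediction has the lower mean squared error yet selects the more expensive option, while the second has a much larger error and zero regret. Only the ordering of the predicted costs matters for this decision.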

[LG-128] AdaPrivate-TS: Private Thompson Sampling for Contextual Bandits with Privacy Amplification

Link: https://arxiv.org/abs/2606.21757
Authors: Mohammadreza Riyazat, Eranga Ukwatta
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: Accepted at the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026) as a long paper; selected for oral presentation. 12 pages, 6 tables

Abstract: We present AdaPrivate-TS, a differentially private contextual bandit algorithm that combines Thompson Sampling with batched zCDP composition. Our key insight is that differential privacy noise inflates the posterior covariance in a structured way: adding Gaussian noise $N(0,\sigma^2 I)$ to $b$ yields sampling covariance $v^2 A^{-1} + \sigma^2 A^{-2}$, which Thompson Sampling interprets as increased uncertainty rather than pure corruption. Under event-level privacy (protecting individual interactions) with stochastic contexts, we prove that the privacy cost is only $O(\sqrt{d}\,\log T/\sqrt{\rho})$, logarithmic in $T$, because parallel composition amortizes noise across batches. Additionally, we explore privacy amplification via Poisson subsampling, which can reduce effective noise at stringent privacy budgets. Experiments on synthetic and real-world datasets demonstrate: (1) AdaPrivate-TS achieves 93-99% of non-private performance at $\varepsilon \in [0.5, 5]$, outperforming UCB by 0.5-3.7% and up to 18% with tuned adaptive exploration at extreme $\varepsilon$; (2) privacy amplification provides additional 2-5% gains at low $\varepsilon$; (3) on MovieLens and Jester, AdaPrivate-TS achieves the best overall performance among event-level baselines, dominating at $\varepsilon \geq 2$; (4) under DP-SVD private features, TS’s advantage over UCB grows to +11%, confirming noise-as-uncertainty is not limited to reward privacy. We provide rigorous proofs for privacy guarantees under interactive zCDP composition and comprehensive evaluation including convergence curves, 12-seed CIs, and DP-SVD feature ablation.
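The sampling covariance quoted in this abstract can be recovered in a few lines under the usual linear Thompson Sampling setup. The derivation below is a sketch under assumptions not stated in the abstract: $A$ is the symmetric regularized design matrix, $b$ is the reward-weighted context sum, the noise $\eta$ is added to $b$ only, and $\eta$ is independent of the sampling noise $z$. The symbols $\hat\theta$, $\tilde\theta$, $\eta$, and $z$ are introduced here for illustration.

```latex
\begin{align*}
\hat\theta &= A^{-1}(b + \eta), \qquad \eta \sim N(0, \sigma^2 I), \\
\tilde\theta &= \hat\theta + v A^{-1/2} z, \qquad z \sim N(0, I), \\
\operatorname{Cov}(\tilde\theta \mid A, b)
  &= A^{-1}\,(\sigma^2 I)\,A^{-\top} + v^2 A^{-1/2} A^{-1/2} \\
  &= v^2 A^{-1} + \sigma^2 A^{-2}.
\end{align*}
```

The last step uses the symmetry of $A$. The privacy noise therefore contributes the additive term $\sigma^2 A^{-2}$, which widens the sampling distribution in the same way as extra posterior uncertainty.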

[LG-129] Embedding Linear Equality Constraints in Probabilistic Neural Networks for Dynamic Modelling

Link: https://arxiv.org/abs/2606.21728
Authors: Matthew Marsh, Benoit Chachuat, Antonio del Rio Chanona
Subjects: Machine Learning (cs.LG)

Abstract:Machine learning models are increasingly used to model chemical process systems, yet they often lack principled uncertainty quantification and mechanisms to enforce physical constraints. We propose a probabilistic neural network framework that guarantees satisfaction of linear equality constraints within a given tolerance, while capturing aleatoric uncertainty. Compared to state-of-the-art methods, our formulation demonstrates improved predictive accuracy, uncertainty calibration, and adherence to constraints on reduced data. It also demonstrates competitive performance, but with significantly faster training times when evaluated on large data regimes. We evaluated this on two batch reactor case studies, enforcing mass balances.

[LG-130] BatchGen: An Architecture for Scalable and Efficient Batch Inference

Link: https://arxiv.org/abs/2606.21712
Authors: Tairan Xu, Leyang Xue, Zhan Lu, Jinfu Deng, Hongyang Xiao, Yinsicheng Jiang, Congjie He, Matej Sandor, Le Xu, Luo Mai
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract: Batch inference has become a central mode of AI computation, yet existing inference engines still rely on execution models designed for interactive serving. When scaled to millions of sequences, batch workloads reveal two fundamental requirements: the ability to handle extreme inter- and intra-sequence load variation that emerges only at runtime, and the ability to sustain high utilization across large fleets of GPUs. Existing systems fail to meet these requirements, losing substantial fractions of achievable throughput. We introduce a new architectural foundation for batch inference: the sequence coroutine compute model, which represents each sequence as a fine-grained, event-driven coroutine. This model exposes expressive primitives that allow the runtime to reorganize work dynamically, enabling larger expert-level batches, mitigating stragglers, reallocating work across devices, and maintaining utilization even on cost-effective or memory-constrained GPUs. Building on this abstraction, we implement BatchGen, a production-ready system that uses the coroutine model at cluster scale. On a 128-GPU cluster, BatchGen reduces batch completion time by up to $2.3\times$, and on memory-constrained accelerators it outperforms the strongest offloading baseline by up to $9.6\times$. We will open-source BatchGen at this https URL.

[LG-131] Expressivity Saturation: Reduced Affine Region Usage Under Increasing Task Complexity

Link: https://arxiv.org/abs/2606.21687
Authors: Xuan Qi, Yi Wei, Fanqi Yu, Manuel Lecha
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Piecewise-affine neural networks (e.g., with ReLU or LeakyReLU activations) implement continuous piecewise-affine maps, and the number of affine regions provides a natural proxy for expressive capacity. However, the gap between theoretical region capacity and the affine regions realized after training remains insufficiently understood. We study this gap from two complementary perspectives. First, we give a rigorous, architecture-dependent theorem for affine line-segment probes: for multilayer perceptrons with piecewise-affine activations, the number of affine pieces realized along an affine line-segment probe is upper bounded by an explicit product of layer-wise width terms (and activation breakpoint factors). This yields a neuron-threshold lower bound for representing target functions with prescribed one-dimensional piece complexity, formalizing the minimal region budget required for complex signals. Second, we exactly enumerate affine regions realized within bounded 2D and higher-dimensional domains under controlled task complexity. Under fixed architectures and training protocols, increasing input–label complexity yields trained solutions with markedly fewer realized regions in the evaluation domain, even though worst-case architectural capacity is unchanged; we call this reduced region usage expressivity saturation. Moreover, in the most challenging regimes, 2D visualizations show that region-usage collapse often coincides with degraded decision boundaries. Finally, we visualize the training dynamics of affine-region partitions and decision boundaries, revealing a consistent refinement process during optimization.
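
The line-segment probe described in the abstract above can be illustrated with a rough sketch: walk along a segment, record the ReLU activation pattern at each sample, and count how often it changes. This is a sampling-based approximation, not the paper's theorem or its exact enumeration procedure. The function names, the `(weights, biases)` layer layout, and the sample count are all assumptions made for illustration.

```python
def activation_pattern(layers, x):
    """Return the on/off pattern of every ReLU unit for input x.

    `layers` is a list of (weights, biases) pairs for hidden ReLU layers
    (hypothetical layout: weights[j] is the weight row of unit j).
    """
    pattern = []
    hidden = list(x)
    for weights, biases in layers:
        pre = [sum(w * v for w, v in zip(row, hidden)) + b
               for row, b in zip(weights, biases)]
        pattern.append(tuple(p > 0 for p in pre))
        hidden = [max(p, 0.0) for p in pre]
    return tuple(pattern)


def count_pieces_on_segment(layers, start, end, samples=1001):
    """Count runs of constant activation pattern along the segment start -> end.

    Sampling can miss very short pieces, so treat the result as an estimate.
    """
    pieces = 0
    previous = None
    for k in range(samples):
        t = k / (samples - 1)
        x = [s + t * (e - s) for s, e in zip(start, end)]
        pattern = activation_pattern(layers, x)
        if pattern != previous:
            pieces += 1
            previous = pattern
    return pieces


# One hidden layer with three units that switch on at x = 0, 1 and 2.
toy_layers = [([[1.0], [1.0], [1.0]], [0.0, -1.0, -2.0])]
pieces = count_pieces_on_segment(toy_layers, [-1.0], [3.0])
```

For this toy network the probe crosses three breakpoints, so `pieces` comes out as 4. Two different activation patterns can still share the same affine map, so the count is an upper estimate of distinct affine pieces among the sampled points.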

[LG-132] A Framework for Directed Acyclic Hypergraph Learning

Link: https://arxiv.org/abs/2606.21668
Authors: Zhiyuan Dong, Carlos Mundo-Levano, Wei Qian, Daniel Lau, Gonzalo R. Arce
Categories: Machine Learning (cs.LG)
Comments: 3 pages. Accepted and presented as an oral presentation at the 9th Graph Signal Processing Workshop (GSP 2026), June 8-10, 2026, Madrid, Spain

Click to view abstract

Abstract:Continuous optimization methods for learning Directed Acyclic Graphs (DAGs) operate on weighted adjacency matrices and are therefore limited to pairwise causal relationships. We propose a framework for learning Directed Acyclic Hypergraphs (DAHGs) from observational data, capturing joint parental influences that pairwise models cannot represent. Our approach rests on three components: (i) a generalized linear structural equation model (SEM) with multiplicative interaction terms whose non-zero weights correspond one-to-one with directed hyperedges; (ii) a weighted adjacency tensor representation whose acyclicity is characterized via nilpotency under the tensor t-product; and (iii) a differentiable acyclicity constraint derived through the Fourier decomposition of the t-product, which reduces tensor nilpotency to slice-wise matrix nilpotency and enables least-squares learning via the augmented Lagrangian method.

[LG-133] A new classification method based on Minimum Spanning Trees

Link: https://arxiv.org/abs/2606.21639
Authors: Julio González-Díaz, Beatriz Pateiro-López, Iria Rodríguez-Acevedo
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Minimum Spanning Trees have been used in unsupervised learning, particularly in clustering tasks, due to their ability to recognize clusters by removing edges that are considered inconsistent in defining those clusters. This paper aims to study the use of Minimum Spanning Trees in supervised learning. Specifically, we propose a classification algorithm based on Minimum Spanning Trees. To improve its performance, we introduce a robust version of the method that is also computationally more efficient. We evaluate the effectiveness of our proposed method through an extensive simulation study. We also apply the proposed methodology to a real-world case study involving aircraft trajectories.

[LG-134] HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval

Link: https://arxiv.org/abs/2606.21633
Authors: Omin Kwon, Doyeon Kim, Jongseok Park, Seung Yul Lee, Ion Stoica, Jae W. Lee
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Diffusion LLMs (dLLMs) improve GPU utilization over autoregressive decoding by generating multiple tokens per forward pass, but their KV cache still grows linearly with context, limiting throughput at long contexts. KV cache offloading to host DRAM alleviates this memory pressure, but the limited PCIe bandwidth necessitates recalling only a sparse subset of KV entries. In block dLLMs, the relevant KV entries remain consistent across denoising steps within a block, enabling high-accuracy selection by identifying the top-k entries once and reusing them throughout all denoising steps. This property appears attractive for offloading as it amortizes the selection overhead across the entire block, but it requires exact attention over the full KV cache, which is too expensive under offloading. We present HERALD, a KV offloading system for block dLLMs that resolves this through two opportunities that reduce the required selection compute by a factor of the block size and enable selection to be overlapped with denoising. Across three block dLLMs and five long context tasks, HERALD achieves near-lossless accuracy at 5-10% KV budget and up to 1.59x lower per block latency and 2.47x higher throughput over GPU-only inference, with speedups growing with context length.

[LG-135] The Alignment Problem in Constrained Code Generation

Link: https://arxiv.org/abs/2606.21619
Authors: Matteo Biagiola, Jahrim Gabriele Cesario, Luca Di Grazia, George Zakhour, Guido Salvaneschi
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in code generation, but their outputs frequently contain syntax or type errors that result in compilation failures. Constrained decoding has been proposed as a solution to mitigate compilation errors by construction, improving functional correctness as a byproduct. However, previous works overlook a critical aspect of constrained decoding: the alignment between constrainer (e.g., types), language model and the target specification language (e.g., TypeScript). Misalignment is caused by the constrainer being incomplete (rejecting programs that belong to the target) or unsound (allowing programs that are not part of the target). The bias created by incompleteness distorts the language model distribution, and can be detrimental for code generation. We evaluate this hypothesis using seven language models, two target languages, two constrainers, enforcing types and syntax during decoding, and we study how language models react to varying levels of incompleteness. On three benchmarks, when the constrainer is incomplete, unconstrained decoding significantly outperforms constrained decoding in terms of functional correctness. Incompleteness pushes the model into low-probability regions of the program space, causing the generation to frequently time out, and reducing functional correctness by up to 97%. These contributions make the community aware of the negative effects of misalignment in constrained decoding, and provide quantitative insights on how to design constrainers that are beneficial for code generation systems with formal guarantees.

[LG-136] Learning to Place Guards by Reinforcement: A Geo-Free Neural Policy for the Vertex-Guard Art Gallery Problem

Link: https://arxiv.org/abs/2606.21604
Authors: Domagoj Ševerdija, Jurica Maltar, Nathan Chappel, Domagoj Matijević
Categories: Machine Learning (cs.LG); Computational Geometry (cs.CG)
Comments: 29 pages, 8 figures

Click to view abstract

Abstract:Neural combinatorial optimization (NCO) has shown that policies trained by reinforcement can construct strong solutions to NP-hard problems directly from raw instances. What such a policy actually learns, as opposed to what its decoder expresses, remains much less clear. We study this distinction on the vertex-guard Art Gallery Problem, the NP-hard task of choosing polygon vertices from which to observe an entire region. A pointer-network policy is trained from a coverage-aware reward over its own rollouts under the constraint we call geo-free inference: at test time it sees only vertex coordinates, with no visibility computation and no geometric oracle. The policy places guards economically but leaves a tail of under-covered polygons that widens far beyond the training range. To locate the cause, we freeze the trained encoder and read its embeddings with a small single-shot classifier, still geo-free at inference. The classifier closes most of the feasibility gap, in and out of distribution and at up to roughly five times the training range, cutting under-covered polygons by about an order of magnitude at an explicitly reported cost in guard count. We read this as evidence that the reinforcement-trained representation already encodes the geometry required for feasibility, and that residual failures reflect decoder calibration rather than missing knowledge. Probing a frozen encoder thus offers a practical way to ask what a neural combinatorial solver has internalized.

[LG-137] Geometric and Information Compression of Representations in Deep Learning ECML KDD2026

Link: https://arxiv.org/abs/2606.21593
Authors: Linara Adilova, Henning Petzka, Asja Fischer, Bernhard C. Geiger
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: Published at ECML PKDD 2026; Dataset: this https URL. Code: this https URL

Click to view abstract

Abstract:Deep neural networks transform input data into latent representations that support a wide range of downstream tasks. These representations can be characterized along information-theoretic and geometric dimensions, but their relationship remains poorly understood. A central open question is whether low mutual information (MI) between inputs and representations necessarily implies geometrically compressed latent spaces and vice versa. We investigate this question using class-wise clustering as a measure of geometric compression and theoretically sound MI estimation in conditional entropy bottleneck (CEB) networks and continuous dropout networks. We evaluate the interplay between MI, geometric compression, and generalization on classification tasks under controlled noise injection schemes. Our findings show that low MI does not reliably correspond to geometric compression, and that the connection between the two is more nuanced than often assumed. Indeed, our experiments reveal a negative and nonlinear relationship that can reverse when varying training setup. Our results put forward a hypothesis that generalization acts as a potential confounder in this connection rather than being their direct consequence.

[LG-138] The Cost Geometry of Belief: finite-resource inference under noisy observation

Link: https://arxiv.org/abs/2606.21585
Authors: Laurent Caraffa
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Differential Geometry (math.DG); Statistics Theory (math.ST)
Comments: 21 pages

Click to view abstract

Abstract:We equip the space of beliefs with a cost geometry (what it costs to pass from one belief to another): optimal transport in Wasserstein space, reweighted conformally by Fisher information (the price of the precision at stake), distinct from the Fisher-Rao metric. In the setting we consider, a finite machine maintains a digital twin of a system; observing the territory through finite, noisy sensors, we model its coherent output as a belief: a probability density over states, the Bayes posterior. Certainty (the perfect twin) is denied twice, by observation and by physics, both read off the Fisher information. On the conformal class, essentially location-scale, three results emerge, all invariants of one change of cost unit. A wall: a well-posed inference rejects certainty to infinite distance as soon as the cost dominates the Fisher information (necessity conjectured beyond power laws). An honesty: an honest (eikonal) cost, each nat the same length everywhere, selects the geometries proportional to the Fisher information. A rigidity: these geometries are hyperbolic, and the Stam bound crowns the Gaussian, the most hyperbolic location-scale belief. Changing the unit dilates the geometry yet preserves the wall, the curvature ordering, and the extremality of the Gaussian: an absolute cost says nothing, only relative cost carries meaning, the value -1/4 being one of its images. The cost of reaching a given precision then has a geometric floor diverging at certainty. Thermodynamics fixes the cost unit and motivates this framework; the results are geometric, in nats.

[LG-139] LIG: Layer-wise Integrated Gradients for Within-Layer Flow Analysis in Transformers

Link: https://arxiv.org/abs/2606.21564
Authors: Eight Suzuki, Hideitsu Hino, Noboru Murata
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 4 figures, 1 table. cs.LG. Experiments on BERT-base and PTB. Code: this https URL

Click to view abstract

Abstract:Transformers achieve strong performance, but their internal computations remain opaque. We view each Transformer layer as a dynamic graph whose nodes are token representations and per-head attention outputs, with Multi-Head Attention (ATT) and MLP as module boundaries. On this graph we use LIG (Layer-wise Integrated Gradients), which applies set-to-set Integrated Gradients (IG) at nonlinear module boundaries. Set-to-set IG applies IG to a map from a set of input token representations to a set of output representations, evaluating token-to-token contributions, which is not standard in prior IG applications. This extends IG from the usual scalar-objective setting to set-to-set maps via an L2 scalarization, and composes within-layer contributions in the spirit of Layer-wise Relevance Propagation (LRP), with IG completeness playing the role of LRP-style conservation at each boundary. We use LIG to analyze (i) the agreement between module-wise composition and layer-whole attribution under an L2 criterion, and (ii) within-layer information flow by tracing separated ATT and MLP contributions. On BERT-base and PTB, configurations that best preserved within-layer consistency used the target token’s embedding as the ATT baseline and either the ATT output at a=0 or Zero as the MLP baseline. We therefore present LIG as a diagnostic XAI tool at module-boundary granularity, without model-specific retraining or per-operation interpreter design. Code is available at this https URL.
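
The IG completeness property that the abstract above relies on can be checked on a small example. The sketch below implements standard Integrated Gradients for a scalar function using finite-difference gradients and a midpoint Riemann sum. It is not the paper's set-to-set LIG variant. `toy_model`, the step count, and the zero baseline are assumptions chosen for illustration.

```python
import math


def numeric_grad(f, x, h=1e-5):
    """Central finite-difference gradient of scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Approximate IG attributions along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numeric_grad(f, point)
        for i in range(len(x)):
            total[i] += grad[i]
    # Scale the averaged gradient by the input difference, per coordinate.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_model(v):
    """Hypothetical smooth scalar model used only for this demonstration."""
    return math.tanh(v[0] * v[1]) + 0.5 * v[2] ** 2


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
# Completeness: attributions should sum to f(x) - f(baseline).
gap = abs(sum(attributions) - (toy_model(x) - toy_model(baseline)))
```

Here `gap` should be close to zero up to numerical integration error, which is the conservation-style property the abstract compares to LRP.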

[LG-140] Towards Understanding the Power and Limits of the Muon Optimizer: A River-Valley Perspective

Link: https://arxiv.org/abs/2606.21514
Authors: Tianqi Shen, Jinji Yang, Runze Shi, Jianhao Ma, Jiaye Teng, Ziye Ma
Categories: Machine Learning (cs.LG)
Comments: 44 pages, 13 figures, 2 tables

Click to view abstract

Abstract:Recently, Muon has gained substantial attention as an appealing alternative to Adam-like optimizers, with many works highlighting its advantages through spectral normalization and improved conditioning. Yet this positive theoretical narrative contrasts with its empirical performance in large language model (LLM) training, where Muon’s gains over Adam/AdamW are often mixed, schedule-sensitive, and not uniformly superior. To address this gap, we develop a trajectory-level theory characterizing both the strengths and limitations of Muon. We introduce a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, capturing a mixture of anisotropic structure and long-tail information reminiscent of LLM training. On top of it, we adopted a river-valley perspective in which we view the landscape as composed of a river direction flowing to the desired solution and hill directions encoding nuisance or task-irrelevant information. In the momentum-free setting, we show that Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. We then extend the river-valley perspective to general nonconvex objectives with momentum by studying points on the spectral river. There, while Muon converges faster early on, its orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution. Together, these results suggest that our characterizations extend beyond spiked matrix sensing and motivate switching to GD-like refinement optimizers in the final phase, rather than relying only on a fixed learning-rate schedule for Muon. We also provide preliminary evidence supporting this two-stage approach in language model training experiments.

[LG-141] Privacy-Preserving Federated Temporal Graph Learning with Digital Twin–Guided Adaptive Deception for Cyber-Resilient IoMT

Link: https://arxiv.org/abs/2606.21513
Authors: Syed Zeeshan Haider, Anwar Shah, Muneeb Arif, Hamza Iftikhar, Waqas Ali
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The rapid proliferation of IoT and IoMT devices introduces critical cybersecurity vulnerabilities in healthcare and industrial environments where resource-constrained devices operate under strict latency and data-privacy regulations. This paper presents the Federated Temporal Graph Convolutional Network with Advantage Actor-Critic (Federated TGCN-A2C), a privacy-preserving defense architecture integrating four mechanisms: a PyG-based Temporal GCN using GCNConv layers with global mean pooling and a learned anomaly gate for flow-level threat classification; LSTM-based Digital Twins generating per-device anomaly scores gating the classifier via learned sigmoid coupling; a Federated A2C agent selecting among ALLOW, ISOLATE, and HONEYPOT-REDIRECT actions based on a seven-dimensional state capturing confidence, entropy, anomaly magnitude, and traffic composition; and an enhanced honeypot layer converting suspicious traffic into threat intelligence with adaptive thresholds. Federated aggregation employs EMA-smoothed per-client validation losses as inverse-weighted FedAvg coefficients to stabilize global model updates under non-IID distributions, with cosine-annealed learning rates per round. Evaluated on CICDDoS 2019 and TON-IoT benchmarks, the framework achieves 99.48% and 99.61% test accuracy with weighted-F1 scores of 0.9948 and 0.9961, converging within 25 and 10 federated rounds, outperforming Fed-Inforce-Fusion by 0.21 percentage points while covering three additional attack categories. All sixteen CICDDoS 2019 classes achieve F1 of at least 0.9237 and all ten TON-IoT classes achieve F1 of at least 0.9488, including the severely imbalanced MITM category. Post-hoc explainability via SHAP, LIME, Grad-CAM, and counterfactual analysis confirms decisions are grounded in semantically meaningful flow features, supporting regulatory accountability in clinical deployments.
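
The aggregation rule mentioned in the abstract above (EMA-smoothed per-client validation losses used as inverse-weighted FedAvg coefficients) can be sketched as follows. The abstract gives no formulas, so the smoothing factor `beta`, the `eps` guard, the normalization, and all function names are assumptions, not the authors' implementation.

```python
def ema_update(previous, current, beta=0.9):
    """Exponential moving average; the first observation initializes the state."""
    if previous is None:
        return current
    return beta * previous + (1.0 - beta) * current


def inverse_loss_weights(smoothed_losses, eps=1e-8):
    """Turn smoothed losses into coefficients that sum to 1.

    Clients with lower validation loss receive larger weights.
    """
    inverse = [1.0 / (loss + eps) for loss in smoothed_losses]
    total = sum(inverse)
    return [value / total for value in inverse]


def aggregate(client_params, weights):
    """Weighted average of flat parameter lists, one list per client."""
    dim = len(client_params[0])
    return [sum(w * params[i] for w, params in zip(weights, client_params))
            for i in range(dim)]


# Two hypothetical clients with smoothed validation losses 1.0 and 3.0.
weights = inverse_loss_weights([1.0, 3.0])
global_params = aggregate([[0.0, 4.0], [4.0, 0.0]], weights)
```

In this example the client with the lower loss gets roughly three quarters of the weight, so the global parameters sit closer to its update.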

[LG-142] Robustness Cannot be Reduced to Regularization: Studying Adversarial Training Beyond the Linear Case

Link: https://arxiv.org/abs/2606.21488
Authors: David A. R. Robin (LAMSADE, Dauphine), Rafael Pinot (LPSM, Jussieu), Yann Chevaleyre (LAMSADE, Dauphine)
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The vulnerability of ML models to adversarial examples has recently emerged as a major concern. While adversarial training is one of the most effective countermeasures to this issue, its high computational cost remains an obstacle to practical deployment. Recent progress in reducing this cost has relied, in the case of linear models, on a formal equivalence between the adversarial risk and a simpler form of regularized risk. This enabled significantly more efficient training procedures, which naturally raises the question of whether such an equivalence can be extended beyond linear models. In this work, we formally show that no such equivalence is possible for two-layer networks. Our proofs proceed via a reduction to key properties that fundamentally separate the adversarial risk from any simple regularized risk which would only exhibit a weak form of data dependence. Beyond this setting, we provide empirical evidence on Wide-ResNets indicating that the same type of impossibility persists in deeper and more expressive architectures.

[LG-143] Deep Learning for Soil Moisture Estimation: Fusing Satellite Data with Optimally-Lagged Meteorological Features

Link: https://arxiv.org/abs/2606.21475
Authors: Adrian Canovas-Rodriguez, Aurora González Vidal, Antonio F. Skarmeta
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurate soil moisture estimation in semi-arid agricultural regions requires integrating remote sensing and meteorological information while accounting for the delayed response of soil moisture to atmospheric forcing. This study introduces a Cross-Correlation Function (CCF) methodology to determine optimal temporal lags (0-30 days) between meteorological variables and soil moisture, as well as inter-depth lags (0-15 days) describing vertical moisture propagation from the surface (10 cm) to deeper layers (20-50 cm). The approach was validated across seven agricultural plots in southeastern Spain. Three deep learning architectures, each targeting a distinct prediction granularity, were evaluated under five feature configurations ranging from satellite-only to full satellite-meteorology-depth fusion: a CNN for per-pixel estimation within each plot, an LSTM for frame-level (daily plot-mean) prediction, and a CNN-LSTM hybrid operating on sliding windows with pooled multi-patch training. Models were assessed on held-out data to measure genuine generalisation. Meteorological variables improved performance over the satellite-only baseline, while subsurface depth information proved decisive across all architectures. The per-pixel CNN achieved the strongest single-patch result (R^2 = 0.877, RMSE = 2.28), with a seven-patch average R^2 of 0.535, representing an improvement of +1.00 over the satellite-only baseline. The pooled CNN-LSTM hybrid obtained the highest overall performance (R^2 = 0.930, CVRMSE = 8.0%). These results demonstrate that explicitly modelling atmospheric and vertical subsurface delays substantially improves soil moisture estimation for precision agriculture.
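
The lag-selection idea in the abstract above can be sketched as a search over candidate lags for the one with the strongest cross-correlation between a driver series and a delayed response. This is a minimal sketch that assumes Pearson correlation and an absolute-value selection criterion. The function names and the synthetic series are hypothetical, and the paper's exact CCF procedure may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(driver, response, max_lag=30):
    """Return (lag, r) maximizing |corr(driver[t], response[t + lag])| for lag in 0..max_lag."""
    best = (0, 0.0)
    n = len(driver)
    for lag in range(max_lag + 1):
        if n - lag < 3:
            break  # too few overlapping points to correlate
        r = pearson(driver[:n - lag], response[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Synthetic check: the response is the driver delayed by exactly 5 steps.
driver = [((t * 37) % 101) / 101.0 for t in range(200)]
response = [0.0] * 5 + driver[:-5]
lag, r = best_lag(driver, response)
```

On this synthetic pair the search recovers the 5-step delay. With real meteorological data the peak is usually much less sharp, so the chosen lag should be read alongside the full correlation profile.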

[LG-144] Post-Training Speech Enhancement Language Models with Perceptual Rewards INTERSPEECH2026

Link: https://arxiv.org/abs/2606.21458
Authors: Frédéric Berdoz, Luca A. Lanzendörfer, Antonis Asonitis, Roger Wattenhofer
Categories: Machine Learning (cs.LG)
Comments: Accepted at Interspeech 2026

Click to view abstract

Abstract:Speech enhancement language models achieve strong results when trained on discrete audio tokens, but their optimization relies on token-level cross-entropy rather than the perceptual metrics used for evaluation. We introduce a post-training stage for autoregressive speech enhancement language models using Group Sequence Policy Optimization (GSPO) with multi-metric perceptual rewards. Our method directly optimizes non-differentiable quality metrics (DNSMOS, WER, and UTMOS) as reward signals, without learned surrogates or offline preference pairs. Applied to two autoregressive base models, UniSE and GenSE, our approach achieves state-of-the-art results on the DNS2020 benchmark. A human evaluation ablation further shows that the composite multi-metric reward is preferred over any single-metric variant, confirming that multi-reward optimization avoids the reward hacking observed with single-metric training.

[LG-145] Fast-TurboQuant: A Multiplier-Free Online Vector Quantization Approach

Link: https://arxiv.org/abs/2606.21448
Authors: Pedro M. R. Pereira, Felipe A. P. de Figueiredo, Rausley A. A. de Souza
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:As large language models scale, memory bandwidth for key-value caches and retrieval-augmented generation systems becomes a critical bottleneck. While 1-bit quantization addresses this constraint, recent TurboQuant relies on dense random rotation matrices to condition the vector distribution before quantization. This projection demands millions of floating-point multiplications per embedding, making it difficult to deploy on constrained edge silicon. We introduce Fast-TurboQuant, a multiplier-free projection architecture that replaces the dense matrix with a structured fast Johnson-Lindenstrauss transform. By applying a Rademacher phase inversion followed by a fast Walsh-Hadamard transform (FWHT), the method leverages sub-Gaussian concentration to satisfy the prerequisites of scalar Lloyd-Max quantization without Gaussian projections. This substitution reduces the arithmetic complexity to only additions, eliminating hardware multipliers. Evaluation on DBpedia OpenAI-3 Large embeddings demonstrates a 19.7 times algorithmic speedup under sequential execution. Furthermore, the dimension expansion due to the FWHT zero-padding reduces the mean squared error and improves Recall@10.
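
The projection described in the abstract above (a Rademacher sign flip followed by a fast Walsh-Hadamard transform) can be sketched in a few lines. The final sign-bit step stands in for a 1-bit scalar quantizer and is an assumption, as are the function names, the seed handling, and the zero-padding helper. This is not the authors' implementation.

```python
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform using only additions and subtractions."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def pad_to_power_of_two(vec):
    """Zero-pad vec up to the next power-of-two length."""
    n = 1
    while n < len(vec):
        n *= 2
    return list(vec) + [0.0] * (n - len(vec))


def make_signs(n, seed=0):
    """Draw a fixed Rademacher (+1/-1) sign vector."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n)]


def project_and_sign(vec, signs):
    """Sign-flip, transform, then keep one bit per coordinate (assumed quantizer)."""
    padded = pad_to_power_of_two(vec)
    if len(signs) != len(padded):
        raise ValueError("signs must match the padded length")
    flipped = [x if s > 0 else -x for x, s in zip(padded, signs)]
    rotated = fwht(flipped)
    return [1 if r >= 0 else 0 for r in rotated]


code = project_and_sign([0.5, -1.0, 2.0], make_signs(4, seed=1))
```

The sign flip is a negation and the transform uses butterflies of additions and subtractions, so no multiplications are needed. Applying `fwht` twice returns the input scaled by its length, which is a quick way to sanity-check the transform.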

[LG-146] Fusing Backdoors, Machine Learning, and Optimization for Large-Scale Parametric Mixed-Integer Programs

Link: https://arxiv.org/abs/2606.21440
Authors: El Mehdi Er Raqabi, Pascal Van Hentenryck
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Large-scale optimization problems are often solved repeatedly under similar structural conditions, leading to substantial computational overhead. This occurs in applications such as power systems, transportation, and supply chain networks, where the underlying structure is fixed while parameters frequently vary under perturbations. This paper proposes a Learning to Optimize (LTO) framework that accelerates the solution of large-scale general mixed-integer problems by leveraging the concept of a backdoor, i.e., a subset of variables that drive most of the computational complexity. The proposed BIPC framework consists of three phases. Phase I is an identification procedure that discovers a backdoor for a set of instances in the distribution. Phase II uses supervised learning to develop machine learning models that, given an instance, predict values for bounded-domain backdoor variables and intervals for wide-domain backdoor variables. These predictions define a reduced optimization problem where the predictions constrain the backdoor variables, while the other variables remain free. Phase III optimizes this reduced problem and, if necessary, applies a correction step to restore feasibility or the optimality guarantees. Experiments on real-world, large-scale problems show substantial reductions in solution time with only a limited loss in solution quality. The framework enables organizations to solve large-scale optimization problems efficiently in the presence of frequent perturbations, such as unexpected events, demand fluctuations, or operational changes. Because these changes affect parameters rather than the problem structure, BIPC can quickly provide high-quality, feasible solutions, offering a practical approach to integrating machine learning into existing optimization pipelines. 

[LG-147] Universal Encoders for Modular Relational Deep Learning ECML KDD2026

Link: https://arxiv.org/abs/2606.21434
Authors: Jakub Peleška, Gustav Šír
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: Accepted to ECML PKDD 2026 in Naples, Italy

Click to view abstract

Abstract:Relational Deep Learning (RDL) models multi-tabular databases as temporal heterogeneous graphs for end-to-end representation learning. While RDL is evolving rapidly, existing approaches face significant generalization obstacles. They are either schema-specific, requiring training from scratch for every new database, or they rely on monolithic architectures that entangle feature encoding with graph message-passing. Analyzing these limitations, we establish four core pillars for building foundational relational models: semantic granularity, structural topology, temporal causality, and unified optimization. Addressing these pillars, we propose a modular approach that decouples row encoding from graph message-passing. We introduce the Universal Row Encoder, a transformer-based module that integrates raw cell data with schema metadata - including column semantics, table names, and global distribution statistics - to produce table-width invariant row embeddings. By explicitly feeding global statistics to an intra-row self-attention mechanism, the encoder natively contextualizes unseen features and handles sparse data. Serving as a flexible “backend” for any downstream graph architecture, our pretrained encoder enhances cross-database knowledge transfer on the established RelBench benchmarks while improving learning convergence and memory footprint.

[LG-148] Federated Temporal Attention Intelligence for Cyber-Resilient IoMT: Lightweight Digital Twins and PPO-Driven Honeypot Deception

Link: https://arxiv.org/abs/2606.21422
Authors: Syed Zeeshan Haider, Anwar Shah, Muneeb Arif, Hamza Iftikhar, Waqas Ali
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The rapid proliferation of Internet of Medical Things (IoMT) devices introduces critical cybersecurity vulnerabilities in healthcare environments where resource-constrained medical devices operate under strict latency requirements and stringent data-privacy regulations. To address these challenges, this paper presents the Lightweight Digital Twin and Federated Reinforcement Learning (LDT-FRL) framework, a privacy-preserving defense architecture integrating four complementary mechanisms: a Temporal Attention Encoder (TAE) built on a GRU backbone with learned temporal self-attention for flow-level threat classification; lightweight LSTM-based Digital Twins trained on normal-class traffic to generate per-device anomaly scores that gate the TAE classifier through a learned sigmoid coupling; a Federated Proximal Policy Optimization (PPO) agent selecting among ALLOW, ISOLATE, and HONEYPOT_REDIRECT actions based on a seven-dimensional state; and an intelligent honeypot layer that converts redirected suspicious traffic into actionable threat intelligence. A federated aggregation strategy employing EMA-smoothed per-client validation losses as inverse-weighted FedAvg coefficients stabilizes global model updates under non-IID client distributions. Evaluated on CICDDoS 2019 and TON-IoT benchmarks, LDT-FRL achieves 99.66% and 99.95% test accuracy respectively, with macro-F1 scores of 0.9913 and 0.9995, converging 81% faster than the DTFL-CD baseline while attaining perfect F1=1.000 on the severely imbalanced MITM class. Explainability analysis via SHAP, LIME, Grad-CAM, and counterfactual methods confirms that the TAE focuses on semantically meaningful flow features, providing interpretable evidence for each defense decision.

[LG-149] One Size does not Fit All: Heterogeneous Latent Space Alignment for Unsupervised Domain Adaptation

Link: https://arxiv.org/abs/2606.21415
Authors: Evangelia Koskinioti, Yi Shen, Georgios Stamou, Michael M. Zavlanos
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Domain shift remains a major obstacle to the reliable deployment of machine learning models in high-stakes environments such as healthcare. While Domain adaptation aims to mitigate these effects, existing approaches suffer from limited expressiveness of latent representations and a reliance on handcrafted, static augmentations. In this work, we address these limitations by proposing a novel deep learning architecture for Unsupervised Domain Adaptation (UDA), specifically optimized for medical image segmentation. Our framework, ADualVUOT, integrates a dual-encoder Variational Autoencoder (VAE) with Continuous Normalizing Flows (CNFs) to increase modeling flexibility and posterior expressiveness. To achieve domain alignment, we leverage Unbalanced Optimal Transport (UOT) through the Gaussian-Gromov-Wasserstein (GGW) distance, which handles structural and topological discrepancies between domains. Furthermore, we incorporate an adversarial augmentation scheme to synthesize worst-case compositions, thus enhancing model robustness. Extensive experiments on medical imaging benchmarks show significant gains over prior OT-based approaches.

[LG-150] Atomistic Language Models Understand and Generate Materials

Link: https://arxiv.org/abs/2606.21395
Authors: Sathya Edamadaka, Krithik Ramesh, Ju Li, Rafael Gómez-Bombarelli
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Atomistic structure and natural language have long been modeled separately, with language models either calling atomistic models as tools or being fine-tuned on lossy textual encodings that discard atomistic information. We introduce Atomistic Language Models (ALMs) to pursue native multimodality, in which a single language backbone understands atomistic structures, generates materials from natural language, and optimizes crystal structures as instructed by text. By unifying a pretrained atomistic encoder, large language model, and denoising diffusion model through purely continuous projectors and staged training, ALMs achieve state-of-the-art results on crystal structure prediction and de novo generation. ALMs are enabled by a continuous bridge that maps language model embeddings directly into the steering space of atomistic diffusion, and are assisted by Text-to-Crystal Feynman-Kac (T2C-FK), a particle-based sampler that scores partial denoising trajectories to enforce stoichiometric targets at inference time. To evaluate the ability of ALMs to optimize and generate materials from natural-language prompts and 3D atom-coordinate inputs, we introduce ALM Bench, the first benchmark for text-conditioned crystal generation and optimization. Code, training data, and model weights will be released soon.

[LG-151] Enhancing Creativity in 3D Generative Design via a TRIZ-Inspired Text-to-CAD Framework

Link: https://arxiv.org/abs/2606.21378
Authors: Dongeon Lee, Leekyo Jeong, Soyoung Yoo, Sunwoong Yang, Namwoo Kang
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 6 figures

Abstract: Recent advances in large language models (LLMs) have demonstrated significant potential in supporting engineering design tasks, including computer-aided design (CAD) automation. However, most existing LLM-based 3D CAD generation approaches primarily focus on geometric precision and instruction-following performance, often overlooking the fundamental aspect of creative design exploration. This study presents a TRIZ-inspired text-to-CAD framework that leverages LLMs to generate high-quality, editable CAD models while systematically exploring creative design alternatives. The framework integrates the Theory of Inventive Problem Solving (TRIZ), which embeds deep human insights from extensive patent records, into LLM prompting strategies, enabling autonomous generation of innovative CAD variants that address technical contradictions. Through a comprehensive three-stage pipeline of design generation, enhancement, and optimization, the framework produces structurally diverse CAD models from well-crafted prompts. The present study implements and evaluates the first two stages, while positioning the design optimization stage as future work. A product design case study (chair) demonstrates that the TRIZ-inspired text-to-CAD framework generates multiple creative design alternatives by systematically applying TRIZ inventive principles such as segmentation, anti-weight, dynamics, and composite materials, achieving 4.0-14.7% mass reduction across all enhanced designs while maintaining structural integrity. The key findings suggest that integrating systematic innovation methodologies with LLM-based 3D CAD generation bridges the gap between precision-focused synthesis and creativity-focused exploration, advancing toward autonomous design systems where AI makes design decisions independently, supporting human decision-making in human-AI collaborative design for engineering applications.

[LG-152] NAC: Neural Action Codec for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.21372
Authors: Ahad Jawaid, Yu Xiang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract: Vision-language-action (VLA) models rely on discrete action tokenizers to bridge continuous robot control and autoregressive sequence modeling, yet existing tokenizers often trade off between compression, latency, and downstream performance. We revisit this design through the lens of neural audio codecs: convolutional encoder-decoder architectures with residual vector quantization that serve as the standard front end for audio foundation models. Motivated by their success, we introduce the Neural Action Codec (NAC), which treats short robot action trajectories as multi-channel 1D signals and compresses them using a multi-scale RVQGAN architecture. We observe that audio-specific mel-spectrogram objectives are ill-suited for kinematic signals; however, by replacing them with simple time-domain and non-mel spectral reconstruction losses, audio-codec-style models can autoencode actions with high fidelity without substantial architectural changes. NAC provides a compact, ordered token space via offset codebooks, enabling standard autoregressive policies to operate over short, structured sequences. Meanwhile, a Vocos-style decoder with an ISTFT head and adversarial discriminators recovers smooth, detailed trajectories. Across LIBERO-10, RoboMimic, and a suite of real-world manipulation tasks, NAC achieves lower reconstruction error and higher success rates than binning, FAST, and prior VQ-based tokenizers at comparable or better compression rates. These results demonstrate that repurposed neural audio codecs offer a strong, practical backbone for learned action tokenization in modern VLAs.
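
Illustrative note (not taken from the paper): residual vector quantization, which the abstract names as the core of audio-codec-style tokenizers, can be sketched in a few lines. Each stage quantizes whatever the previous stages left over. The toy codebooks below are hand-picked assumptions; NAC learns its codebooks inside a convolutional encoder-decoder, which this sketch does not attempt to reproduce.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    def sq_dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks for a 2-D action vector.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],   # fine stage refines the residual
]
indices = rvq_encode([1.2, 0.8], codebooks)
decoded = rvq_decode(indices, codebooks)
```

The coarse stage picks `[1.0, 1.0]` and the fine stage corrects most of the remaining error, so adding stages trades more tokens for lower reconstruction error.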

[LG-153] Predictive Repair Management Using a Multi-Head Attention Transformer and Online Learning

Link: https://arxiv.org/abs/2606.21364
Authors: Xinyao Zhang, Willie Cade, Karl R. Haapala, Arun Natarajan, Sara Behdad
Categories: Machine Learning (cs.LG)

Abstract:Accurate prediction of repair duration is an important challenge in product maintenance due to its implications for resource allocation, customer satisfaction, and operational performance. This study aims to develop a deep learning framework to help fleet repair shops accurately categorize repair time given product historical data. The study uses an automobile repair and maintenance dataset and creates an end-to-end predictive framework by employing a multi-head attention network designed for tabular data. The developed framework combines categorical information, transformed through embeddings and attention mechanisms, with numerical historical data to facilitate integration and learning from diverse data features. A weighted loss function is introduced to overcome class imbalance issues in large datasets. Moreover, an online learning strategy is used for continuous incremental model updates to maintain predictive accuracy in evolving operational environments. Our empirical findings demonstrate that the multi-head attention mechanism extracts meaningful interactions between vehicle identifiers and repair types compared to a feed-forward neural network and a random forest model. Also, combining historical maintenance data with an online learning strategy facilitates real-time adjustments to changing patterns and increases the model’s predictive performance on new data. The model is tested on real-world repair data spanning 2013 to 2020 and achieves an accuracy of 78%, with attention weight analyses illustrating feature interactions.
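
Illustrative note (not taken from the paper): the abstract mentions a weighted loss to counter class imbalance but does not specify its form. One common choice, assumed here, is inverse-frequency class weights applied to cross-entropy. The class names `"short"` and `"long"` are hypothetical repair-time categories.

```python
import math


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes count more."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_cross_entropy(probabilities, labels, class_weights):
    """Weighted mean of -log p(true class).

    probabilities[i] is a dict mapping class name to predicted probability.
    """
    total = sum(
        class_weights[y] * -math.log(p[y]) for p, y in zip(probabilities, labels)
    )
    norm = sum(class_weights[y] for y in labels)
    return total / norm


# Hypothetical imbalanced batch: three short repairs, one long repair.
labels = ["short", "short", "short", "long"]
class_weights = inverse_frequency_weights(labels)
predictions = [{"short": 0.9, "long": 0.1}] * 3 + [{"short": 0.6, "long": 0.4}]
loss = weighted_cross_entropy(predictions, labels, class_weights)
```

With these weights the single misclassified long repair contributes as much to the loss as all three short repairs combined, which is the effect a weighted loss is meant to have.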

[LG-154] Urban Power Grid Topology and Hierarchy Identification from Open Data

Link: https://arxiv.org/abs/2606.21352
Authors: Shiliang Zhang, Sabita Maharjan
Categories: Machine Learning (cs.LG)

Abstract: Understanding the complex topology and hierarchy of urban power grids is crucial for energy prognosis, power flow management, and system resilience analysis. However, detailed grid information remains largely proprietary. This creates significant barriers for research and innovation, especially when analyzing the last-mile distribution networks connecting individual buildings. This paper addresses this challenge by developing an open-data-driven framework for the complete identification of urban power grid topology, from high-voltage transmission down to individual building connections. In particular, we fuse public infrastructure data (power lines, substations, transformers, poles) to map the high- and medium-voltage skeleton using graph-based algorithms. We then leverage geospatial machine learning on OpenStreetMap building data to group power demand clusters, and infer the physical topology of the final distribution lines linking the clustered buildings. We apply the developed framework to the district of Alna in Oslo, Norway, and we reconstruct the complete grid topology that connects 7,330 buildings and all major electricity infrastructure assets. With the research in this work, we provide a critical tool that facilitates power system analysis, e.g., power flow optimization, cascading failure simulation, and grid resilience against the penetration of distributed renewable generation.

[LG-155] A Reward-Petri-Net Interpretation of Temporal Behavior Trees

Link: https://arxiv.org/abs/2606.21350
Authors: Till Schmeil, Günther Waxenegger-Wilfing, Sebastian Schirmer
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 10 figures

Abstract: This paper introduces an interpretation of Temporal Behavior Trees (TBTs) as Reward-Petri-Nets (RPNs) for reinforcement learning (RL). Designing reward functions for complex, long-horizon robotic tasks is notoriously difficult, especially when tasks have hierarchical structure and temporal constraints. TBTs extend conventional behavior trees (BTs) used in robotic applications by incorporating temporal properties into their leaf nodes. This allows TBTs to represent not only the behavioral task structure defined by BT operators such as Sequence, Fallback, and Parallel, but also the task’s temporal constraints. In this work, the constraints are specified in the leaf nodes using Linear Temporal Logic. In order to inform RL rewards using TBTs, we provide a translation from TBT into a Petri Net (PN) and show how rewards can be automatically assigned based on the TBT’s structure, resulting in an RPN. In a series of increasingly challenging environments, we demonstrate how TBT-based rewards enable learning where vanilla RL fails, improve sample efficiency, and offer flexible, intuitive control over the learning progress. We showcase the learning impact by using different reward distribution schemes and TBT structures.

[LG-156] Direct Raw Audio Signal Processing via Reservoir Computing: An Investigation into Feature-Free Architectures

Link: https://arxiv.org/abs/2606.21335
Authors: Rinku Sebastian, Simon O Keefe, Martin A Trefzer
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:This paper evaluates Reservoir Computing (RC) as an autonomous, ‘feature-free’ framework for audio processing, designed to eliminate traditional, handcrafted feature extraction stages. We investigate whether the high-dimensional temporal dynamics inherent in a reservoir can function as a robust end-to-end processor for the direct classification of raw acoustic signals. By bypassing computationally intensive representations like MFCCs, this approach seeks to mitigate significant intellectual and pre-processing bottlenecks in traditional signal pipelines. Our study evaluates and compares shallow, sequential, and parallel deep reservoir architectures to determine their capacity for hierarchical feature representation. Experimental results demonstrate that the proposed parallel approach consistently outperforms shallow and sequential baselines while maintaining low model complexity. These findings highlight the potential of RC as an efficient and scalable alternative for time-domain audio processing, offering a promising pathway toward deployable, low-power acoustic systems with minimal preprocessing requirements.

[LG-157] MedTS-TTT: Test-Time Training for Medical Time Series Classification MICCAI2026

Link: https://arxiv.org/abs/2606.21329
Authors: Mingzhi Chen, Yiyu Gui, Guibo Luo
Categories: Machine Learning (cs.LG)
Comments: Accepted at MICCAI 2026

Abstract:Medical time series (MedTS) signals such as electroencephalography (EEG) and electrocardiography (ECG) support many clinical applications. However, substantial subject-level heterogeneity often induces subject-level distribution shift, causing a fixed parameter set to generalize poorly to unseen individuals. Compared with domain adaptation methods that often depend on extra adaptation components or target-batch statistics, Test-Time Training (TTT) provides a more practical solution for sequential clinical data by enabling online adaptation from unlabeled test samples. However, many representative TTT methods require iterative inner-loop optimization, increasing test-time overhead. In this paper, we propose MedTS-TTT, a test-time training framework for medical time series modeling. MedTS-TTT is built upon Closed-Loop Self-Alignment Test-Time Training (CLSA-TTT) and a Gated Convolutional Backbone (GCB). CLSA-TTT constructs a token-level self-supervised target and performs a single-step fast-weight update for intra-layer closed-loop alignment, enabling rapid sample-wise adaptation without iterative inner-loop optimization. GCB combines CLSA-TTT-based fast adaptation and token-level fusion with a gated convolutional branch to balance local dynamic modeling and information-flow control. On 4 public datasets (2 EEG and 2 ECG) with subject-independent splits, MedTS-TTT achieves 11 top-1 rankings out of 12 evaluations across 9 baselines and 3 metrics. The code is publicly available at this https URL.

[LG-158] Sea-Scan: High-Accuracy ML-based Dark Vessel Detection and Localisation via Weakly Supervised DAS Monitoring

Link: https://arxiv.org/abs/2606.21326
Authors: Tian Tian, Agastya Raj, Lara Flanagan, John Kennedy, Marco Ruffini
Categories: Sound (cs.SD); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: This paper is accepted for presentation at ECOC 2026

Abstract: We present an ML-based vessel detection and localization system, trained with weak supervision from imperfect AIS labels, that achieves a 97.8% detection rate at a 1.98% false-trigger rate and successfully identifies dark-vessel events from unlabeled data.

[LG-159] Distinguishing indistinguishable attractors: Unsupervised anomaly detection with reservoir computers

Link: https://arxiv.org/abs/2606.21322
Authors: Davide Prosperino, Haochun Ma, Christoph Räth
Categories: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Comments: 21 pages, 8 figures, 1 table

Abstract:Detecting when a nonlinear dynamical system departs from its normal regime is a recurring problem across the sciences, from cardiology to climate and energy systems. We show that a very simple Kolmogorov–Smirnov test on the output weights of a reservoir computer is highly sensitive to regime changes in nonlinear dynamical systems, including those invisible to both classical nonlinear measures and modern deep-learning detectors. The core idea of our algorithm is to treat the readout layer of a reservoir computer as a representation of the input dynamics. Since the input mapping and the reservoir itself are random and fixed, the trained output weights are the only object encoding the system at hand. We summarize this fingerprint by the empirical cumulative distribution function of the readout weights and compare it to a reference band built from the training data. This unsupervised, online detector distinguishes two visually indistinguishable butterfly-shaped attractors, resolves parameter drifts seven times smaller than the strongest deep-learning baseline, flags noise four orders of magnitude below the signal, and identifies ventricular flutter in a clinical ECG recording. More broadly, we aim to establish a perspective on reservoir computers in which the trained output weights are treated as a representation of the learned system in their own right, rather than merely as a means to forecasting.
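
Illustrative note (not taken from the paper): the abstract compares the empirical CDF of trained readout weights against a reference built from training data. A simplified stand-in, assumed here, is the two-sample Kolmogorov-Smirnov statistic between a reference weight sample and a newly trained one. The paper's reference band and decision threshold are not reproduced, and the weight values below are made up.

```python
import bisect


def ecdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max over x of |F_a(x) - F_b(x)|.

    Both ECDFs are right-continuous step functions, so the maximum is
    attained at one of the observed sample values.
    """
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


# Hypothetical readout weights from a reference regime and two test windows.
reference = [-0.2, -0.1, 0.0, 0.1, 0.2]
same_regime = [-0.2, -0.1, 0.0, 0.1, 0.2]
shifted_regime = [0.8, 0.9, 1.0, 1.1, 1.2]

d_same = ks_statistic(reference, same_regime)
d_shifted = ks_statistic(reference, shifted_regime)
```

A statistic near zero suggests the readout weights, and hence the learned dynamics, are unchanged; a large value flags a candidate regime change.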

[LG-160] Objective-Behavior Alignment: Diagnostics for MORL Policy Selection

Link: https://arxiv.org/abs/2606.21321
Authors: Antonio Mone, Zuzanna Osika, Florian Felten, Pradeep K. Murukannaiah, Mark Fuge, Frans A. Oliehoek, Luciano Cavalcante Siebert
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 41 figures

Abstract:Real-world decision-making often requires optimizing multiple competing objectives simultaneously. In reinforcement learning (RL), this is typically addressed by combining reward signals into a single scalar objective via a scalarization function, which can be fragile: small changes in the weights can induce drastically different policies. Multi-objective reinforcement learning (MORL) instead produces sets of policies that explicitly represent trade-offs between objectives. However, these policies are typically presented to the decision maker only through their value vectors, which can obscure substantial behavioral variation: policies that induce distinct trajectories may appear indistinguishable when evaluated solely by expected returns. We propose an exploratory diagnostic workflow that automatically highlights behavioral variation along the Pareto front that objective values alone do not reveal, providing both quantitative and visual tools to support policy inspection. We validate our approach on simple grid examples and scale it to continuous control benchmarks, demonstrating that it remains effective as problem complexity increases.
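
Illustrative note (not taken from the paper): the fragility of scalarization that the abstract mentions is easy to see with a linear scalarization, which is an assumption here since the abstract does not fix the functional form. The two policies and their value vectors are hypothetical.

```python
def best_policy(value_vectors, weights):
    """Return the policy name that maximizes the linear scalarization w . v."""
    def score(name):
        return sum(w * v for w, v in zip(weights, value_vectors[name]))
    return max(value_vectors, key=score)


# Hypothetical two-objective returns (e.g. speed, safety) for two policies.
policies = {"fast": (10.0, 0.0), "safe": (0.0, 10.0)}

choice_a = best_policy(policies, (0.51, 0.49))
choice_b = best_policy(policies, (0.49, 0.51))
```

A shift of 0.02 in the weights flips the selected policy from `fast` to `safe`, which is why presenting the whole set of trade-off policies, as MORL does, can be more informative than committing to one weight vector.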

[LG-161] Reconstructing Randomly Masked Spectra Helps DNNs Identify Discriminant Wavenumbers

Link: https://arxiv.org/abs/2606.21289
Authors: Yingying Wu, Jinchao Liu, Yan Wang, Stuart Gibson, Margarita Osadchy, Yongchun Fang
Categories: Machine Learning (cs.LG)

Abstract: Nondestructive detection methods based on vibrational spectroscopy are vitally important in a wide range of applications, including industrial chemistry, pharmacy, and national defense. Recently, deep learning has been introduced into vibrational spectroscopy, showing great potential. Unlike images and text, for which large labeled datasets are available, vibrational spectroscopic data is very limited, which requires novel concepts beyond transfer and meta learning. To tackle this, we propose a task-enhanced augmentation network (TeaNet). The key component of TeaNet is a reconstruction module that inputs randomly masked spectra and outputs reconstructed samples that are similar to the original ones, but include additional variations learned from the domain. These augmented samples are used to train the classification model. The reconstruction and prediction parts are trained simultaneously, end-to-end with back-propagation. Results on both synthetic and real-world datasets verified the superiority of the proposed method. In the most difficult synthetic scenarios TeaNet outperformed CNN by 17%. We visualized and analysed the neuron responses of TeaNet and CNN, and found that TeaNet’s ability to identify discriminant wavenumbers was excellent compared to CNN. Our approach is general and can be easily adapted to other domains, offering a solution to more accurate and interpretable few-shot learning.

[LG-162] Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

Link: https://arxiv.org/abs/2606.21271
Authors: Marco Pratticò, Pietro Novelli, Massimiliano Pontil, Carlo Ciliberto
Categories: Machine Learning (cs.LG)

Abstract:Sparse rewards pose a central challenge in reinforcement learning, since agents receive no informative signal until they reach their goal. Intrinsic-reward methods address this issue by optimizing non-stationary objectives such as novelty, prediction error, or skill diversity, thereby injecting a supervision signal into the problem. While effective, these methods often require that the extrinsic (sparse) reward can be evaluated – either online or during offline relabeling of the stored transitions. This limitation is particularly vexing for multi-task, meta-, and continual reinforcement learning, where agents’ interactions with the environment are usually reward-free. In this work, we present a method to pre-train transferable exploration policies that rapidly adapt to sparse rewards at downstream task time. Our objective maximizes state-space covering for the occupancy measure, and can be framed in terms of entropy maximization. Its algorithmic implementation, ROVER, leverages recent advances on the operatorial formulation of RL to estimate occupancy with a learned resolvent world model, bypassing common hurdles associated with density and entropy estimation. ROVER further introduces a virtual “sink” state for unexplored regions, balancing coverage of known states with expansion into unseen ones and preventing cyclic expansion-collapse behavior during learning. In tabular and pixel-based sparse navigation tasks, ROVER produces more uniform aggregate coverage and stronger initializations for downstream tasks than standard reward-free baselines.
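
Illustrative note (not taken from the paper): the abstract frames coverage maximization as entropy maximization over the state occupancy measure. In a small tabular setting, that objective can be read off directly from visit counts, as sketched below. This only illustrates the quantity being maximized; ROVER itself is described as estimating occupancy with a learned resolvent world model, specifically to avoid direct density and entropy estimation. The state names are hypothetical.

```python
import math
from collections import Counter


def occupancy_entropy(visited_states):
    """Shannon entropy (in nats) of the empirical state-visit distribution."""
    counts = Counter(visited_states)
    total = len(visited_states)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical rollouts over four grid cells.
uniform_visits = ["a", "b", "c", "d"] * 5      # even coverage
collapsed_visits = ["a"] * 17 + ["b", "c", "d"]  # mostly stuck in one cell

h_uniform = occupancy_entropy(uniform_visits)
h_collapsed = occupancy_entropy(collapsed_visits)
```

Even coverage attains the maximum `log(4)` for four states, while a policy that lingers in one cell scores lower, so higher entropy corresponds to broader exploration.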

[LG-163] Intrinsic Flow Matching on Quantum Pure-State Manifolds with Phase-Aligned Transport

Link: https://arxiv.org/abs/2606.21256
Authors: Jian Xu, Delu Zeng, John Paisley, Qibin Zhao
Categories: Machine Learning (cs.LG)

Abstract: Quantum pure-state ensembles live on complex projective space, making flat Euclidean generative modeling geometrically mismatched. We introduce Intrinsic Flow Matching (IFM), a deterministic transport framework on $\mathbb{CP}^{d-1}$ that learns tangent velocity fields using Pancharatnam phase-aligned conditional paths. IFM replaces local score teachers and reverse-time stochastic sampling with manifold probability flow, while horizontal parameterization removes redundant ambient directions. We show that the IFM objective recovers the induced marginal transport field, represents deterministic projective ensemble flows, and yields endpoint and stability guarantees. Empirically, IFM often improves over ambient Euclidean flow matching across higher-qubit, multimodal, spin-coherent, physics-inspired, and amplitude-encoded MNIST image-vector benchmarks, with strongest gains on high-dimensional and coherence-sensitive tasks but not uniformly across every metric.

[LG-164] Gradient-Free Warm-Start Library Recovery: an Amortized-Regret Separation

Link: https://arxiv.org/abs/2606.21253
Authors: Jianwei Lou (RailMind Systems, Neuss, Germany)
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 21 pages, 2 figures

Abstract: Continual learning that is gradient-free, local, online, and append-only is attractive for edge and streaming deployment, but its value is usually argued informally. We give a provable account on recurring-regime streams. Given segmentation, a warm-start library learner attains amortized recovery cost $O\!\big(KD/\varepsilon^2+(R-K)\log K/\Delta^2\big)$ versus a memoryless re-estimator’s $\Theta(RD/\varepsilon^2)$, an advantage $(R-K)\,\Theta(D/\varepsilon^2)$ growing with dimension $D$ and recurrence density. The mechanism is a decoupling: recognizing which of $K$ seen regimes is active costs $O(\log K/\Delta^2)$, independent of $D$, whereas estimating a regime costs $\Theta(D/\varepsilon^2)$. We prove this is tight: matching lower bounds give recognition $\Theta(\log K/\Delta^2)$ and a memoryless-class bound $\Omega(RD/\varepsilon^2)$, so each term is individually minimax-tight (the joint statement is conditional). The separation is born-immune (a memoryless learner’s advantage is identically zero) and paradigm-level: it matches, and does not beat, a fair spawn-capable Bayesian baseline; the contribution is attaining this cost structure without end-to-end backprop and with zero forgetting by construction. A count-calibrated variant ties the baseline’s leading constant up to a bounded, never-negative per-recurrence overshoot, hyperparameter-free and with no per-step transcendentals. We bound the scope: recognizable regimes are capped by simplex packing (walls $e^{\Theta(D)}$); autonomous segmentation is impossible at the packing wall (no detector escapes the false-alarm/delay frontier as regimes overlap); the advantage vanishes under overlap. The dimension-dependent separation is corroborated on synthetic streams and real $k$-mer genome distributions (memoryless cost $\propto D^{1.04}$, recognition $D$-independent); the one real sequential stream sits in the $D=1$ near-null corner.
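
Illustrative note (not taken from the paper): the stated advantage follows from subtracting the two cost expressions in the abstract. The shorthand $C_{\text{mem}}$ and $C_{\text{lib}}$ for the memoryless and library costs is introduced here for convenience, and the constants hidden in the $O$ and $\Theta$ bounds are dropped, so this is an order-of-magnitude sketch rather than an exact identity.

```latex
\begin{align*}
C_{\text{mem}} - C_{\text{lib}}
  &\approx \frac{RD}{\varepsilon^2}
     - \left( \frac{KD}{\varepsilon^2} + (R-K)\,\frac{\log K}{\Delta^2} \right) \\
  &= (R-K)\left( \frac{D}{\varepsilon^2} - \frac{\log K}{\Delta^2} \right).
\end{align*}
```

When the per-regime estimation cost $D/\varepsilon^2$ dominates the recognition cost $\log K/\Delta^2$, the difference reduces to $(R-K)\,\Theta(D/\varepsilon^2)$: each of the $R-K$ recurrences saves one full re-estimation, which is why the gain grows with both $D$ and the recurrence density.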

[LG-165] Comparative Evaluation of Machine Learning and Deep Learning Models for Wound-Rotor Synchronous Motor Performance Prediction

Link: https://arxiv.org/abs/2606.21230
Authors: Kıvanç Doğan, Ahmet Orhan
Categories: Machine Learning (cs.LG)
Comments: This paper was presented on April 19, 2026, at the 10th ISPEC International Congress on Modern Scientific Research

Abstract: Wound-rotor synchronous motors (WRSMs) have emerged as a strong alternative that eliminates dependence on rare-earth elements (REEs). However, WRSM design requires the simultaneous optimization of numerous geometric and electromagnetic parameters, and the high computational cost of conventional finite element analysis (FEA) severely limits the rapid exploration of the large parameter space. Although there are machine-learning-based surrogate modeling studies in the literature, they generally compare only a limited number of models, exclude deep learning architectures, and do not provide a comprehensive benchmark specific to WRSMs. In this study, the performance of a total of eight machine learning and deep learning models from four different algorithmic families was systematically compared for the prediction of WRSM torque and motor efficiency. On a dataset of 3351 samples generated using Latin Hypercube Sampling in the Motor-CAD simulation environment, each model was trained with 10 different random seed values and tuned via Optuna hyperparameter optimization. Unlike the existing literature, this study jointly offers a broad model spectrum including recent deep learning architectures such as FT-Transformer, a multi-seed reproducibility protocol, and a Pareto analysis of the computational cost-accuracy trade-off. The results revealed that neural-network-based models systematically outperform tree-based models. The FT-Transformer model achieved the highest single-model accuracy with R^2 = 0.9928, producing predictions in 0.33 milliseconds and thus obtaining several orders of magnitude speedup compared to FEA. Model performances were evaluated in a multidimensional manner using R^2, MAE, RMSE, and MAPE metrics.
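
Illustrative note (not taken from the paper): the four regression metrics named in the abstract have standard definitions, sketched below in plain Python. The sample values are made up and unrelated to the paper's Motor-CAD dataset; MAPE is expressed in percent and assumes no true value is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, RMSE, and MAPE (percent) for paired true/predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Hypothetical torque values: three exact predictions and one miss.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Reporting all four together is useful because R^2 is scale-free, MAE and RMSE are in the units of the target (RMSE penalizing large misses more), and MAPE is relative to each true value.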

[LG-166] Sakana Fugu Technical Report

Link: https://arxiv.org/abs/2606.21228
Authors: Yujin Tang, Edoardo Cetin, Jinglue Xu, Qi Sun, Stefan Nielsen, Vincent Richard, Haruto Goda, Iaroslav Tymchenko, Nhan Nguyen, Hyunin Lee, Mari Ashiga, Shashank Kotyan, So Kuroki, Tarin Clanuwat
Categories: Machine Learning (cs.LG)

Abstract:The capabilities of frontier Large Language Models (LLMs) continue to advance, with different providers increasingly specializing in distinct domains. This raises a natural next objective: how to combine the individual specializations of various LLMs into a collectively intelligent system. To this end, we report the development of Sakana Fugu, a family of orchestrator models that harness and amplify the capabilities of an LLM agent team. Fugu models are themselves language models trained to understand user queries and dynamically devise agentic scaffolds to solve them. Through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent, achieving state-of-the-art results compared to other publicly accessible models across a range of challenging tasks, including SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity’s Last Exam, and CharXiv Reasoning. We release two models: Fugu, which balances performance with latency for everyday use, and Fugu-Ultra, which prioritizes answer quality on the hardest problems. We describe our training paradigm, which encompasses large-scale fine-tuning, evolutionary algorithms, and reinforcement learning approaches, along with the infrastructure and core design principles that turn these methods into a production system. We hope this report encourages further research into multi-agent systems and dynamic, query-adaptive agentic scaffolds as a path toward the next frontier of AI capabilities, accessed through collective intelligence.

[LG-167] DCD-PFN: A Decoupling-Aware Foundation Model for Causal Discovery

Link: https://arxiv.org/abs/2606.21212
Authors: Zhengkang Guan, Yikang Chen, Yi He, Yunze Tong, Zijing Hu, Haoyuan Qian, Fei Wu, Kun Kuang
Categories: Machine Learning (cs.LG)
Comments: 11 pages

Abstract:Causal discovery is critical for understanding complex data-generating mechanisms, yet traditional algorithms often struggle with highly non-linear and noisy systems, or suffer from severe computational bottlenecks. Recent tabular foundation models based on Prior-Data Fitted Networks (PFNs) have demonstrated remarkable zero-shot inference capabilities, but their potential for explicit structural causal discovery remains underexplored. To bridge this gap, we propose DCD-PFN, a decoupling-aware foundation model for causal discovery. Instead of directly amortizing global graph reconstruction, DCD-PFN focuses on local causal discovery through a decoupling-based paradigm. Through pre-training on diverse synthetic Structural Causal Models (SCMs), the model learns sample-wise decoupling weights that enable Markov boundary (MB) identification. Furthermore, by leveraging parallelized local discovery, DCD-PFN efficiently reconstructs global causal graphs while remaining grounded in the theoretical foundations of decoupling-based causal discovery. Experiments demonstrate that our foundation model achieves robust zero-shot generalization.

[LG-168] Rejections Based on Predictive Uncertainty Enable Reliable Routine Soil Spectroscopy

Link: https://arxiv.org/abs/2606.21179
Authors: Jonas Schmidinger, Robin Gebbers, Marc-Olivier Gasser, Viacheslav Barkov, G.Mick Wu, Viacheslav I. Adamchuk
Categories: Machine Learning (cs.LG)

Abstract:Soil properties relevant to agricultural and environmental applications are conventionally measured using elaborate laboratory methods involving physical and chemical processing. While highly accurate, these conventional methods are costly and time-consuming. In contrast, optical spectroscopy paired with machine learning enables rapid and cost-effective predictions of multiple soil properties. However, spectroscopic modelling is often considered unreliable, as the predictive accuracy varies between soil properties and individual samples. To balance this trade-off between cost and reliability, we introduce reject-to-remeasure: an AI-based measurement framework that combines probabilistic modelling with uncertainty-guided rejection. In this framework, soil samples are first analysed using spectroscopy, after which predictions are rejected if their predictive uncertainty exceeds predefined quality constraints. Rejected samples are subsequently remeasured using conventional laboratory procedures. On a regional visible-near-infrared spectral soil library from Québec, we demonstrate that reject-to-remeasure with modern foundation models (TabPFNv2.5 and TabICLv2) can facilitate the integration of optical spectroscopy into routine laboratory workflows while meeting user-defined accuracy requirements and reducing measurement costs.
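
Illustrative note (not taken from the paper): the reject-to-remeasure workflow in the abstract reduces to a thresholding step on predictive uncertainty. The sketch below assumes each sample comes with a point prediction and a scalar uncertainty such as a predictive standard deviation. The threshold, sample IDs, and values are hypothetical, and the paper's quality constraints may be defined differently.

```python
def reject_to_remeasure(predictions, uncertainties, max_uncertainty):
    """Split samples into accepted spectroscopic predictions and lab remeasurements.

    predictions and uncertainties are dicts keyed by sample ID. A prediction is
    kept only if its uncertainty does not exceed max_uncertainty.
    """
    accepted, rejected = {}, []
    for sample_id, value in predictions.items():
        if uncertainties[sample_id] <= max_uncertainty:
            accepted[sample_id] = value
        else:
            rejected.append(sample_id)
    return accepted, rejected


# Hypothetical soil organic carbon predictions (%) with predictive std. dev.
predictions = {"s1": 2.1, "s2": 3.4, "s3": 1.8}
uncertainties = {"s1": 0.10, "s2": 0.55, "s3": 0.20}

accepted, to_lab = reject_to_remeasure(predictions, uncertainties, max_uncertainty=0.25)
rejection_rate = len(to_lab) / len(predictions)
```

Tightening `max_uncertainty` sends more samples to the laboratory and raises cost, while loosening it accepts more spectroscopic predictions at lower reliability, which is the cost-reliability trade-off the abstract describes.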

[LG-169] Dead-Direction Signatures: A Cheap Spectral Reading of Singular Complexity

Link: https://arxiv.org/abs/2606.21158
Authors: Tejas Pradeep Shirodkar, P. J. Narayanan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 42 pages, 10 figures, 6 tables. Empirical companion to arXiv:2606.05957

Abstract: Singular learning theory characterises the complexity of a deep network through the geometry of its loss singularities. The local learning coefficient (LLC), the standard estimator of Watanabe’s real log canonical threshold (RLCT, $\lambda$), reads this geometry as an integrated Bayesian scalar through SGLD, which needs per-task calibration and $10^4$ to $10^6$ forward-backward passes per checkpoint. We introduce Dead-Direction Signatures (DDS), a family of cheap closed-form spectral readings of singular structure: each reads a network’s activation matrix or per-sample-gradient Fisher-Gram at a chosen layer, replacing the SGLD posterior chain with spectral linear algebra. The readings rest on a dead-direction framework that predicts a structural correlation between activation- and Fisher-side spectra at any singular minimum, and a rank-multiplicative volume identity that single-eigenvalue monitors cannot produce: the active-volume $\log\det^+(G)$ slope counts the dead directions, tracking the rank-deficit $r$ across $r \in \{1,2,3,4\}$ (slope ratios $2.0, 3.1, 4.0$ at $r=2,3,4$ against the predicted $2,3,4$), where the smallest eigenvalue is rank-blind. On reduced-rank regression with closed-form $\lambda$, calibrated LLC recovers $\lambda$ at 99% mean and the DDS observables rank-track it at the framework-predicted sign; on a non-linear modular-addition transformer DDS separates $d_{\mathrm{model}}$ across eighteen orders of magnitude where calibrated LLC at the protocol budget is rank-flat. Complementary to LLC’s integrated posterior reading, DDS gives a directional, layer-local handle on a network’s dead directions, read in closed form from its activation and gradient spectra.

[LG-170] Horizon Adaptive Offline Policy Learning via Value Stitching

Link: https://arxiv.org/abs/2606.21136
Authors: Kexin Zheng, Xianyuan Zhan, Xintao Yan
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Learning accurate value functions plays a decisive role for reinforcement learning (RL) agents to solve long-horizon, complex tasks. Conventional temporal-difference (TD) learning objectives suffer from value-estimation bias that accumulates over the horizon, while extended-horizon modeling methods, such as n-step TD backups and Q-chunking, adopt a rigid, fixed-horizon value-modeling recipe that is often not flexible enough to capture complex value structures in long-horizon, multi-stage tasks. In this paper, we show that enabling value updates with dynamic horizon composition can yield a strong offline policy learning scheme. Our method, Horizon Adaptive Offline Policy Learning via VAlue STitching (VAST), replaces fixed-horizon backups with recursive, horizon-adaptive value composition. Its key ingredient is to couple value optimization with a future state- and horizon-length-conditioned auxiliary value function that is learned through direct data supervision, and a stitching policy that optimally selects the reward-maximizing horizon length and future sub-goal to achieve horizon-adaptive value stitching. This design enables direct estimation and compositional “stitching” of variable-length returns grounded in actionable sub-goal states, providing an accurate and greedily exploitable value-supervision signal for offline policy optimization. Across 50 tasks on OGBench, VAST outperforms fixed-step, extended-horizon methods, and generative-value offline RL baselines, achieving strong performance particularly in high-complexity, long-horizon decision-making tasks.

[LG-171] What Accuracy and Gradient Cosine Miss: Evaluating Feedback Alignment via Scale Stability, Reference Validity, and Depth Utility

Link: https://arxiv.org/abs/2606.21126
Authors: Yuren Hao, Xiang Wan, ChengXiang Zhai
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 12+10 pages

Abstract:Despite the success of deep learning, training deep networks in biologically plausible and hardware-efficient ways remains an open challenge. Feedback alignment (FA) methods address this by replacing backpropagation’s symmetric backward weights with fixed random matrices, but their effectiveness depends critically on whether they can be accurately evaluated. The standard evaluation relies on two quantities: task accuracy and cosine similarity between the method’s credit signal and the backpropagation gradient. We show that this reporting pair is insufficient by identifying two independent failure modes, both silent under current reporting: (1) measurement degeneracy, where the BP reference gradient collapses to the numerical floor in terminal-LayerNorm residual architectures, rendering cosine uninterpretable; and (2) aggregation collapse, where the aggregate cosine masks layerwise heterogeneity that concentrates credit at one end of the network. To address these limitations, we propose a diagnostic evaluation protocol based on three checks – scale stability, reference validity, and depth utility – together with per-layer rather than aggregate cosine reporting. Across multiple architectures and methods, the standard reporting pair gives no signal of failure in any audited case, while our protocol identifies all failures with wide calibration margins. The two failure modes are causally independent: a per-block scale penalty alleviates Mode 1 (residual scale explosion driving reference collapse) without affecting Mode 2 (cosine ranking that contradicts every functional metric we measured). Identifying these silent failures prevents researchers from building on non-functional credit assignment and provides actionable guidance for developing FA methods that genuinely train deep layers.

[LG-172] Enhancing Differentially Private Mechanisms via Empirical Bayes

Link: https://arxiv.org/abs/2606.21107
Authors: Minwoo Kim, Junyong Park, Sungkyu Jung
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Methodology (stat.ME)
Comments: 27 pages. Under review

Abstract:Differential privacy (DP) has become the gold standard for ensuring the privacy protection of machine learning and statistical algorithms in recent decades. A plethora of algorithms and methods have been developed to enhance the utility of DP algorithms while maintaining the same level of DP. However, these are often overly complex or computationally ineffective. We propose a novel approach focusing on denoising the output of the simple additive Gaussian mechanism by adopting the idea of empirical Bayes estimation. We highlight that the empirical Bayes approach can reduce the mean-squared error solely by taking the output of the Gaussian mechanism as input. Our numerical studies show that this simple yet powerful approach can be applied to improve upon various statistical problems, including histogram release, principal component analysis, and linear regression, often outperforming existing private algorithms.
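The denoising idea in this abstract can be illustrated with a minimal sketch. The abstract does not say which empirical Bayes estimator is used, so the positive-part James-Stein shrinkage below is an assumption chosen for illustration, and the function names `gaussian_mechanism` and `james_stein_denoise` are hypothetical. The sketch only uses the noisy output and the noise scale, so it is pure post-processing of the mechanism's output.

```python
import random


def gaussian_mechanism(values, sigma, rng):
    """Add i.i.d. N(0, sigma^2) noise to each coordinate.

    sigma is assumed to be already calibrated to the privacy budget.
    """
    return [v + rng.gauss(0.0, sigma) for v in values]


def james_stein_denoise(noisy, sigma):
    """Positive-part James-Stein shrinkage toward the origin (illustrative choice).

    Only the noisy vector and the noise scale sigma are used as input.
    """
    d = len(noisy)
    sq_norm = sum(y * y for y in noisy)
    # Shrinkage is only defined here for d >= 3 and a nonzero vector.
    if d < 3 or sq_norm == 0.0:
        return list(noisy)
    factor = max(0.0, 1.0 - (d - 2) * sigma ** 2 / sq_norm)
    return [factor * y for y in noisy]


def mse(estimate, truth):
    """Mean-squared error between two equal-length vectors."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [0.1] * 50  # hypothetical statistic to release
    noisy = gaussian_mechanism(truth, sigma=1.0, rng=rng)
    denoised = james_stein_denoise(noisy, sigma=1.0)
    print("raw MSE:", mse(noisy, truth))
    print("denoised MSE:", mse(denoised, truth))
```

Shrinkage helps most when the noise is large relative to the signal; the estimators and tasks studied in the paper may differ from this toy setting.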

[LG-173] BASIL: Bayesian Application for Scientific Iteration and Learning

Link: https://arxiv.org/abs/2606.21092
Authors: Kelvin P. Idanwekhai, Valeriia Kaneva, Stefano Menegatti, Alexander Tropsha
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 6 pages, 3 figures

Abstract:We introduce BASIL, a user-friendly desktop application for process optimization. BASIL employs a Bayesian approach, incorporating special acquisition functions that can be used to solve both single and multi-objective optimization problems. It provides a graphical interface that enables users to input their experimental parameters, optimization objectives, and legacy data. This is then used to build surrogate models, which are coupled with acquisition functions to guide and optimize a process towards a desired objective. To facilitate model building, BASIL provides a variety of predefined surrogate model templates. BASIL can be used to optimize any arbitrary experiment or process with known, user-defined input variables, optimization objectives, and defined output.

[LG-174] OVIG: Optimistic Verification of AI Training Integrity via Gradient Signals

Link: https://arxiv.org/abs/2606.21045
Authors: Hongxu Su, Jianzhu Yao, Huan Zhang, Xuechao Wang, Pramod Viswanath
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures, 11 tables. Submitted to IEEE Symposium on Security and Privacy (SP)

Abstract:The rapid growth of AI has increased the demand for domain-specific post-training, while the cost and specialization of accelerator infrastructure push many model owners to outsource this process. Outsourced training lowers operational barriers, but creates a training-integrity gap: the owner receives a checkpoint, logs, and aggregate metrics without direct evidence that the declared training trajectory was faithfully executed. An untrusted provider may have incentives to deviate from that trajectory, either to save computation or to introduce targeted security risks. Auditing such deviations is difficult because floating-point execution on heterogeneous accelerators introduces benign numerical drift, making it hard to distinguish honest replay differences from integrity violations. Existing verification methods either observe training at too coarse a granularity or impose costs and deployment constraints that are impractical at scale. We present OVIG, an optimistic verification framework that audits outsourced post-training using an empirical boundary on gradient differences calibrated from honest heterogeneous replays. OVIG checks opened intervals against this boundary and combines optimistic sampling with a stride parameter $s$, which partitions training into stride-aligned intervals and retains only interval-endpoint evidence. Across shortcut training attacks and targeted manipulation attacks, OVIG maintains 0% ASR on language, vision, and diffusion workloads. On Qwen3, increasing the stride from $s=1$ to $s=2000$ reduces off-chain storage and evidence transmission by $1996\times$ while preserving 0% ASR; at this setting, OVIG incurs only $1.143\times$ total system overhead relative to training without verification. These results show that OVIG provides a practical integrity layer for outsourced AI post-training under heterogeneous execution.
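The stride bookkeeping described above can be sketched as follows. This is only an illustration of partitioning training steps into stride-aligned intervals and keeping their endpoints; the helper names `stride_intervals` and `retained_endpoints` are hypothetical, and the gradient-boundary check and optimistic sampling that OVIG actually performs are not modeled here.

```python
def stride_intervals(num_steps, stride):
    """Partition steps 0..num_steps into stride-aligned intervals (start, end).

    The last interval is shorter when num_steps is not a multiple of stride.
    """
    if num_steps <= 0 or stride <= 0:
        raise ValueError("num_steps and stride must be positive")
    return [
        (start, min(start + stride, num_steps))
        for start in range(0, num_steps, stride)
    ]


def retained_endpoints(num_steps, stride):
    """Return the sorted step indices at which evidence would be kept."""
    points = set()
    for start, end in stride_intervals(num_steps, stride):
        points.add(start)
        points.add(end)
    return sorted(points)


if __name__ == "__main__":
    # Hypothetical run length; larger strides keep far fewer endpoints.
    for s in (1, 100, 2000):
        print(s, len(retained_endpoints(10_000, s)))
```

The number of retained endpoints shrinks roughly in proportion to the stride, which is the intuition behind the storage reduction reported in the abstract; the exact accounting there depends on details not given in this listing.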

[LG-175] Demystifying Numerical Instability in LLM Inference: Achieving Reproducible Inference for Mission-Critical Tasks with HEAL

Link: https://arxiv.org/abs/2606.21023
Authors: Zhenting Zhu, Lucas Thai, Shan Yu, Yicheng Liu, Yifan Qiao, Chenxi Wang, Harry Xu, Junyi Shu
Subjects: Machine Learning (cs.LG)

Abstract:As Large Language Models (LLMs) deploy into mission-critical domains (e.g., finance, medicine, and law), output reproducibility has become a strict system requirement. While practitioners use greedy decoding to eliminate algorithmic stochasticity, empirical deployments with 16-bit precisions still exhibit catastrophic output divergence across heterogeneous GPUs. Through SASS-level profiling, we reveal that this inconsistency is fundamentally driven by truncation errors introduced during downcasting at kernel boundaries. However, achieving reproducibility via a global FP32 pipeline incurs prohibitive system penalties: bypassing 16-bit hardware accelerators hurts compute efficiency, while upcasting the KV cache doubles memory overhead. To bridge this gap, we propose Hybrid Error ALleviation (HEAL), a targeted intervention that approximates FP32 precision while resolving hardware constraints through two targeted mechanisms. First, recognizing that floating-point formats underutilize their bit-width for Q, K, V tensors, HEAL applies INT16 quantization that preserves numerical stability without expanding the KV cache footprint. Second, HEAL synthesizes high-precision matrix multiplications via an algebraic error compensation strategy, executing entirely on high-throughput 16-bit Tensor Cores. To evaluate our approach practically, we introduce MCR-Bench, a benchmark targeting reproducibility in mission-critical tasks. HEAL achieves the same level of reproducibility on downstream tasks as the FP32 baseline while reducing the performance overhead by up to 7.1x.

[LG-176] Continuous-Time Probabilistic Correctors for Uncertainty-Aware Physics-Based Spacecraft Trajectory Forecasting

Link: https://arxiv.org/abs/2606.21021
Authors: Muhammad Bilal Shahid, Zhanhong Jiang, Soumik Sarkar, Cody Fleming
Subjects: Machine Learning (cs.LG)

Abstract:Long-horizon spacecraft trajectory forecasting suffers from error accumulation due to the absence of corrective observations in the forecast regime, making reliable uncertainty estimation crucial for safety-critical decision-making such as space domain awareness and conjunction assessment. While high-fidelity physics-based orbit propagators provide accurate deterministic forecasts, they typically lack calibrated uncertainty estimates over long horizons. We introduce a Predictor–Corrector framework in which a physics-based continuous-time deterministic forecaster is augmented with a learned continuous-time probabilistic Corrector that models forecast errors. The proposed Corrector can be wrapped around an existing deterministic propagator to improve forecast accuracy while producing sharp and calibrated full-covariance uncertainty estimates. The Corrector is based on Latent Neural Controlled Differential Equations (Latent NCDEs) and models the probabilistic temporal evolution of forecast errors in continuous time, naturally supporting irregular sampling and missing features. We further introduce a loss function that promotes calibration and sharpness in long-horizon uncertainty propagation. We evaluate the proposed framework on long-horizon spacecraft trajectory forecasting using real-world data from NASA’s Crustal Dynamics Data Information System (CDDIS), wrapping the Corrector around NASA’s General Mission Analysis Tool (GMAT). Across forecast horizons of 2–4 days without observations and six rolling test windows, the proposed approach consistently improves accuracy and uncertainty calibration compared to deterministic baselines and Latent ODE-based correctors, demonstrating the effectiveness of the continuous-time probabilistic Corrector for trajectory forecasting.

[LG-177] Inductive Generalization for Robotic Manipulation

Link: https://arxiv.org/abs/2606.20999
Authors: Annabella Macaluso, Haochen Zhang, Ishaan Masilamony, Yingshan Chang, Yonatan Bisk
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Understanding the generalization capabilities of visuomotor policies is essential in the development of capable robotic agents. Generalizable models learn structures that transfer across domains. However, in practice, visuomotor policies are tested by interpolation on known distributions under unstructured domain shifts (e.g., lighting, clutter, diverse objects). We argue that to measure generalization capabilities we must instead test the inductive capacity of policies on progressively harder, out-of-distribution task variants. We call this inductive generalization, drawing directly on how axis-based evaluation has revealed inherent generalization limitations in language models (e.g., sequence length, counting; arXiv:2502.00197). We provide a reusable and formal evaluation protocol for measuring inductive generalization in any manipulation policy, and establish baselines showing that existing paradigms, e.g., SoTA Vision-Language-Action models, fail this test. We also find that policies that appear to generalize to prior domain shifts (distractors, etc.) fail inductive generalization tests. These results expose a class of learning challenges that are orthogonal to those addressed by data and model scaling in robot learning, yet imperative to solve in order to realize general purpose robots.

[LG-178] Physics-Guided Fully Convolutional Spatiotemporal Learning Toward Digital-Twin-Enabled Microstructure Evolution Prediction

Link: https://arxiv.org/abs/2606.20983
Authors: Michael Trimboli, Wenxi Liu, Xianqi Li
Subjects: Machine Learning (cs.LG)

Abstract:Understanding and predicting microstructure evolution is central to materials design, yet purely data-driven spatiotemporal learning models often suffer from limited physical consistency and degraded long-term prediction accuracy. In this work, we introduce a physics-guided fully convolutional spatiotemporal learning framework for microstructure evolution prediction. Unlike prior self-supervised approaches, the proposed method explicitly incorporates governing physical equations into the training objective, thereby encouraging the learned dynamics to remain consistent with known thermodynamic and kinetic laws. This physics-guided formulation improves predictive accuracy, long-horizon stability, and robustness across spatial resolutions and temporal prediction settings. Extensive experiments for spinodal decomposition demonstrate that incorporating physics-guided residual regularization leads to more faithful reproduction of microstructural morphology, statistics, and evolution trends compared with purely data-driven baselines. The proposed framework preserves the scalability and computational efficiency of fully convolutional architectures while bridging the gap between high-fidelity physics-based simulations and data-driven surrogate modeling, offering a reliable and efficient surrogate-modeling step toward digital-twin-enabled microstructure evolution prediction.

[LG-179] Formalizing Task-Space Complexity for Zero-Shot Generalization

Link: https://arxiv.org/abs/2606.20967
Authors: Jung-Hoon Cho, Heling Zhang, Siqi Du, Roy Dong, Cathy Wu
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Policies must operate across diverse conditions, yet a single policy is often conservative while fully adaptive schemes can be complex. We study zero-shot generalization in contextual dynamical systems and introduce a performance-centric, directional task dissimilarity, the signed divergence, that upper bounds the generalization gap from a source context to a target context. The signed divergence induces $\varepsilon$-tolerance sets that certify when a source policy class generalizes, and it yields a concrete notion of task-space complexity: the minimum number of source contexts needed so that every target context incurs at most $\varepsilon$ generalization gap. Under a mild local smoothness assumption on performance, the induced tolerance sets admit certified inner/outer balls and instance-dependent volume bounds on task-space complexity. In the finite-oracle setting, source selection reduces to set cover; a greedy strategy inherits the standard $H(n)$ approximation guarantee. Using a Mass-Spring-Damper system with linear-quadratic regulator (LQR) controllers and a nonlinear CartPole system with deep reinforcement learning controllers, we show that greedy selection achieves the same $\varepsilon$-coverage with fewer policies than uniform or random baselines. Our approach delivers a performance-based task similarity measure and practical certificates for building generalizable control with simple policies.
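The reduction of source selection to set cover can be sketched with the classic greedy heuristic. The sketch below assumes the tolerance sets are already available as a dictionary from each candidate source context to the set of target contexts it covers within the tolerance; the function name `greedy_source_selection` and the toy data are hypothetical, and computing the tolerance sets from the signed divergence is not modeled.

```python
def greedy_source_selection(coverage, targets):
    """Greedy set cover: repeatedly pick the source covering most uncovered targets.

    coverage: dict mapping a source context to the set of target contexts
              it covers within the tolerance (assumed precomputed).
    targets:  iterable of all target contexts that must be covered.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Ties are broken by dictionary insertion order.
        best = max(coverage, key=lambda s: len(coverage[s] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            raise ValueError("some targets cannot be covered by any source")
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    # Hypothetical tolerance sets over five target contexts.
    coverage = {
        "a": {1, 2, 3},
        "b": {3, 4},
        "c": {4, 5},
        "d": {1, 5},
    }
    print(greedy_source_selection(coverage, range(1, 6)))
```

Greedy selection does not always return the smallest cover, but it carries the standard harmonic-number approximation guarantee that the abstract refers to.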

[LG-180] Equilibrium with Internal Transfers

Link: https://arxiv.org/abs/2606.20960
Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)

Abstract:Nash equilibrium (NE) arises from selfish utility maximization, yet its social welfare can be arbitrarily far from optimal. Moreover, computing an NE is intractable in general. We study augmented game models in which players use budget-balanced internal transfers to improve incentives before play. We first introduce Self-Enforcing Transfer Equilibrium (SETE), where players commit to nonnegative peer-to-peer transfers that are paid only if the recipient does not deviate from a prescribed strategy. For polymatrix games, we show that every stationary point of the social welfare function, in particular any socially optimal strategy profile, can be sustained as a SETE. This induces a Nash equilibrium in the agent normal form of the corresponding augmented game. We further propose a polynomial-time algorithm and a decentralized learning dynamic to compute such product-form equilibria. We then introduce Mediated Self-Enforcing Transfer Equilibrium (M-SETE), where a mediator makes both the payment schedule and the prescribed strategies binding offers. This additional enforcement resolves the agent-normal-form limitation: an M-SETE is a Nash equilibrium of the augmented game itself, not merely of its agent normal form, and any socially optimal strategy profile can be supported as an M-SETE in any finite game while preserving budget balance. Thus, internal transfers improve welfare and computation while preserving independent play on the equilibrium path. When full sequential-game stability is required, binding mediation provides the corresponding implementation.

[LG-181] A Gated Graph Neural Network Approach to Fast-Convergent Dynamic Average Estimation

Link: https://arxiv.org/abs/2606.20955
Authors: Antonio Marino, Claudio Pacchierotti, Paolo Robuffo Giordano
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 18 pages

Abstract:Dynamic average estimation is a critical problem in multi-agent systems, enabling agents to collaboratively estimate time-varying signals using only local information exchange. Traditional model-based approaches often face challenges related to convergence speed and sensitivity to network topology changes. This paper introduces a novel learning-based solution leveraging Gated Graph Neural Networks (GGNNs) for fast-convergent dynamic average estimation in a fully distributed manner. Taking advantage of the inherent structure of GGNNs, the proposed method models the estimation process as a distributed autoregressor, ensuring rapid convergence while maintaining stability. We incorporate a regularization term during training to enforce convergence guarantees and introduce an encoding-decoding mechanism to reduce communication overhead without sacrificing accuracy compared to standard GGNNs. Extensive numerical experiments demonstrate that our approach significantly outperforms conventional model-based estimators in terms of both convergence speed and precision, making it a promising alternative for multi-agent applications that require dynamic average estimation.

[LG-182] Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention

Link: https://arxiv.org/abs/2606.20945
Authors: Vishesh Tripathi, Abhay Kumar
Subjects: Machine Learning (cs.LG)

Abstract:Self-attention is central to Transformer performance and is often the most expensive part of the Transformer at long context lengths because its pairwise token interactions scale quadratically with sequence length. Standard dense attention also applies the same set of attention heads to every token regardless of token difficulty or information content. This uniform activation can waste compute, especially as sequences grow longer and attention cost increases rapidly. We propose Grouped Query Experts (GQE), a mixture-of-experts layer on top of grouped-query attention (GQA). Within each GQA group, a router selects k query-head experts per token while all key-value (KV) heads remain dense and unchanged. Thus, GQE keeps the KV cache benefits of GQA and reduces only the active query-head computation. On a fixed 30B token budget at the 250M parameter scale, GQE matches the all-active GQA baseline in downstream accuracy while activating half the query heads per token.
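The per-token routing step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: router scores are taken as given (the abstract does not say how they are computed or how selected heads are weighted), query heads are split into equal-sized GQA groups, and the helper name `select_query_heads` is hypothetical. Key-value heads are not touched.

```python
def select_query_heads(router_scores, num_groups, k):
    """Pick the top-k query heads inside each GQA group for one token.

    router_scores: one score per query head, ordered group by group.
    num_groups:    number of GQA groups (one shared KV head per group).
    k:             number of query-head experts kept active per group.
    Returns the sorted indices of the active query heads.
    """
    num_heads = len(router_scores)
    if num_groups <= 0 or num_heads % num_groups != 0:
        raise ValueError("heads must split evenly into groups")
    group_size = num_heads // num_groups
    if not 0 < k <= group_size:
        raise ValueError("k must be between 1 and the group size")

    active = []
    for g in range(num_groups):
        heads = range(g * group_size, (g + 1) * group_size)
        # Highest-scoring heads first; ties keep the lower head index.
        top = sorted(heads, key=lambda h: router_scores[h], reverse=True)[:k]
        active.extend(sorted(top))
    return active


if __name__ == "__main__":
    # Hypothetical scores for 8 query heads in 2 groups of 4.
    scores = [0.1, 0.9, 0.5, 0.3, 0.2, 0.8, 0.7, 0.4]
    print(select_query_heads(scores, num_groups=2, k=2))
```

With `k` equal to half the group size, half of the query heads are active per token, which mirrors the setting reported in the abstract.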

[LG-183] Learning through Internalization

Link: https://arxiv.org/abs/2606.20937
Authors: Nikolaos Tsilivis, Nirmit Joshi, Marko Medvedev, Julia Kempe, Nati Srebro
Subjects: Machine Learning (cs.LG)
Comments: first version, 43 pages

Abstract:We study internalization processes, by which neural-network-based systems absorb an explicit computational procedure into their own weights, and how they facilitate learning. We investigate how transformers internalize the simulation of semiautomata by internalizing chain-of-thought (CoT) tokens, which classes of semiautomata are harder to internalize, and expose the flip side of internalization, that is, a progressive degradation of out-of-distribution performance. We then provide the first provable analysis of successful internalization: for the task of learning parities, we show that a simplified one-layer transformer provably first learns the target with explicit CoT supervision and then internalizes the autoregressive generation as CoT tokens are progressively removed, learning to directly compute the parity. This task is computationally hard to learn from data without CoT supervision. Finally, we discuss how learning through internalization relates to the Positive Distribution Shift phenomenon recently introduced by [Med+26].

[LG-184] Towards Robust Training in NNGPT AutoML Pipeline: A Loss-Optimizer Pairing Selection Study

Link: https://arxiv.org/abs/2606.20933
Authors: Anton Abramochkin, Radu Timofte, Dmitry Ignatov
Subjects: Machine Learning (cs.LG)

Abstract:The choice of loss function and optimizer is an important decision that shapes subsequent model training. Yet automated architecture search (AutoML) pipelines benefit significantly more from an optimal pairing selection, and vice versa. This paper investigates whether a single recipe is sufficient for heterogeneous architecture pools, or whether the optimal pairing varies across structurally diverse models. We conduct a systematic empirical study of all $3 \times 6 = 18$ combinations of six optimizers (SGD+Momentum, Adam, AdamW, RMSprop, Adagrad, Adadelta) paired with three loss functions: Cross-Entropy (CEL), Negative Log-Likelihood (NLL), and the recently introduced genetically evolved NGL loss. The study covers the base models in the LEMUR heterogeneous architecture pool on six image classification datasets (CelebA-Gender, CIFAR-10, CIFAR-100, ImageNette, MNIST, SVHN). The 18 loss-optimizer configurations are applied to each of the 33 compatible base architectures taken from the LEMUR pool, resulting in 594 variants that were generated fully automatically by a source-level injection pipeline and evaluated under fixed hyperparameters, ensuring that observed accuracy differences are attributable solely to the loss-optimizer pairing. Our results confirm that no single pairing is universally optimal. Cross-Entropy with Adam or AdamW is the most robust choice across architecture families and datasets. NGL is a competitive alternative to CEL on standard convolutional classifiers, but only when paired with adaptive optimizers; it degrades substantially with SGD or accumulation-based methods. Adagrad and Adadelta consistently underperform under fixed hyperparameters regardless of loss function, highlighting their sensitivity to learning rate tuning. These findings provide actionable guidance for loss-optimizer selection within the NNGPT framework.
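The experiment grid in this abstract is a plain Cartesian product, and the arithmetic (18 pairings, 594 variants) can be checked with a short sketch. The loss and optimizer names are taken from the abstract; the architecture identifiers are placeholders, since the listing does not name the 33 LEMUR base models, and `enumerate_variants` is a hypothetical helper rather than part of the NNGPT pipeline.

```python
from itertools import product

LOSSES = ["CEL", "NLL", "NGL"]
OPTIMIZERS = ["SGD+Momentum", "Adam", "AdamW", "RMSprop", "Adagrad", "Adadelta"]


def loss_optimizer_pairings():
    """All (loss, optimizer) pairings studied: 3 x 6 = 18."""
    return list(product(LOSSES, OPTIMIZERS))


def enumerate_variants(architectures):
    """One variant per (architecture, loss, optimizer) triple."""
    return list(product(architectures, LOSSES, OPTIMIZERS))


if __name__ == "__main__":
    # Placeholder names standing in for the 33 compatible base architectures.
    architectures = [f"arch_{i:02d}" for i in range(33)]
    print(len(loss_optimizer_pairings()))            # number of pairings
    print(len(enumerate_variants(architectures)))    # number of variants
```

Each variant would then be trained on every dataset under the same fixed hyperparameters, so that differences can be attributed to the pairing.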

[LG-185] Hierarchical Pooling for Sheaf Neural Networks

Link: https://arxiv.org/abs/2606.20932
Authors: Dionisia Naddeo, Carlo Abate, Pietro Liò, Nicola Toschi, Filippo Maria Bianchi
Subjects: Machine Learning (cs.LG)

Abstract:Sheaf Neural Networks (SNNs) generalize Graph Neural Networks (GNNs) by replacing scalar node signals with stalk-valued signals and by using restriction maps to measure compatibility across edges. Unlike standard graph diffusion, which encourages neighboring node features to become similar, sheaf diffusion promotes consistency through the restriction maps and can therefore model more general relationships between neighboring nodes. However, existing sheaf neural architectures mainly operate at a fixed graph resolution and do not provide a principled pooling mechanism for building hierarchical representations. In this paper, we introduce Hierarchical Sheaf Pool (HiSP), a sheaf-aware pooling framework based on local spectral coarsening. Given a partition of the graph, HiSP constructs each coarse stalk by projecting fine stalk-valued features onto the low-frequency eigenmodes of the cluster-internal sheaf Laplacian. These local modes define a cochain-level prolongation map, which allows the fine sheaf energy to be represented on the coarse space through a Galerkin operator. We further analyze the approximation induced by coarsening by separating truncation loss, due to discarded local modes, from realization loss, due to representing the projected operator as a coarse sheaf. Finally, we implement HiSP as a GNN pooling layer compatible with SNNs and provide a PyG implementation supporting batching, lifted sheaf Laplacians, and hierarchical architectures.

[LG-186] Ω: Operator-based Mixture Ensemble for Generative Assimilation

Link: https://arxiv.org/abs/2606.20920
Authors: Pouria Behnoudfar, Nan Chen
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:Characterizing non-Gaussian posterior distributions in partially observed high-dimensional nonlinear systems remains a fundamental challenge in data assimilation. Ensemble Kalman filters rely on Gaussian approximations that can be inaccurate for strongly non-Gaussian posteriors, whereas particle filters suffer from severe scalability limitations. Recent score-based generative approaches improve posterior characterization but typically require supervised training with ground-truth posterior samples, which are unavailable in most practical applications. We introduce $\Omega$ (Operator-based Mixture Ensemble for Generative Assimilation), a scalable framework that integrates conditional Gaussian surrogate modeling, unsupervised score learning, and generative sampling. The conditional Gaussian surrogate provides a nonlinear non-Gaussian baseline approximation while admitting closed-form conditional posterior distributions for the unresolved variables. First, $\Omega$ exploits these closed-form conditional distributions to analytically recover the high-dimensional unobserved component, reducing computational cost and mitigating the curse of dimensionality. Second, $\Omega$ learns only the residual discrepancy beyond an analytical baseline through denoising score matching using ensemble trajectories alone, eliminating the need for ground-truth posterior samples and substantially reducing the learning burden. Third, $\Omega$ reconstructs the full non-Gaussian posterior distribution of both observed and unobserved variables via a Gaussian mixture representation, capturing multimodal, skewed, and heavy-tailed statistics. Finally, $\Omega$ employs annealed Langevin sampling to iteratively refine ensemble members from the baseline toward the target posterior. $\Omega$ is validated on several turbulent models with intermittency and extreme events, consistently improving posterior accuracy.

[LG-187] Physics-Guided Dual-Stream Heterogeneous Graph Neural Network for Predicting Full-Field Structural Response of Stiffened Panels

Link: https://arxiv.org/abs/2606.20916
Authors: Yuecheng Cai, Jasmin Jelovica
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Iterative design and optimization of large, complex structures require fast and accurate prediction of stress, displacement, and other fields. Finite element analysis (FEA) is computationally expensive for this task. Existing neural network surrogates often struggle with varying topologies and complex boundary conditions. This study proposes the novel Dual-Stream Heterogeneous Graph Neural Network (DS-HGNN) for full-field stress and displacement prediction in thin-walled structures, demonstrated on box beams made of stiffened panels. DS-HGNN operates on panel-level heterogeneous graph representations and introduces physics-guided edge states initialized from edge types, spatial information, and boundary kinematics. These states are updated through dual-stream message passing that separates longitudinal and transverse structural information while allowing cross-stream exchange. Geometry and loading effects are incorporated through Feature-wise Linear Modulation (FiLM)-conditioned 1-D spectral convolutions, and physical fields are reconstructed using a spectral-bypass low-rank readout. The model is evaluated on stiffened panel datasets with different geometries, boundary kinematics, loading conditions, and material nonlinear responses. DS-HGNN achieves the lowest stress and displacement RMSE compared with six benchmark heterogeneous graph neural network models. It also reaches comparable accuracy to the strongest benchmark models using 19%-38% fewer training samples. A targeted evaluation further shows that DS-HGNN captures yield and post-yield stress features.

[LG-188] Temporal Causal Prior-Data Fitted Networks for Panel Data with Learned Reliability Signals

Link: https://arxiv.org/abs/2606.20889
Authors: Shravan Talupula, Saurabh Sharma
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 42 pages, 6 figures, 9 tables. Code and pretrained checkpoint to be released

Abstract:Estimating causal effects in industrial time series requires handling temporal dynamics, time-varying treatments, and unobserved confounders. Existing causal foundation models (CausalPFN, CausalFM) operate only on static cross-sectional data; neural temporal methods (CRN, G-Net) require per-dataset training; and concurrent temporal-PFN proposals have not been demonstrated at industrial scale. None output explicit per-pair reliability signals alongside their CATE estimates. We introduce Temporal Causal Prior-Data Fitted Networks (TCPFN), a foundation model for zero-shot temporal causal discovery with learned reliability signals. TCPFN makes four contributions: (1) a Causal Judgment Head that jointly predicts null-effect probability, confounding strength, identifiability, mediation fraction, and causal regime; (2) a mixed training prior covering six causal regimes (independent, direct, confounded, mediated, time-varying confounded, feedback) plus CausalFM-style front-door and instrumental-variable priors; (3) a discrete-token panel-data architecture with cross-attention masking that prevents inter-horizon leakage; (4) zero-shot inference at industrial scale via FAISS-based context selection and one-step posterior correction. On 19 benchmark datasets across five domains, TCPFN achieves competitive zero-shot causal discovery: AUROC 0.96 on Tennessee Eastman, 0.93 on SWaT, 0.98 on Causal Rivers, 0.97 on CAUSRCA. The null detector reaches NullF1 0.94, AUROC 0.99. TCPFN scales to V=1,275 on a proprietary Kraft pulp-and-paper dataset in 6 hours on a single GPU; PCMCI, a CPU-only library, on a V=666 sub-panel of the same data took 81.5 hours, extrapolating by O(V^2) to ~12.5 days at V=1,275. TCPFN’s top edges identify cross-subsystem causal relationships while PCMCI’s surface within-instrument controller-measurement coupling – a scalability case study.

[LG-189] Understanding Latent Flow Models for Tabular Data Synthesis: Targets Paths and Sampling ECML-PKDD2026

Link: https://arxiv.org/abs/2606.20878
Authors: Bahrul Ilmi Nasution
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at ECML-PKDD 2026, Applied Data Science Track. 18 pages main, 27 pages supplementary, 7 figures, 45 tables

Abstract:Synthetic tabular data enables microdata sharing in regulated domains, yet deploying continuous-time generative models requires balancing analytical utility, disclosure risk, and computational cost. Latent-space flow models are flexible, but theoretical equivalences across learning targets, probability paths, and sampling dynamics can translate into different behaviour under finite-step integration and explicit compute budgets. We present an empirical study of tabular latent flow models across seven datasets, evaluating velocity, score, noise, and posterior matching objectives under optimal transport (OT) and variance-preserving (VP) paths, ODE and SDE sampling, and varying integration budgets. Our contributions are threefold: (1) we show that the learning target largely determines the utility-risk operating regime, with velocity and posterior matching tending to yield higher utility, while score and noise matching tend to achieve lower disclosure risk; (2) we demonstrate that configuration and sampling choices shift performance, with midpoint often improving distributional fidelity and OT paths often tolerating earlier stopping than VP, enabling compute savings under fixed budgets or risk thresholds; and (3) we distil these findings into actionable defaults and practical configuration guidance to support pre-release model selection under disclosure risk and resource constraints. The code implementation and supplementary materials can be accessed in this https URL.

[LG-190] Synthetic Network Packet Generation through Statistical Learning and Genetic Algorithms

Link: https://arxiv.org/abs/2606.20864
Authors: Mayank Raj, Nathaniel D. Bastian, Lance Fiondella, Gokhan Kul
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Developing robust intrusion detection systems (IDS) for IoT environments requires large, labeled datasets capturing realistic traffic distributions across both benign and malicious activity. Existing public datasets suffer from fixed activity distributions and extreme class imbalance, while deep generative models (GANs, VAEs) provide no mechanism to enforce that synthetic packets remain within physically valid feature ranges. This paper proposes and compares two constraint-enforcing approaches for synthetic IoT network packet generation: (i) a statistical learning method combining PCA-based latent space sampling with dual One-Class SVM (OCSVM) and Isolation Forest (IF) boundary enforcement, and (ii) a genetic algorithm (GA) method that treats packet generation as a multi-objective optimization problem with explicit fitness criteria for anomaly model acceptance and distributional fidelity. Both methods embed hard validity constraints – dual anomaly-detection gating, feature-range clamping, and independent validation – directly into the synthesis pipeline. Evaluation on the complete ACI IoT 2023 dataset (1,231,411 packets, 12 attack categories, class imbalance up to 175,805:1) demonstrates that both methods achieve PASS status across all categories under independently trained validators with a 30% anomaly rate threshold: the statistical method attains 1.20% average anomaly rate with ~1,091 packets/s throughput, while the GA attains 0.62% average anomaly rate with organic per-class variance (0.00%-2.50%) at ~5.7 packets/s. Both methods successfully amplify the 5-sample ARP Spoofing category by 200x to 1,000 validated packets. The ~190:1 throughput ratio between methods, combined with their complementary quality profiles, provides evidence-based selection criteria for deployment contexts ranging from rapid dataset augmentation to adversarial robustness testing.

[LG-191] Evolutionary Discovery of Developmental Reward Schedules in Deep Reinforcement Learning

Link: https://arxiv.org/abs/2606.20858
Authors: Alan Nadelsticher Ruvalcaba
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at the 2026 IEEE International Conference on Development and Learning (ICDL)

Abstract:The temporal structure of reward composition in reinforcement learning (RL) is typically hand-designed and held fixed throughout training, leaving the progression of motivational priorities largely unexplored. In this work, we propose an evolutionary framework for discovering developmental reward schedules, in which three distinct biologically inspired motivational components – agency, novelty, and reactivity – are combined through time-varying weights that dynamically shift over the course of training. Evaluated on two sparse-reward MiniGrid tasks: DoorKey-6x6 and KeyCorridorS3R1, our framework compares the generalizability of four evolutionary algorithms: CMA-ES, xNES, DE, and L-SHADE against an extrinsically motivated baseline (our main comparison point), and three additional hand-designed methods. On DoorKey-6x6, all evolved methods outperform the non-evolved baselines, with L-SHADE achieving the best performance – an approximate relative mean improvement of 11.4% over the extrinsic only baseline. On KeyCorridorS3R1, CMA-ES achieves the best overall performance, with the remaining evolved methods showing weaker and less reliable generalization capability compared to the extrinsic only baseline. Interestingly, the discovered schedules diverge from our defined developmental ordering, with novelty consistently emerging as the dominant early signal during training, across both tasks. Collectively, our results position evolutionary optimization as a promising approach for developmental reward schedule discovery in deep reinforcement learning, and suggest that what evolution finds to be optimal in computational settings may differ from what it finds to be optimal in biology. The code for this project can be found at: this https URL.

[LG-192] CELEUS: Certifiable and Efficient LLM Evaluation via E-Processes

Link: https://arxiv.org/abs/2606.20820
Authors: Zhijian Zhou, Zesheng Ye, Zhaorun Chen, Bo Li, Feng Liu
Categories: Machine Learning (cs.LG)

Abstract:Can we trust evaluation scores to capture an LLM’s true real-world performance? Certifiable evaluation answers this question by providing guarantees for LLM evaluation. In particular, existing methods sequentially curate evaluation samples and keep updating confidence intervals (CIs) that cover the true performance with high probability (e.g., 95%) until some conditions are satisfied, e.g., the CI width reaches a target precision. However, existing methods are not generally anytime-valid: the claimed coverage (e.g., 95%) may fail when CIs are repeatedly updated and used to decide when to stop, leaving a gap between theoretical rigor and practice. This paper bridges this gap by proposing Celeus, a Certifiable framework for Efficient LLM evaluation, which leverages E-processes to build anytime-valid CIs. Concretely, we propose signals that combine two ingredients: (i) Uncertainty-guided sampling to select informative samples for evaluation, and (ii) Surrogate-assisted approximations for unevaluated samples. We prove that such signals remain unbiased for the evaluation score conditional on the past, enabling statistically grounded and anytime-valid e-process CIs. More importantly, the two ingredients reduce estimation variance and help reach the target precision with fewer evaluated samples. We also prove that CIs obtained by Celeus can shrink at a near-parametric rate up to logarithmic factors and analyze the oracle variance-optimal sampling rule that motivates the empirical uncertainty-guided one. Experiments show that Celeus reaches the target precision using 54-62% fewer evaluated samples than baselines, while preserving anytime-valid coverage.

[LG-193] Provably Sub-Linear Two-Timescale NeuroEvolution with Online Plasticity ECAI2026 IJCAI

Link: https://arxiv.org/abs/2606.20817
Authors: Shishen Lin, Yixin Chen
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 28 pages; Accepted at IJCAI-ECAI 2026

Abstract:NeuroEvolution of Augmenting Topologies (NEAT) is a widely used neuroevolution algorithm for learning neural network architectures and weights for control tasks. However, standard offline optimisation searches for connection strengths directly, which can scale poorly in high-dimensional weight spaces and more difficult continuous control problems. Hybrid methods that combine neuroevolution with online learning can address this challenge, but their theoretical properties remain underexplored. This paper gives the first regret analysis for a general NeuroEvolutionary Online Learning (NEOL) framework, which decouples learning into two timescales: an outer loop for architecture search and an inner loop for online weight adaptation via reward-modulated plasticity. Under mild conditions, we prove that NEOL achieves sublinear regret. Empirically, under fixed interaction budgets on four standard control benchmarks, a NEAT-based NEOL implementation achieves higher final fitness and lower variance than pure NEAT, and is competitive with strong reinforcement learning (RL) baselines on several tasks. The results are supported by Wilcoxon rank-sum tests and ablation studies. Overall, the findings show that online plasticity can improve the sample efficiency and robustness of two-timescale neuroevolution. Code is available at this https URL.

[LG-194] Eigenspace-Based Clustering for Personalized System Identification

Link: https://arxiv.org/abs/2606.20811
Authors: Abdulmoneam Ali, Dipankar Maity, Ahmed Arafa
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:We study the problem of system identification in heterogeneous settings, where different systems may follow distinct underlying dynamics. Existing clustered system identification approaches often rely on iterative training-based cluster assignment, which can be sensitive to learning uncertainty and model initialization. In contrast, we propose a one-shot, training-free clustering method that identifies similar systems using the structure of their locally observed data. Specifically, each system estimates a local state covariance matrix, and cluster identities are inferred by measuring the alignment between the leading covariance eigenspaces of different systems. We provide a mathematical interpretation of the proposed similarity score and develop a finite-sample analysis that characterizes how covariance estimation error induces eigenspace perturbations in terms of the underlying system dynamics. We then derive a probability bound for pairwise false merges and a global clustering success guarantee. Numerical experiments demonstrate that the proposed eigenspace-based clustering method effectively identifies systems with shared dynamics, leading to lower personalized model-estimation error compared with training-based clustering and non-clustered baselines.
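
The eigenspace comparison described in this abstract can be illustrated with a small sketch. It is not the authors' method: it assumes a one-dimensional leading eigenspace, uses plain power iteration, and scores alignment as the absolute cosine between the two leading eigenvectors. The function names `covariance`, `leading_eigenvector`, and `alignment` are hypothetical.

```python
def covariance(samples):
    """Sample covariance matrix (list of lists) of equal-length state vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]


def leading_eigenvector(matrix, iters=500):
    """Power iteration for the dominant eigenvector of a symmetric PSD matrix."""
    d = len(matrix)
    # Asymmetric start vector; power iteration fails only if the start is
    # exactly orthogonal to the dominant eigenvector.
    v = [float(i + 1) for i in range(d)]
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def alignment(samples_a, samples_b):
    """Absolute cosine between the two leading eigenvectors, in [0, 1]."""
    u = leading_eigenvector(covariance(samples_a))
    v = leading_eigenvector(covariance(samples_b))
    return abs(sum(x * y for x, y in zip(u, v)))
```

Two systems whose data vary along the same direction score close to 1, while systems with near-orthogonal dominant directions score close to 0, so a threshold on this score gives a one-shot grouping without any iterative training.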

[LG-195] ELADO: Elliptic PDE Assessment Datasets for Operator Learning

Link: https://arxiv.org/abs/2606.20771
Authors: Frank Ehebrecht, Toni Scharle, Martin Atzmueller
Categories: Machine Learning (cs.LG)

Abstract:We introduce ELADO (Elliptic PDE Assessment Datasets for Operator Learning), a systematic benchmark suite constructed to show and quantify failure modes of neural operator architectures when learning solution operators of elliptic PDEs. While the benchmarks of existing datasets focus on average-case performance, the ELADO datasets are constructed to highlight challenges that arise naturally in elliptic PDE problems. In particular, we construct several datasets built around Poisson’s equation and the Helmholtz equation, each with non-constant coefficients. We define a controllable data-generating process to create datasets, each designed to isolate a distinct source of difficulty. Specifically, these are (1) heavy-tailed solution distributions arising from light-tailed coefficient field distributions, (2) spectral distribution shift of the input data, (3) heavy-tailed distributions in the frequency domain of solutions, arising from light-tailed coefficient field distributions, (4) input sensitivity of learned operators, quantified by an empirical local Lipschitz analysis, and (5) the effect of input signal complexity on prediction accuracy under controlled amplitude normalization. We evaluate several neural operator architectures across all datasets and show that heavy-tailed targets, spectral shift, and input sensitivity each cause substantial degradation of the prediction accuracy that standard datasets and metrics (e.g., the mean relative L^2 error) may obscure.

[LG-196] Evidential Fusion Network for Multimodal Survival Prediction under Missing Modalities

Link: https://arxiv.org/abs/2606.20757
Authors: Yucheng Xing, Hailan Mo, Zi Wang, Ling Huang, Mengling Feng
Categories: Machine Learning (cs.LG)

Abstract:Recent multimodal survival prediction models have demonstrated strong predictive performance by leveraging complementary information across modalities. However, such models generally assume data completeness and exhibit limited robustness toward missing modalities, which are frequently encountered in real-world clinical settings. We propose the Evidential Missing Modality Survival Fusion (EMMS) model for multimodal survival prediction under missing modalities. EMMS offers a straightforward, computationally effective approach to survival analysis without requiring a generative phase for missing data. By employing Dempster-Shafer theory and Gaussian Random Fuzzy Numbers for multimodal decision fusion, it considers both aleatoric and epistemic uncertainty alongside modality reliability for fusion. Moreover, the model treats missing modalities as vacuous evidence, preventing interference with available inputs and naturally reflecting increased uncertainty and calibrated predictions. Extensive experiments on four cancer datasets demonstrate state-of-the-art performance while providing calibrated and interpretable uncertainty estimates under incomplete multimodal observations, without introducing additional computational overhead.

[LG-197] CIExplainer: Generating Causal and Interpretable Explanations for Graph Neural Networks

Link: https://arxiv.org/abs/2606.20747
Authors: Francisco Caldas, Sahil Satish Kumar, Ruben Belo, Cláudia Soares
Categories: Machine Learning (cs.LG); Other Statistics (stat.OT)

Abstract:Explainable Artificial Intelligence aims to make black-box models more trustworthy by presenting, in a human-understandable manner, the elements that lead to the model’s output. This involves both (i) identifying components and connections with genuine causal influence on outputs and (ii) translating such structures into an interpretable representation. For the former, we introduce CIExplainer, a novel perturbation-based method grounded in causal inference for explaining Graph Neural Networks (GNNs). CIExplainer identifies the subgraph with the highest causal effects on GNN predictions using the Potential Outcome Framework. We evaluate and compare CIExplainer on various GNN architectures (GCN, GraphSAGE, GAT, GIN) and datasets. To bridge subgraph explanations with human interpretability, we further propose G2TeXplainer, a method that transforms causal subgraphs into natural language explanations that capture both feature-level and relational information.

[LG-198] FPGA-Accelerated Neuromorphic Vision System for Real-Time Orbital Object Detection

Link: https://arxiv.org/abs/2606.20727
Authors: Diego Hernández, Sebastián Valdivia, Vicente Westerhout, Esteban Vera, Daniel Yunge
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 9 pages, 12 figures

Abstract:The escalating congestion in orbital space demands advanced monitoring solutions. This work presents a comprehensive open-source framework for neuromorphic resident space object (RSO) detection, adapting the foundational grid clustering algorithm for FPGA acceleration. The system integrates a single event-based camera (EBC) with a custom, distributed processing architecture, where rapid spatial quantization is executed in programmable logic (FPGA) and cluster formation is managed by a software client. We validate this architecture through systematic sampling of night-sky observations from the EVAS dataset, demonstrating 97% detection accuracy for RSOs. The implementation, which serves as a foundational toolkit for event-based FPGA processing, achieves efficient throughput with a total power consumption of 8.5 W and deterministic processing latencies below 62 ms. The architecture’s energy efficiency and high-precision detection position it as a viable solution for distributed space surveillance networks.
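
The split between spatial quantization and cluster formation mentioned in this abstract can be sketched in software. This is only an illustrative toy, not the paper's FPGA design: the cell size, the `min_events` threshold, and the 8-connected merging rule are all assumptions, and the function names are hypothetical.

```python
from collections import Counter, deque


def quantize(events, cell_size):
    """Map (x, y) pixel events to grid cells and count events per cell."""
    return Counter((x // cell_size, y // cell_size) for x, y in events)


def cluster_cells(cell_counts, min_events):
    """Group active cells (count >= min_events) into 8-connected clusters."""
    active = {cell for cell, n in cell_counts.items() if n >= min_events}
    clusters, seen = [], set()
    for start in sorted(active):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        # Breadth-first search over neighbouring active cells.
        while queue:
            cx, cy = queue.popleft()
            members.append((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in active and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(sorted(members))
    return clusters
```

In this sketch `quantize` stands in for the step the abstract assigns to programmable logic, and `cluster_cells` for the step it assigns to the software client; raising `min_events` suppresses isolated noise events at the cost of missing faint objects.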

[LG-199] A Generalized Formalism of Auto-Regressive Decoding for Speech Processing INTERSPEECH2026

Link: https://arxiv.org/abs/2606.20714
Authors: Julia Gachot, Philipp Allgeuer, Marie S. Bauer, Stefan Wermter
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2026

Abstract:In speech processing, most state-of-the-art sequence prediction models rely on auto-regressive (AR) strategies to generate output sequences based on the raw predictions of the model. Despite their crucial role in the inference process, a comprehensive overview of AR strategies as a unified field is lacking, due largely to implicit and multiple definitions of next-token decoding. This context complicates the choice, comparison, and evaluation of strategies, while creating inconsistencies in the characterization of approaches as auto-regressive or not. We begin by setting explicit inclusion criteria for the field of AR search in speech processing, and derive a generalized theoretical framework to categorize and report on search strategies for neural models. We show the capabilities of this formalism in simplifying the design of benchmarks centered around the decoding process, allowing for ablation studies that are focused on search strategies.

[LG-200] Analysis and Prediction of At-Risk Students Using Machine Learning Algorithms

Link: https://arxiv.org/abs/2606.20617
Authors: Soheila Gheisari, Hamid Salarian
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Student attrition represents a significant challenge for higher education institutions because it impacts both academic results and financial viability. Machine learning provides an effective solution to identify students who require assistance before they leave their academic programs. The research investigates how machine learning approaches enable institutions to predict student withdrawal and enrollment cancellation through data-driven insights for strategic decision-making. The evaluated models include Logistic Regression, Random Forest, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN), using academic performance, demographic data, and enrollment records. The results show that logistic regression and linear SVM models produced the highest accuracy, which demonstrates ML’s capability to detect students at risk.

[LG-201] Estimating Learners Skill Acquisition Without Temporal Information

Link: https://arxiv.org/abs/2606.20611
Authors: Ryosuke Nagai, Kyohei Atarashi, Koh Takeuchi, Jill-Jênn Vie (SODA), Hisashi Kashima
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Recent research in educational data mining, especially knowledge tracing, has focused on predicting learners’ future knowledge states to support adaptive instruction. However, in many real-world educational settings, learning data are often available only as single-time-point assessments without temporal information, making existing time-series-based approaches difficult to apply. In this paper, we propose a novel framework for predicting future skill acquisition using only snapshot data. Specifically, we address the problem of predicting the next skill to be acquired from skill mastery patterns estimated by cognitive diagnostic models (CDMs). In the absence of temporal information, we exploit inclusion relations among learners’ skill sets to induce a pseudo-temporal ordering, interpreting expanding skill sets as a proxy for learning progression. To efficiently approximate unobserved acquisition paths, we introduce a neural model that captures latent skill acquisition dynamics through expected skill increments. Experiments on both synthetic and real-world datasets demonstrate that the proposed method consistently outperforms baseline approaches, with particularly strong advantages as the skill space becomes larger. These results indicate that meaningful skill acquisition patterns can be inferred from snapshot data alone, providing a practical framework for adaptive learning support in data-constrained educational environments.
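
The idea of reading a pseudo-temporal ordering off inclusion relations can be made concrete with a simple counting baseline. This is a sketch under stated assumptions, not the neural model the abstract proposes: it treats any learner whose skill set strictly contains another's as being "further along", and it predicts the extra skill that appears most often among those supersets. The function names are hypothetical.

```python
from collections import Counter


def next_skill_counts(skill_set, population):
    """Count skills held by learners whose skill sets strictly include `skill_set`."""
    base = frozenset(skill_set)
    counts = Counter()
    for other in population:
        other = frozenset(other)
        if base < other:  # strict inclusion: `other` is treated as further along
            counts.update(other - base)
    return counts


def predict_next_skill(skill_set, population):
    """Most frequent additional skill among strict supersets, or None if there are none."""
    counts = next_skill_counts(skill_set, population)
    if not counts:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(counts, key=lambda s: (-counts[s], s))
```

The baseline only needs snapshot data, which is the setting the abstract targets, but it ignores how many steps separate two skill sets; the expected-skill-increment model described in the abstract is meant to handle that more carefully.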

[LG-202] Empowering Embodied AI in 6G Networks: Architecture Enablers and Open Challenges

Link: https://arxiv.org/abs/2606.20592
Authors: Junaid Sajid, Sheikh Salman Hassan, Wenshuai Liu, Yan Kyaw Tun, Yaru Fu, Nguyen H. Tran, Zhu Han, Cedomir Stefanovic, Tharmalingam Ratnarajah, Muhammad Mahtab Alam
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE Communications Magazine for possible publication

Abstract:Embodied artificial intelligence (AI) is emerging as a key driver of the sixth-generation (6G) wireless networks by enabling agents that continuously perceive, communicate, and act in dynamic physical environments. Unlike conventional AI systems that process disembodied data, embodied agents such as robots, autonomous vehicles, and extended reality (XR) devices operate through closed-loop perception-communication-action (PCA) interactions, where communication performance directly affects physical behavior, control stability, and task success. However, existing AI-native wireless architectures remain largely connectivity-centric and are not designed to support task-driven embodied intelligence at large scale. Therefore, we present a holistic framework for embodied AI-native 6G systems, in which communication, sensing, computation, and control are jointly designed as a unified closed-loop infrastructure. We introduce a system-level PCA architecture, discuss key enabling technologies and representative applications, and highlight major open challenges in multimodal intelligence, edge-aware deployment, evaluation, trustworthiness, and practical implementation. Our central argument is that future 6G systems must evolve from intelligent communication platforms into active enablers of embodied physical intelligence.

[LG-203] Action-BED: Task-Driven Bayesian Experimental Design with Singly Intractable Objectives

Link: https://arxiv.org/abs/2606.23662
Authors: Tom Rossa, Angus Phillips, Tom Rainforth
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Bayesian experimental design (BED) has traditionally been based on maximising expected uncertainty reductions from prior to posterior. A major shortfall of this approach is that it leads to doubly intractable objectives that are difficult to optimise, while customising them to particular downstream tasks of interest can also be difficult. Following first principles decision theory, we demonstrate that BED can alternatively be formulated in terms of an expected future loss (EFL) on downstream actions, providing a simple and naturally task-driven framework. Critically, we then show that all such EFLs can be rearranged into singly intractable objectives that can be jointly optimised with respect to both the design policy and a downstream action policy using stochastic gradients, an approach we refer to as ACTION-BED. This formulation further sidesteps the need for any explicit posterior or marginal likelihood estimation and is naturally implicit, requiring only the ability to sample from the joint model over model parameters and data, and evaluate the downstream loss function. It thus allows design policies to be learned more effectively, efficiently, and simply than existing methods, while providing easy customisation to different downstream tasks and losses.

[LG-204] Diffusion Models Adapt to Low-Dimensional Structure Under Flexible Coefficient Choices

Link: https://arxiv.org/abs/2606.23627
Authors: Changxiao Cai, Yuchen Jiao, Gen Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Diffusion models are known to exploit unknown low-dimensional structure to accelerate sampling. However, existing convergence theory under low-dimensional data structure has largely focused on update rules with narrowly prescribed coefficient choices. This raises a fundamental question: is adaptation to low-dimensional structure sensitive to the precise choice of update coefficients? In this paper, we show that such adaptation is a robust property of diffusion models. For a broad class of update coefficients, we prove that $\widetilde{O}(k/\varepsilon)$ iterations suffice to generate an $\varepsilon$-accurate sample in total variation (TV) distance, independently of the ambient dimension. Our framework substantially broadens the class of diffusion samplers known to enjoy low-dimensional adaptation and applies to several commonly used methods in practice. These results provide a theoretical justification for the empirical effectiveness of diffusion samplers across different coefficient choices when applied to structured, high-dimensional data.

[LG-205] Neural Networks as Linear Regression: An Introduction for Statisticians

Link: https://arxiv.org/abs/2606.23601
Authors: Abigail Loe, Susan Murray, Zhenke Wu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Neural networks are a commonly used prediction tool in computer science and statistics. However, the barrier to entry of this interesting field remains high, particularly for classical statisticians trained in a frequentist perspective. In this letter, we demystify neural networks by describing networks that approximate a linear regression and by describing common customizations that provide a foundation for further study.
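
The correspondence this abstract points to can be seen in the smallest possible case. The sketch below is an illustration chosen here, not taken from the paper: a single neuron with identity activation, trained by full-batch gradient descent on mean squared error, is compared with the closed-form simple linear regression fit. The learning rate and epoch count are arbitrary assumptions.

```python
def fit_linear_neuron(xs, ys, lr=0.05, epochs=2000):
    """Train y_hat = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def ols(xs, ys):
    """Closed-form simple linear regression (slope, intercept) for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx
```

On well-conditioned data the two fits agree to numerical precision, because the neuron minimizes the same squared-error objective that least squares solves in closed form. Adding hidden layers or nonlinear activations is where the network departs from the regression model.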

[LG-206] Approximating velocity fields with planted attractors via Neural-ODEs for classification purposes

Link: https://arxiv.org/abs/2606.23550
Authors: Feliciano Giuseppe Pacifico, Duccio Fanelli, Lorenzo Buffoni, Lorenzo Chicchi, Diego Febbe, Raffaele Marino
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

Abstract:In this work, Neural ODEs equipped with a curated collection of equilibrium points have been successfully employed for classification purposes. The planted attractors serve as indicators for the target classes, while the velocity field, leveraging the universal approximation capabilities of the architecture, shapes the dynamical […]. This process defines the basins of attraction of the trained model, effectively directing each input provided as an initial condition toward its corresponding destination target.

[LG-207] SuperCond-GNN: Scalable Graph Neural Network Surrogate for Superconducting Circuit Simulations

Link: https://arxiv.org/abs/2606.23548
Authors: Nandana Menon, Giorgio Vallone
Categories: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures

Abstract:This paper presents SuperCond-GNN, a graph neural network-based surrogate model for predicting the voltage distribution in high-temperature superconducting (HTS) magnets. HTS magnets are modeled as lumped-element equivalent circuits and mapped onto graph representations, enabling message-passing GNNs to learn the electrical response as a function of circuit topology, material properties, and operating current. As a proof of concept, tape stacks of up to 10 tapes are considered across a range of circuit topologies and operating conditions. The surrogate is trained on data generated from circuit simulations and achieves a mean MAPE of 4.3% within the prescribed design space. The predicted nodal voltages enable fast and scalable inference of current redistribution and local operating conditions across a wide range of circuit configurations. The effect of incorporating physics-informed regularization via Kirchhoff’s current law is also evaluated, and generalizability to unseen topologies is assessed through zero-shot inference and few-shot fine-tuning. While demonstrated on tape stack circuits, the graph-based framework is topology-agnostic and naturally extensible to more complex HTS cable and magnet configurations, offering a scalable alternative to conventional circuit solvers for downstream applications such as design space exploration, current sharing analysis, and real-time magnet monitoring.
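
A Kirchhoff's current law regularizer of the kind this abstract mentions can be sketched for a toy circuit. The example assumes a purely linear resistive network with fixed edge conductances, which is a simplification: superconducting tapes have nonlinear voltage-current behaviour, and the paper's actual loss term is not given here. The function names and the data layout are hypothetical.

```python
def kcl_residuals(voltages, edges, injections):
    """Net current imbalance at each node of a linear resistive network.

    voltages:   predicted nodal voltages, indexed by node
    edges:      (node_i, node_j, conductance) triples
    injections: external current entering each node (negative if leaving)
    """
    residual = [-inj for inj in injections]
    for i, j, g in edges:
        current = g * (voltages[i] - voltages[j])  # current flowing from i to j
        residual[i] += current
        residual[j] -= current
    return residual


def kcl_penalty(voltages, edges, injections):
    """Mean squared KCL residual, usable as an additive loss term."""
    res = kcl_residuals(voltages, edges, injections)
    return sum(r * r for r in res) / len(res)
```

A predicted voltage vector that satisfies current conservation at every node gives a penalty of zero, so adding this term to a data-fitting loss pushes the surrogate toward physically consistent predictions even where training data are sparse.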

[LG-208] FairBED: A Bayesian Experimental Design Approach to Gathering Fairer Data

Link: https://arxiv.org/abs/2606.23515
Authors: Marcel Hedman, Emily Alger, Brieuc Lehmann, Chris Holmes, Tom Rainforth
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Frameworks for ensuring fairness in machine learning typically focus on learning fair models from existing data. But this endeavor is often undermined by biases already present in that data. We therefore look to modify the data acquisition process itself to help gather fairer data that is inherently more suitable for training fair predictors. To this end, we introduce FairBED, which provides novel formulations for quantifying the fairness of datasets themselves based on the idea that fair datasets should be uninformative about sensitive attributes. We then use this to construct practical fairness-aware Bayesian experimental design (BED) objectives that maximize expected information gain about the target quantity of interest while minimizing expected information gain about sensitive attributes. We further derive a theoretical link between FairBED and demographic parity, and show empirically that models trained on data gathered using FairBED provide improved fairness-accuracy trade-offs compared to randomly acquired data and conventional BED.

[LG-209] Sublinearly Structured Deep Neural Networks Achieve Feature Learning Consistency for Compositional Functions

Link: https://arxiv.org/abs/2606.23477
Authors: Sehwan Kim, Yan Sun, Faming Liang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Over the past decade, deep neural networks (DNNs) have achieved remarkable success on complex machine-learning tasks, yet the theoretical foundations of their performance remain incomplete. From a statistical viewpoint, a natural question is: can DNNs attain feature-learning and prediction consistency comparable to that of classical models? While a full characterization is open, we provide positive results for a broad subclass. We establish feature-learning consistency guarantees for sublinearly structured DNNs (architectures whose input/output dimensions and number of hidden neurons grow sublinearly with the sample size) when learning hierarchically compositional target functions. Importantly, this consistency still holds even in the conventional “over-parameterized” regime where the total number of parameters exceeds the number of training samples. Empirically, sublinearly structured DNNs match or surpass wide DNNs in prediction. A structural audit further indicates that widely used convolutional neural networks (CNNs), including AlexNet, VGGNet, ResNet, and GoogLeNet, are sublinearly structured on their image classification benchmarks. We further prove that the sublinearly structured DNNs achieve universal approximation for hierarchically compositional functions in the large-sample limit. Moreover, images exhibit an inherent hierarchical, compositional structure. Taken together, these results explain, through a statistical lens, why many large-scale deep learning models succeed after adequate training on massive image datasets.

[LG-210] Time Series Classification through Diffeomorphic Time Warping (DiffTW)

Link: https://arxiv.org/abs/2606.23472
Authors: Vicky Geneva Haney, Kamel Lahouel, Victor Rielly, Bruno M. Jedynak
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 38 pages including appendix and references, 8 figures

Abstract:Time series classification involves learning a mapping from a continuous, temporally ordered sequence of real-valued observations to a discrete response variable, like class labels. This task is fundamental in domains, including health monitoring, where the temporal structure of data is critical for accurate prediction. Dynamic Time Warping (DTW) is a standard technique for measuring similarity between sequences varying in time or speed. However, DTW is restricted to discrete point matching. To move beyond pairwise alignment, we propose a theoretical framework that learns mappings between real-valued functions. These mappings approximate the flow associated with the characteristic curves of a linear transport equation with a space-dependent velocity field, providing a diffeomorphic transformation between two time series. Using the method of characteristics, we transform this partial differential equation into ordinary differential equations (ODEs) modeling system dynamics. The objective function used to learn these ODEs derives from the fundamental theorem of calculus. To enable flexible, expressive representations of the velocity field, we utilize reproducing kernel Hilbert spaces and optimal control methods. Our method, Diffeomorphic Time Warping (DiffTW), provides a theoretically grounded dissimilarity measure. Using a 1-nearest neighbor classifier, DiffTW outperforms DTW on 60 of 86 datasets.
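
To make the baseline in the abstract above concrete, here is a minimal sketch of classic DTW combined with a 1-nearest-neighbor classifier. It illustrates only the discrete point-matching baseline that DiffTW is compared against, not the authors' method. The function names `dtw_distance` and `predict_1nn` and the absolute-difference local cost are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two 1-D sequences, with |x - y| as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def predict_1nn(train: List[Tuple[Sequence[float], str]], query: Sequence[float]) -> str:
    """Return the label of the training series closest to `query` under DTW.

    Assumes `train` is non-empty.
    """
    best_label, best_dist = "", math.inf
    for series, label in train:
        dist = dtw_distance(series, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Because DTW may repeat points, a series and a time-stretched copy of it can have zero distance, which is the invariance that makes it a common baseline for sequences varying in speed.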

[LG-211] Quantum Convolutional Neural Networks for Groundwater Heat Plume Prediction: A Surrogate Modeling Approach

Link: https://arxiv.org/abs/2606.23411
Authors: Danyal Maheshwari, Julia Pelzer, Miriam Schulte
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Quantum machine learning methods are increasingly explored for modeling complex environmental systems, including groundwater heat plume dynamics. In this work, we explore a Quantum Convolutional Neural Network (QCNN) as a surrogate model for predicting temperature variations in groundwater induced by geothermal heat pumps in the city of Munich. To comply with the scalability constraints of current quantum hardware, the original high-dimensional simulation output is reduced to a compact set of representative parameters that serve as training targets for the surrogate. The proposed QCNN architecture consists of a quantum convolutional layer, a quantum pooling layer, and a fully connected quantum readout stage. Convolution and pooling operations are realized via parameterized quantum circuits based on rotational gates and measurement-driven decoding, while a Hamiltonian-inspired feature-encoding scheme is used to prepare informative input states on the quantum device. We evaluate the QCNN across multiple execution backends, including an ideal statevector simulator, a noisy simulator, IBM’s 127-qubit Kyiv quantum processor, and the same hardware augmented with advanced error-mitigation techniques. Realistic noise models are employed to approximate device behavior and to assess the impact of mitigation strategies. Model performance is benchmarked using mean squared error (MSE) on both training and testing sets. The results show that, although classical neural networks still achieve the highest predictive accuracy, the QCNN attains competitive and consistent performance on simulators and exhibits noticeable improvement under error-mitigated hardware conditions. These findings indicate that quantum-enhanced surrogate modeling is a promising direction for future groundwater temperature prediction as quantum hardware and error-mitigation techniques continue to mature.

[LG-212] Ultra-Peripheral Collisions as a Nuclear-Structure Interferometer with Interpretable Multitask Deep Learning

Link: https://arxiv.org/abs/2606.23353
Authors: Jing-Zong Zhang, Wang-Mei Zha, Lingxiao Wang, Guo-Liang Ma
Categories: Nuclear Theory (nucl-th); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex); Optics (physics.optics); Quantum Physics (quant-ph)
Comments: 14 pages, 11 figures

Abstract:Precise knowledge of nuclear structure is essential across fundamental physics, yet probing these structures is notoriously difficult. To address this challenge, ultra-peripheral collisions (UPCs) provide a femtoscopic tomography for imaging the atomic nucleus. UPCs offer a pristine electromagnetic pathway: coherent vector-meson photoproduction generates patterns of diffraction and two-source interference that directly encode the nuclear spatial density. Turning these patterns into quantitative constraints is, however, a challenging inverse problem, complicated by correlated sensitivities to deformation and neutron skin, phase smearing, and experimental backgrounds. Here we introduce an interpretable Multitask deep-learning framework that maps transverse momentum distributions to multiple nuclear-structure indicators simultaneously and identifies the kinematic regions driving each inference. We demonstrate the approach with coherent $J/\psi$ photoproduction in $^{96}_{40}\text{Zr} + {}^{96}_{40}\text{Zr}$ collisions, showing that the learned features separate diffraction-dominated and interference-dominated information and provide analysis-ready observables for future high-luminosity data.

[LG-213] Incremental Learning in Mirror Flows

Link: https://arxiv.org/abs/2606.23198
Authors: Raphaël Berthier, Loucas Pillaud-Vivien
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study mirror flows generated by a convex quadratic loss and a general convex lower semicontinuous mirror potential. We show that, when initialized near the boundary of the domain of the mirror potential, their rescaled trajectories converge to a limiting mirror flow whose potential is the indicator function of the domain. In this limit, the primal variable minimizes the loss over a time-dependent hypothesis set: the subdifferential of the support function of the domain, evaluated at the dual variable. This characterization provides a general mechanism for incremental learning in mirror flows.

[LG-214] LOLLA: Deep Reinforcement Learning for Closed-Loop Link Adaptation Towards a GPU-Accelerated AI-RAN

Link: https://arxiv.org/abs/2606.23110
Authors: Rui Wang, Linchao Zhang, Qiang Liu, Kun Yang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 14 pages, 7 figures

Abstract:Outer-loop link adaptation (OLLA) is widely deployed in 5G NR to track channel variations, yet its reliance on first-order, single-bit feedback degrades performance significantly under high-mobility and fast-varying channels. This paper presents LOLLA (Learned Outer-Loop Link Adaptation), a deep reinforcement learning framework that replaces the conventional OLLA staircase with a learned, continuous SINR offset conditioned on rich PHY/MAC telemetry inaccessible to OLLA. The offset modulates the SINR-to-MCS lookup table, preserving 3GPP-compliant MCS selection and provably subsuming the conventional OLLA update rule. A Proximal Policy Optimization (PPO) policy trained under a Lagrangian block error rate (BLER) constraint automatically enforces tunable reliability targets from 1% to 15% without manual penalty calibration. The framework is realized as the first closed-loop AI-native control dApp on a GPU-accelerated 5G NR stack, achieving end-to-end control latencies under 500 microseconds. Evaluations under 3GPP TDL channel models demonstrate 15% to 92% throughput gains over OLLA across Doppler frequencies up to 400 Hz, while attaining a Pareto frontier that strictly dominates OLLA across all evaluated reliability targets. The learned policy generalizes to unseen channel models and scales to eight concurrent UEs under shared-resource scheduling. In the uplink formulation, the gNB directly observes decoding outcomes, enabling simulation-to-deployment parity.
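
For readers unfamiliar with the "OLLA staircase" that the abstract above says LOLLA replaces, the following is a minimal sketch of the conventional single-bit update as it is commonly described. It is not taken from the paper. The names `olla_step` and `run_olla`, the default step size, and the sign convention (the offset is added to the reported SINR before the SINR-to-MCS lookup) are assumptions of this example.

```python
from typing import Iterable


def olla_step(offset_db: float, ack: bool, bler_target: float,
              step_down_db: float = 1.0) -> float:
    """Apply one conventional OLLA update to the SINR offset (in dB).

    Assumed sign convention: the offset is ADDED to the reported SINR,
    so an ACK raises it slightly and a NACK lowers it by a larger step.
    """
    if not 0.0 < bler_target < 1.0:
        raise ValueError("bler_target must lie strictly between 0 and 1")
    # With step_up / step_down = BLER / (1 - BLER), the expected drift is zero
    # exactly when the observed NACK rate equals the target BLER.
    step_up_db = step_down_db * bler_target / (1.0 - bler_target)
    return offset_db + step_up_db if ack else offset_db - step_down_db


def run_olla(acks: Iterable[bool], bler_target: float,
             step_down_db: float = 1.0, offset_db: float = 0.0) -> float:
    """Feed a sequence of ACK (True) / NACK (False) outcomes through OLLA."""
    for ack in acks:
        offset_db = olla_step(offset_db, ack, bler_target, step_down_db)
    return offset_db
```

The only input to each update is one ACK/NACK bit, which is the "first-order, single-bit feedback" limitation the abstract refers to. A learned policy can instead condition the offset on richer telemetry.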

[LG-215] Generalized nonparametric regression in reproducing kernel Hilbert spaces: Consistency and rates of convergence

Link: https://arxiv.org/abs/2606.22993
Authors: Ioannis Kalogridis
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We develop a comprehensive theory for regularized M-estimation in reproducing kernel Hilbert spaces. Under mild conditions on the loss we establish existence and measurability of the estimator, covering a wide range of convex and non-convex losses, including bounded robust losses. We further prove sharp rates of convergence with an explicit bias-variance decomposition governed by a novel complexity measure. We show that the variance is independent of misspecification, while the bias depends on a source condition parameter known in the learning literature. For tensor product Sobolev spaces we obtain new rates that connect to spaces of functions with dominating mixed smoothness, substantially extending existing results and explaining why these estimators circumvent the curse of dimensionality. Our methodology, combining elements from both functional analysis and empirical process theory, allows for an asymptotic linearisation of the objective function that avoids both closed-form solutions and global Lipschitz assumptions, and may be of independent interest. The estimators are implemented in C++ and theory is supported by numerical experiments.

[LG-216] Scalable Physics-Inspired Transformers for Spin Glasses

Link: https://arxiv.org/abs/2606.22984
Authors: Lu Zhong, Wenli Duan, Jing Liu, Pan Zhang, Ying Tang
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)

Abstract:Efficient sampling of the Boltzmann distribution in frustrated spin glasses is central to statistical mechanics and combinatorial optimization. Despite advances in machine-learning-based approaches, two issues persist: limited understanding of why variational models fail to benefit from increased scale, unlike the monotonic scaling law of large language models; and high computational cost on large systems that negates advantages over classical sampling methods. Here, we develop a physics-inspired transformer with interpretable sparse attention and spin-tailored positional embeddings to address these challenges. By further leveraging FlashAttention for parallel ancestral sampling, it achieves up to two orders of magnitude speedup over vanilla variational autoregressive networks, enabling neural-network simulations of spin-glass systems to unprecedented sizes on a single GPU. It can resolve full probability distributions, free energies, and overlap statistics across temperatures, for Sherrington-Kirkpatrick and 2D or 3D Edwards-Anderson models, where existing machine-learning methods encounter limitations at certain temperatures. This framework thus establishes a scalable paradigm for frustrated spin-glass systems.

[LG-217] Domain-incremental audio classification using domain-specific experts and prototype classifier

Link: https://arxiv.org/abs/2606.22952
Authors: Jongyeon Park, Do-Hyeon Lim, Sang-won Park, Hong Kook Kim, Kyungdeuk Ko, Hyeongcheol Geum, Jeong Eun Lim
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: DCASE 2026 Challenge Task 7, 4 pages

Abstract:This technical report presents our submission systems for Task 7 (domain-incremental audio classification) of the DCASE 2026 Challenge. The main obstacle is that the system has no access to data from past or future domains at any given stage. We approach domain-incremental learning (DIL) as a frozen-feature replay problem. At each incremental stage, one or two compact experts are trained and then kept fixed; at the final stage, the penultimate features from all frozen experts are concatenated and used to train a lightweight per-class prototype classifier solely on cached features. This design prevents catastrophic forgetting by preserving each expert model at inference. To retain earlier-domain knowledge without storing raw audio, some experts were trained with DeepInversion-based generative replay. A cross-stage regression imputer was trained to fill the expert feature slots that did not yet exist at an earlier stage. We submit four fully DIL-compliant systems: three systems based on diverse frozen five-expert backbones and their cross-stack ensemble, which achieves 78.15% micro / 77.03% macro on the development set, outperforming every individual backbone on both evaluations.

[LG-218] Tensor Train Decomposition-based 3D Implicit Full Waveform Inversion with Multi-scale Structural Similarity

Link: https://arxiv.org/abs/2606.22867
Authors: Liangsheng He, Chao Song, Tiansheng Chen, Tao Liu, Cai Liu
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)

Abstract:Three-dimensional full waveform inversion (3DFWI) is a powerful technique for reconstructing high-resolution subsurface velocity models. However, its application is often limited by high memory requirements, computational costs, and sensitivity to cycle skipping. To overcome these challenges, we propose a novel tensor train (TT) decomposition-based 3D implicit full waveform inversion framework (TT-3DIFWI) combined with a multi-scale structural similarity (M-SSIM) objective function. In this framework, the 3D velocity model is represented by TT decomposition as a product of a series of low-rank core tensors. Then, three axis-specific implicit neural network representations (INRs), each taking one-dimensional coordinate vectors as input, are constructed to predict these core tensors rather than predicting the velocity model directly. This TT-based INR reparameterization can significantly reduce the memory consumption of INR training while maintaining the accuracy and resolution of the 3D velocity model reconstruction. Meanwhile, the low-rank structure of TT decomposition also ensures the structural consistency of the reconstructed velocity, thereby improving the accuracy and continuity of the inversion result. Furthermore, the M-SSIM objective function compares the multi-scale structural differences between predicted and observed data and exploits ultra-low-frequency features to reduce cycle skipping. Numerical experiments on synthetic and challenging land datasets demonstrate that TT-3DIFWI with M-SSIM achieves accurate and continuous velocity reconstruction, even with poor initial models or missing low-frequency data.
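
The memory argument in the abstract above rests on the tensor-train format itself, which can be sketched independently of the paper. The example below shows how one entry of a 3D array is recovered from three TT cores and how many parameters the cores hold. It does not include the implicit neural representations or the inversion. The names `tt_entry` and `tt_param_count` and the core layout are assumptions chosen for this illustration.

```python
from typing import List

Matrix = List[List[float]]


def tt_entry(g1: Matrix, g2: List[Matrix], g3: Matrix, i: int, j: int, k: int) -> float:
    """Entry (i, j, k) of a 3-D tensor stored in tensor-train form.

    Assumed core layout:
      g1 has shape n1 x r1, indexed g1[i][a]
      g2 has shape r1 x n2 x r2, indexed g2[a][j][b]
      g3 has shape r2 x n3, indexed g3[b][k]
    The entry is the sum over a and b of g1[i][a] * g2[a][j][b] * g3[b][k].
    """
    r1, r2 = len(g2), len(g3)
    return sum(
        g1[i][a] * g2[a][j][b] * g3[b][k]
        for a in range(r1)
        for b in range(r2)
    )


def tt_param_count(n1: int, n2: int, n3: int, r1: int, r2: int) -> int:
    """Number of stored values in the three cores."""
    return n1 * r1 + r1 * n2 * r2 + r2 * n3
```

For a 100 x 100 x 100 grid with both TT ranks equal to 4, the cores hold 2,400 values, while the dense grid holds 1,000,000. That gap is what makes predicting cores cheaper than predicting the full model.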

[LG-219] Target-Aware Linear Regression Under Distribution Shift

Link: https://arxiv.org/abs/2606.22775
Authors: Zhewen Hou, Tian Zheng
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:Distribution shift between training and deployment is a pervasive challenge for modern AI systems. In many cases, the target marginals of covariates and response are known or specified through population-level observations, boundary conditions, properties of simulator configurations, or alignment-time distributional constraints. Such knowledge may provide valuable side information for regression estimation. We study this problem in the multivariate linear regression setting with a stable conditional mean E[Y\mid X] across source and target, and identify the hybrid-loss estimator, which jointly incorporates both target marginals, as a benchmark target-aware estimator. Its direct computation, however, requires solving a coupled nonlinear optimization that is expensive at scale. Our main contribution is to develop and evaluate two computationally tractable alternatives: a constrained moment-matching estimator and a two-stage estimator that augments ordinary least squares with a calibration step. For all three estimators, we derive and compare closed-form asymptotic mean squared errors, yielding conditions under which the tractable alternatives match or closely approximate the hybrid benchmark, and regimes in which they do not. Monte Carlo experiments across three controlled shift regimes validate the theoretical results, investigate the accuracy-runtime tradeoffs among the three estimators, and translate into guidance on estimator choice. In particular, the two-stage estimator nearly matches the hybrid benchmark in the high signal-to-noise regime at essentially no additional cost, providing theoretical grounding for empirical observations in nonlinear settings.

[LG-220] Statistical Inference for Misspecified Contextual Bandits

Link: https://arxiv.org/abs/2606.22639
Authors: Yongyi Guo, Ziping Xu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2509.06287

Abstract:Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment. Yet these advantages create challenges for statistical inference due to adaptivity. We study inference with contextual-bandit data without assuming a well-specified outcome model. In this setting, we show a previously overlooked issue: standard algorithms such as LinUCB may fail to stabilize under misspecified working models, leading to non-Gaussian estimator behavior and invalid inference. This issue is practically important, as misspecified working models – such as approximations of complex dynamical systems – are often employed by online agents in real-world adaptive experiments to balance reward, computational tractability, and robustness. We develop an inverse-probability-weighted Z-estimation framework for a broad class of marginal moment targets, including projection parameters, structural parameters with noisy contexts, and off-policy values. We identify a stability condition tailored to this framework, scaled inverse-propensity convergence, under which the IPW-Z estimator is consistent and asymptotically normal with a consistent sandwich variance estimator. We further establish sufficient conditions for scaled inverse-propensity convergence for several policy classes, including multi-armed bandit algorithms and smooth contextual allocation policies. Simulations and a HeartSteps V1 real-data-calibrated application show reliable coverage and competitive performance across multiple targets. Overall, our results highlight the importance of stability-aware adaptive design for valid post-experiment inference. 

[LG-221] Scalable Bayesian Additive Models for Stellar Flare Detection via Amortized Gaussian Process Inference and Hidden Markov Models

Link: https://arxiv.org/abs/2606.22601
Authors: Rodrigo Herrera, Vianey Leos-Barajas, Gwendolyn Eadie, Elizaveta Semenova, James Davenport
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments: Main paper: 19 pages, full paper: 34 pages. 4 appendices. 9 main figures, 21 figures in total. 4 tables. Poster Presenter, SSC 2026 (Statistical Society of Canada Annual Meeting) and ISBA 2026 (International Society for Bayesian Analysis World Meeting)

Abstract:Gaussian Processes (GPs) are a powerful tool for Bayesian time-series modeling, yet their cubic computational cost remains a severe barrier for application to long, high-cadence datasets in astronomy. While specialized scalable solvers like Celerite elegantly reduce this scaling to linear time, repeatedly evaluating the exact likelihood during iterative Bayesian sampling is a bottleneck for developing more complex models, like hierarchical or additive models in which Celerite is only one component. To make this inference computationally tractable, we introduce a generative surrogate framework. By utilizing a Variational Autoencoder (VAE) to learn a compressed representation of the Celerite prior, we map highly correlated stochastic dependencies into a low-dimensional, isotropic manifold. This transition completely bypasses exact covariance operations, shifting the computational burden to a rapid neural network forward pass. Through an extensive simulation study, we show that the generative surrogate accurately reproduces the structural fidelity of exact physical kernels like Celerite. Finally, we demonstrate embedding our VAE approximation into an additive model that combines Celerite and a hidden Markov model (HMM) for stellar flare detection in time series data of stars. We evaluate the joint VAE+HMM architecture against the exact Celerite+HMM framework on empirical astrophysical time series and demonstrate that the proposed methodology achieves significant reductions in computational time, enabling the rigorous, large-scale characterization of stellar flares across massive data archives.

[LG-222] Robust Diffusion Models via Divergence-Induced Weighted Denoising

Link: https://arxiv.org/abs/2606.22521
Authors: Lei Li, Yuexiao Dong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We show that replacing the standard MSE denoising loss in diffusion models with a nonlinear transformation induced by an f-divergence yields a simple robust training surrogate that empirically improves performance under data contamination, with small additional computational overhead. The theoretical foundation rests on a local divergence construction: under the Gaussian reverse-kernel structure of DDPM, each per-step likelihood ratio follows a lognormal distribution parameterized by a scalar mismatch, so the conditional f-divergence at each step reduces to a one-dimensional function of the denoising error. Summing these local divergences yields a training objective that unifies diffusion training as divergence-induced weighted denoising, where the derivative of the induced divergence acts as a residual-space influence weight that controls the contribution of each sample. Bounded-influence divergences (Hellinger, negative exponential) suppress large-error samples, with Hellinger yielding an explicit exponential weight, connecting the framework to robust M-estimation. Empirically, on CIFAR-10 under 30% contamination, NED reduces FID from 93.0 (KL) to 77.5, while also outperforming standard robust losses such as Huber and clipped MSE.

[LG-223] Null-Calibrated Conformal Selection via Target-Membership Scores

Link: https://arxiv.org/abs/2606.22336
Authors: Seungjin Choi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 24 pages

Abstract:Conformal selection aims to identify test candidates whose unknown responses fall in a target region while controlling the false discovery rate. Existing methods often inherit prediction-oriented nonconformity scores, such as residual or clipped residual scores, from conformal prediction. We argue that the natural score for selection is instead the target-membership probability. This score directly addresses the binary event being selected, and any monotone transform of it gives the Neyman–Pearson oracle ranking at a fixed null selection level. This distinction is irrelevant for mean-monotone targets, where conventional scores induce essentially the same ranking, but becomes important for interval-valued, variance-driven, multimodal, or multi-condition targets, where prediction-oriented scores can be misaligned with selection power. We study membership-score-based conformal selection and isolate one conformal calibration route, Null-Calibrated Conformal Selection (NCCS), which ranks test scores against confirmed non-target calibration examples. Under null exchangeability, NCCS yields finite-sample valid null p-values, which can be combined with BY under arbitrary dependence or with BH under standard positive-dependence conditions. Experiments support the score principle: membership scores match conventional scores on mean-monotone targets, substantially improve over mean-score selection on variance-driven targets, and, when calibrated by NCCS, trade power for finite-sample null validity in rare-target regimes where direct empirical-FDP thresholding can be anti-conservative.
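
The calibration route described in the abstract above, ranking test scores against non-target calibration examples and then applying a multiple-testing procedure, can be sketched generically as follows. This uses the standard conformal p-value formula and the standard Benjamini-Hochberg step-up rule. The paper's exact NCCS definition may differ, and the names `null_conformal_pvalues` and `benjamini_hochberg` are assumptions of this example.

```python
from typing import List, Sequence


def null_conformal_pvalues(null_scores: Sequence[float],
                           test_scores: Sequence[float]) -> List[float]:
    """Conformal p-value of each test score against null calibration scores.

    Larger scores are taken to mean stronger evidence of target membership,
    so the p-value counts null scores at least as large as the test score.
    """
    n = len(null_scores)
    return [
        (1 + sum(s >= t for s in null_scores)) / (n + 1)
        for t in test_scores
    ]


def benjamini_hochberg(pvalues: Sequence[float], alpha: float) -> List[int]:
    """Indices selected by the BH step-up procedure at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value passes its threshold.
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])
```

With n calibration scores the smallest attainable p-value is 1 / (n + 1), so a small calibration set limits how many candidates can ever be selected at a strict level.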

[LG-224] No Reference-Free Generalization in Quantum Machine Learning

Link: https://arxiv.org/abs/2606.22331
Authors: Jeongho Bang
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures

Abstract:Quantum machine learning is often motivated by the exponentially large state space of quantum systems, but this promise leaves a basic generalization problem unresolved: how can a learner assign different meanings to unseen quantum directions when the training data provide no preferred basis, measurement frame, or other orienting structure? We address this identifiability problem by formulating supervised learning without an external quantum reference frame, so that predictions cannot depend on an arbitrary choice of Hilbert-space coordinates. This requirement forces the learned classifier to preserve every unitary symmetry left unbroken by the training data. We prove that whenever the training states fail to span the full Hilbert space, all pure states orthogonal to their span must receive the same prediction – even when those states are mutually orthogonal and perfectly distinguishable once an appropriate measurement is supplied. The limitation is therefore not caused by state discrimination, optimization, or computational power, but by missing reference information. We further establish a robust version under weak symmetry breaking and show that learning generic unstructured concepts on multiqubit systems requires exponentially many independently oriented training directions. Numerical illustrations visualize the resulting prediction collapse and its controlled relaxation. Our results identify feature maps, measurement bases, Hamiltonians, locality, symmetry priors, architectures, and sufficiently diverse training states as operational resources for generalization. The central implication is that Hilbert-space dimension alone is not a learnable feature space: successful QML must specify the physical structure that gives unseen quantum directions semantic meaning.

[LG-225] Adam Converges in Nonsmooth Nonconvex Optimization

Link: https://arxiv.org/abs/2606.22326
Authors: Zijian Liu
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Adam is one of the most widely implemented and influential modern optimizers. Why is it effective across different optimization problems in practice? This question arguably lies at the center of the optimization community over the last decade and has motivated a substantial body of work aimed at understanding its convergence behavior. However, existing studies have mainly focused on the convergence rate of Adam in smooth nonconvex optimization, which unfortunately does not adequately capture practical settings, since many real-world problems are nonsmooth, such as those arising in training neural networks. Thus, these studies cannot fully explain the popularity and empirical success of Adam. Recently, an insightful and powerful framework called Online-to-Nonconvex Conversion has opened a new way to analyze Adam for nonsmooth nonconvex optimization. Unfortunately, prior works along this line share two common limitations. First, all of them ignore the important bias-correction term in the original Adam algorithm. Second and more importantly, many of them require extra operations that are not used in Adam, such as a clipping step. Therefore, the convergence guarantee for the original Adam method still remains unclear. In this work, we present the first finite-time analysis for the classical form of Adam, i.e., with the bias-correction step and without further algorithmic modifications, and prove that a randomly scaled learning rate ensures a convergence rate of $1/T^{\frac{2}{13}}$ for nonsmooth nonconvex optimization. Moreover, our result provably applies to the modern heavy-tailed noise regime, which is closer to practice. Interestingly, our theory is established under the parameter choice $\beta_1=\beta_2$, aligning with the recent empirical studies.
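
As a reminder of what "the classical form of Adam" with bias correction looks like, here is a minimal scalar sketch of the standard update. It uses a fixed learning rate, whereas the abstract above analyzes a randomly scaled one, so it should not be read as the analyzed algorithm. The function name `adam_minimize` and the default values are assumptions for this example. The defaults set `beta1 == beta2` only to mirror the parameter choice mentioned in the abstract.

```python
import math
from typing import Callable


def adam_minimize(grad: Callable[[float], float], x0: float, lr: float = 0.1,
                  beta1: float = 0.9, beta2: float = 0.9, eps: float = 1e-8,
                  steps: int = 200) -> float:
    """Run standard Adam (with bias correction, no clipping) on a scalar problem."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        # Exponential moving averages of the gradient and its square.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

On the nonsmooth function |x|, a subgradient such as `lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)` can be passed as `grad`. With bias correction the very first step has magnitude close to `lr` regardless of the gradient's scale.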

[LG-226] Convergence Analysis of Nyström Subsampling in Covariate Shift Adaptation for Misspecified case

Link: https://arxiv.org/abs/2606.22259
Authors: Hanna Myleiko, Sergei Solodky, Vasyl Semenov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 24 pages, 3 figures

Abstract:This paper investigates convergence properties of regularized Nyström subsampling applied to the unsupervised domain adaptation problem under covariate shift. We focus on the low-smoothness (misspecified) case where the target function lies outside the reproducing kernel Hilbert space. By combining Tikhonov regularization with Nyström projection onto a subsampled subspace, we obtain upper bounds on the excess risk that hold with high probability and are expressed in terms of the source condition, the effective dimension, and the sample sizes. We further extend the analysis to the setting where the Radon-Nikodym derivative between the target and source marginal distributions is unknown and must be approximated, and we identify the minimal additional sample sizes required to maintain the same convergence rate as in the oracle case.

[LG-227] Variance-Tilted Diffusion Models for Diverse Sampling ICML

Link: https://arxiv.org/abs/2606.22239
Authors: Iskander Azangulov, Leo Zhang, Kianoosh Ashouritaklimi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at SPIGM @ ICML workshop 2026

Abstract:Diffusion models are typically sampled independently, even when the downstream objective is to obtain a diverse set of candidates. We introduce a variance-weighted batch distribution that favours collections of samples with large empirical spread after a prescribed linear feature map. The target is specified explicitly, and the sampler is derived as the corresponding Doob $h$-transform of independent diffusion dynamics. The resulting correction has a compact form: an interaction term that repels posterior denoised means, together with a curvature term that moves particles to the region of higher feature variance. This yields an interacting-particle sampler with a transparent probabilistic target rather than a heuristic repulsive drift.

[LG-228] Deep RL for Fast Long-Horizon Operations Scheduling on NASA's Carruthers Geocorona Observatory Mission ICAPS2026

Link: https://arxiv.org/abs/2606.22159
Authors: Alex Zhang, Jackson Craig, Lara Waldrop
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
Comments: Preprint, peer-reviewed, will be part of ICAPS 2026 conference proceedings

Abstract:Spacecraft operations scheduling is a highly constrained, long-horizon combinatorial optimization problem that traditionally relies on heuristics, constraint programming, or manual planning. We present a scalable deep reinforcement learning framework developed and deployed for NASA’s Carruthers Geocorona Observatory mission. Our framework introduces a macro-action abstraction known as activity blocks coupled with dynamic action-masking to navigate the intractably large search space and strictly enforce complex power, thermal, and instrument constraints. The resulting architecture generates globally feasible schedules with overwhelming probability, establishes operational trust, and executes a full training cycle in under six hours, circumventing the need for policy robustness by enabling rapid, on-demand retraining. Further, resulting schedules outperform baseline heuristics in scheduled science quality. The deep reinforcement learning framework was deployed as the default operational scheduler for the Carruthers Geocorona Observatory mission from the outset of the mission, demonstrating that deep reinforcement learning can be trusted for real spacecraft operations under complex, evolving constraints.

[LG-229] Patched Flow Matching: Generative Wall-Pressure Reconstruction Beyond Training-Domain Scales from Sparse Sensors

Link: https://arxiv.org/abs/2606.22084
Authors: Meet Hemant Parikh, Yi Liu, Jian-Xun Wang
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Abstract:Characterizing the complete wall-pressure spectrum in turbulent wall-bounded flows requires simultaneous access to the viscous-scale high-wavenumber content and the outer-layer low-wavenumber content – a requirement that neither short-domain direct numerical simulation (DNS) nor sparse experimental measurements alone can satisfy. We propose Patched Flow Matching (Patched FM), a generative framework that fuses these two complementary sources by learning a patch-local prior over inner-scaled wall-pressure statistics from short-domain DNS and assimilating sparse sensor measurements at inference time through training-free posterior sampling. The patch-additive decomposition of the flow matching vector field decouples the generative prior from the global domain size, enabling reconstruction on domains arbitrarily larger than the training configuration. By expressing the patch prior in inner-scaled coordinates, where high-wavenumber wall-pressure statistics are approximately Reynolds-number invariant, the framework extends to higher Reynolds numbers through hierarchical transfer learning with as few as 500 short-domain snapshots ( 2.5% of the base training data) at a fraction of the scratch-training cost. Applied to compressible channel-flow DNS at Re_\tau = 180 , 500 , and 1000 , Patched FM reconstructs full-resolution wall-pressure fields on a domain four times larger than the training configuration ( L_x^L = 16\pi\delta versus L_x^S = 4\pi\delta ) from sensor coverage as low as 0.25% , recovering the low-wavenumber spectral content inaccessible to short-domain DNS with high fidelity in both streamwise and spanwise directions. Zero-shot generalization to unseen Reynolds numbers and ablation studies further confirm the role of inner scaling as a physical prerequisite for data-efficient Reynolds-number transfer.

[LG-230] Anticipating the Optimism Gap: Predicting Distribution-Shift Degradation of RF-Impairment Detectors from In-Distribution Statistics

Link: https://arxiv.org/abs/2606.22054
Authors: Chakshu Baweja
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures

Abstract:Detectors for GNSS radio-frequency impairments (jamming, spoofing, multipath) are usually reported with a single AUC measured on the distribution they were tuned on. That number falls once conditions move, and the size of the drop is rarely known in advance because labelled field data is scarce. We ask whether this optimism can be predicted before any out-of-distribution data is seen. On an open, parameter-grounded synthetic testbed with a tunable severity shift, we evaluate thirteen detectors (five physics baselines, full-feature logistic regression and multilayer perceptrons, and single-feature learned controls) across four impairment classes. The optimism gap, the difference between in-distribution and shifted AUC, grows monotonically as the shift deepens (mean Spearman correlation 0.50). It is driven by how many observables a detector uses rather than by whether it is learned, and it varies systematically by class. Centrally, a ridge model built only from in-distribution score statistics predicts the gap for a detector it has never seen (R^2 = 0.47) and for an impairment class it has never seen (R^2 = 0.46); both are significant against a 2000-fold permutation null (p < 0.001) and survive removing the feature that is, by construction, part of the target. The headline findings are synthetic. We then run the pre-registered protocol on three open field corpora: on Jammertest 2024 the cross-detector prediction holds (R^2 = 0.11, p = 0.009), and on SatGrid, whose spoofer power sweep gives a calibrated severity axis, in-distribution AUC overstates higher-severity AUC by up to 0.22 and to the point of sign inversion, with in-distribution AUC and realised gap perfectly rank-correlated (Spearman rho = 1.0). The mechanism survives contact with real data, at smaller magnitude than in simulation. We release the testbed, a software-receiver front end, the ingest adapters and the protocol.

[LG-231] Prediction of Solar Flares Using Photospheric Magnetic Field Parameters with Deep Learning

Link: https://arxiv.org/abs/2606.21896
Authors: Yash Chaudhary, Jason T. L. Wang, Chunhui Xu, Yan Xu, Sen Zhang
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 7 pages, 7 figures

Abstract:Solar flares, particularly those of the M- and X-class, have a significant impact on human life because of their potential to disrupt critical infrastructure and communication systems on Earth. Accurate prediction of solar flares is crucial for mitigating these risks, but the black-box nature of conventional deep learning models used in flare prediction limits their trustworthiness and interpretability. In this paper, we propose a new approach to solar flare prediction using photospheric magnetic field parameters or features with deep learning. To improve model interpretability, we integrate explainable artificial intelligence (XAI) techniques, including SHapley Additive exPlanations (SHAP) and partial dependence plots (PDPs), into our prediction framework. XAI methods provide transparency by analyzing the importance and interactions of features used by our model. Specifically, SHAP values offer a global and local understanding of the features, while PDPs provide insights into feature-level trends. These techniques demonstrate the potential of XAI in deploying AI-driven solutions in high-impact applications such as solar flare prediction, paving the way for more informed decision-making in solar physics and space weather studies.

[LG-232] Signed Evidence Flow: Conflict-Aware and Stability-Calibrated Data Analysis

Link: https://arxiv.org/abs/2606.21875
Authors: Jeffery Opoku, David Banahene
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 44 pages, including figures and tables. Reproducibility code and supplementary materials are available

Abstract:Modern data analysis usually gives a prediction without showing whether the evidence behind it is clear, conflicting, or stable. Two cases can have the same fitted confidence even when one has mostly agreeing evidence and the other has strong support and strong opposition. We propose Signed Evidence Flow (SEF), which combines a fitted prediction rule with signed feature attributions to measure support, opposition, conflict, and perturbation stability. We prove that confidence determines conflict exactly when it also determines total evidence mass, derive the remaining conditional variance, and state when conflict can improve loss prediction beyond confidence and other audit variables. We also connect conflict to geometric decision fragility. Across healthcare, Covertype, black-box, finance, and ten external data sets, conflict sometimes separates risk among predictions that already appear confident. Cross-fitted tests show added error-ranking information beyond confidence and attribution entropy on several data sets, including two large finance tasks. The direction is not universal: in some tasks, low-conflict cases are riskier. We therefore introduce ScopeGate, a held-out permutation diagnostic that checks the direction before SEF is used for review triage. SEF is consequently an audit tool rather than a universal risk score: it describes evidence structure, while an independent calibration sample determines whether that structure is useful in the target population.

[LG-233] Bayesian three-dimensional seismic travel-time tomography for active- and passive-source seismic data using physics-informed neural network

Link: https://arxiv.org/abs/2606.21789
Authors: Ryoichiro Agata, Kazuya Shiraishi, Gou Fujie, Dan Bassett
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments:

Abstract:Accurate 3D seismic velocity modeling through seismic travel-time tomography using both active- and passive-source data provides critical underpinning models for seismicity monitoring and hazard assessment. Because travel-time tomography is an inherently ill-posed inverse problem, uncertainty quantification (UQ) of the estimated models using Bayesian methods is also important for reliable downstream interpretations and analyses. However, Bayesian inference for 3D tomography based on conventional grid-based representations faces the "curse of dimensionality" and severe computational bottlenecks. Consequently, rigorous Bayesian UQ for margin-wide 3D travel-time tomography has remained largely unexplored. In this study, we propose a meshless 3D Bayesian travel-time tomography method that combines physics-informed neural networks (PINNs) with a neural representation of the velocity structure, enabling tractable and data-efficient Bayesian inference through function-space particle-based variational inference. To efficiently integrate passive-source data into the Bayesian estimation of the velocity structure, we conduct analytical marginalization treating uncertain source parameters as nuisance parameters, with passive-source relocation carried out in post-processing. We validated the capability of our approach for 3D problems through synthetic experiments. Furthermore, we applied the method to a real-world dataset from marine active-source surveys and natural earthquakes off the Kii Peninsula, Nankai Trough. Our probabilistic 3D ensemble successfully resolves key geological features and provides data-consistent uncertainty maps. The posterior mean hypocenters shifted mainly in the vertical direction by 10-15 km, consistent with a previous relocation result. Finally, the neural representation drastically reduces storage requirements for the entire ensemble velocity model, highlighting the scalability and data efficiency of the proposed framework.

[LG-234] Physics-Preserving Latent Compression for Zero-Shot Resolution Transfer in 3D Turbulence

Link: https://arxiv.org/abs/2606.21781
Authors: Yilong Dai, Yiming Sun, Yiheng Chen, Ziyi Wang, Shengyu Chen, Xiaowei Jia, Runlong Yu
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, 4 tables

Abstract:High-resolution turbulence modeling is essential for scientific computing, but remains constrained by the cost of direct numerical simulation and the scarcity of full-resolution data. Existing scientific compressors reduce storage but typically operate on per-frame representations, whereas learned compressors yield compact latents that are often resolution-dependent and weakly aligned with the physics of turbulence. This raises the need for a compression framework that reduces data size, preserves physical diagnostics, and transfers from low-resolution training fields to high-resolution test fields without retraining. In this paper, we propose Physics-Preserving Latent Compression (PPLC), a patch-local latent compressor for three-dimensional turbulence. Motivated by inertial-range scale similarity, PPLC treats fixed-size patches as transferable units and applies a shared variational autoencoder independently of the global grid size. It combines exact mean preservation, zero-mean fluctuation encoding, an invertible Haar wavelet front-end, shift-consistency regularization, and overlap-aware reconstruction. Instantiated on forced isotropic turbulence, PPLC is trained only on stride-downsampled 256^3 fields and transfers zero-shot to 1024^3 fields. Experiments show that PPLC improves the balance between reconstruction accuracy and physical fidelity over classical and learned baselines, keeping diagnostics such as dissipation, enstrophy, energy spectra, and incompressibility closer to the ground truth. Beyond turbulence compression, PPLC offers a general strategy for physics-preserving latent representations that support data-efficient scientific surrogate modeling.

[LG-235] Physics-Informed Neural Networks for Computing the Morse Index of the Critical Catenoid

Link: https://arxiv.org/abs/2606.21725
Authors: Miraj Samarakkody
Subjects: Differential Geometry (math.DG); Machine Learning (cs.LG); Numerical Analysis (math.NA); Spectral Theory (math.SP)
Comments:

Abstract:The Morse index of a free boundary minimal surface is encoded in its Jacobi-Steklov spectrum, and we test how faithfully a physics-informed neural network (PINN) reproduces that spectrum on a problem whose answer is already known in closed form. The benchmark is the critical catenoid in the unit ball \mathbb{B}^3, where it is well known that the Morse index equals 4 and the nullity equals 2. Separating the angular variable reduces the eigenvalue problem to a family of one-dimensional Robin problems on [-T,T], one for each Fourier mode. A network that enforces the parity of each mode by construction, and carries the eigenvalue as a trainable parameter, returns the three eigenvalues below the stability threshold to within 10^{-6} to 10^{-4} of their exact values, with PDE residuals of order 10^{-4}; assembling them recovers the index 4 and the nullity 2. We then track the spectrum along a one-parameter homotopy joining a flat reference operator to the catenoid Jacobi operator and identify the crossings at which the index changes. Since the critical catenoid is rigid, a fact we prove, this homotopy deforms operators rather than surfaces. We close by explaining how the same pipeline, with its one-dimensional solver replaced by a two-dimensional one, is poised to address genuinely geometric families in ellipsoidal balls, where the boundary curvature is no longer constant, and the Morse index is not yet known.

[LG-236] ARCO-Mars: A Unified Cloud-Optimized Archive of Mars Atmosphere Reanalysis

Link: https://arxiv.org/abs/2606.21701
Authors: Ananyo Bhattacharya
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract:Long-term records of the Martian atmosphere based on general circulation models and reanalysis of atmospheric state variables are important to understand the diurnal, seasonal, and climatological changes of the planet. Atmospheric dynamics of the Martian atmosphere are strongly influenced by the characterization of dust lifting, solar insolation, and spatial variations in topography. We present ARCO-Mars, a unified Analysis-Ready Cloud-Optimized dataset providing integrated access to three independent Mars atmospheric reanalysis products: EMARS, MACDA, and OpenMARS spanning over Mars Years 24-35. These reanalyses assimilate thermal infrared retrievals from the MGS/TES, ODY/THEMIS, and MRO/MCS instruments, providing both two and three-dimensional surface and atmospheric state variables, including temperature, winds, surface pressure, and dust optical depth. The dataset is stored in Zarr v3 format and hosted on HuggingFace, enabling efficient cloud-based access without requiring local storage of the full archive. We compare the state variables between the three reanalysis products to identify systematic differences, attributed to differences in data assimilation and general circulation models. ARCO-Mars provides a community resource for Mars atmospheric science, numerical weather prediction validation, and machine learning applications, including weather forecasting and data assimilation.

[LG-237] Finite-Sample Performance of Gradient Descent in Logistic Regression with Gaussian Design

Link: https://arxiv.org/abs/2606.21683
Authors: Junren Chen, Arya Mazumdar
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:We consider the parameter estimation problem in logistic regression with Gaussian design: the estimation of a fixed unknown parameter \theta^* \in \mathbb{R}^d (\|\theta^*\|_2 \ge 1) from n i.i.d. samples (x_i, y_i)_{i=1}^n, where x_i \sim N(0, I_d) and y_i | x_i \sim \mathrm{Bernoulli}(1/(1+\exp(-x_i^\top \theta^*))). Our main aim is to characterize the finite-sample estimation performance and convergence behavior of gradient descent (GD) on the maximum likelihood objective (i.e., the logistic loss). Under small O(1) stepsize and 0 initialization, we show that GD linearly converges to a small neighborhood of \theta^* achieving an \ell_2 error of order O(\sqrt{\|\theta^*\|_2^5 d/n}). This substantially goes beyond existing theoretical results that lack non-asymptotic estimation error rate and exhibit much slower parameter convergence. We also establish a faster local linear convergence to the same statistical error under a large \Theta(\|\theta^*\|_2) stepsize. The main technical component is to show that the gradient of the logistic loss satisfies a certain approximate invertibility condition (AIC). To that end, we uniformly control the deviation of the gradient from its population counterpart by covering and peeling arguments, and then show that the population GD is a contraction by a delicate analysis based on the eigenvalues of population Hessian matrices. Finally, we build upon the recent work Matsumoto and Mazumdar (2025) and devise a novel efficient estimator that attains a sharper rate in high dimensions. This indicates that the existing non-asymptotic guarantees exhibit sub-optimal dependence on \|\theta^*\|_2, and that in many regimes \Theta(\sqrt{\|\theta^*\|_2 d/n}) is the tight estimation error rate. Numerical examples are provided to corroborate our theoretical results.
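
The setting studied in this abstract (Gaussian design, Bernoulli labels, gradient descent on the logistic loss from zero initialization) is easy to simulate. The following is a minimal sketch under assumed values: the dimension, sample size, stepsize, iteration count, and the particular ground-truth vector are illustrative choices, not taken from the paper, and the sketch makes no attempt to reproduce the paper's rates.

```python
import numpy as np

def logistic_gd(X, y, step=0.5, iters=500):
    """Gradient descent on the average logistic loss, started from zero."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # model probabilities P(y = 1 | x)
        grad = X.T @ (p - y) / n              # gradient of the mean negative log-likelihood
        theta -= step * grad
    return theta

rng = np.random.default_rng(0)
n, d = 20000, 5                               # assumed sample size and dimension
theta_star = np.zeros(d)
theta_star[0] = 1.5                           # hypothetical ground truth with norm >= 1
X = rng.standard_normal((n, d))               # Gaussian design, x_i ~ N(0, I_d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)

theta_hat = logistic_gd(X, y)
err = np.linalg.norm(theta_hat - theta_star)  # l2 estimation error
print(err)
```

Rerunning with different `n` or a larger norm of `theta_star` gives an informal feel for how the estimation error depends on the sample size and the signal strength.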

[LG-238] Accelerated and Stable Convergence with Anchored Optimistic Method

Link: https://arxiv.org/abs/2606.21528
Authors: Motahareh Sohrabi, Jianxin You, Simon Lacoste-Julien, Eduard Gorbunov, Gauthier Gidel
Subjects: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:We study first-order methods for solving monotone variational inequalities arising in min-max optimization. Classical approaches such as the extragradient method rely on two gradient queries per iteration, which limits their analysis and applicability in the online and stochastic settings. We propose a family of Generalized Optimistic Methods with Anchoring (GOMA), which combine two-time-scale optimistic updates with an anchoring term inspired by Halpern iteration. In the deterministic setting, GOMA achieves the optimal accelerated last-iterate rate O(1/k^2) on the squared gradient norm for monotone Lipschitz operators. In the stochastic setting with unbounded variance, a simplified single-call variant of GOMA achieves a last-iterate convergence rate of O(1/\sqrt{k}) on the squared gradient norm. To the best of our knowledge, this is the first such guarantee for stochastic monotone Lipschitz variational inequalities in the unconstrained setting without variance reduction or growing batches.
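
For readers unfamiliar with optimistic updates, the sketch below shows the classical optimistic gradient method on the bilinear game min_x max_y xy. It is not GOMA: there is no anchoring term and no two-time-scale structure, and the stepsize, starting point, and iteration count are assumed for illustration. The only point is that the optimistic update reuses the previous operator value and therefore needs a single new operator query per iteration, while plain gradient descent-ascent diverges on the same problem.

```python
import numpy as np

def operator(z):
    """Monotone operator of the bilinear game min_x max_y x*y: F(x, y) = (y, -x)."""
    x, y = z
    return np.array([y, -x])

def gradient_descent_ascent(z0, step, iters):
    z = np.array(z0, dtype=float)
    for _ in range(iters):
        z = z - step * operator(z)
    return z

def optimistic_gradient(z0, step, iters):
    z = np.array(z0, dtype=float)
    prev = operator(z)                     # operator value stored from the previous iterate
    for _ in range(iters):
        cur = operator(z)                  # the only new operator query in this iteration
        z = z - step * (2.0 * cur - prev)  # extrapolate with the stored past value
        prev = cur
    return z

z0 = [1.0, 1.0]
z_gda = gradient_descent_ascent(z0, step=0.3, iters=500)
z_opt = optimistic_gradient(z0, step=0.3, iters=500)
print(np.linalg.norm(z_gda), np.linalg.norm(z_opt))
```

The unique solution of this game is the origin, so the distance of each final iterate from zero indicates whether the method converged.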

[LG-239] Central limit theorem for the averaged Adam optimizer

Link: https://arxiv.org/abs/2606.21433
Authors: Steffen Dereich, Arnulf Jentzen
Subjects: Probability (math.PR); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 37 pages

Abstract:In this article, we analyse convergence of the averaged Adam optimizer to an attracting zero of the Adam vector field. We provide a central limit theorem that, in particular, quantifies exactly the speed of convergence. The order of convergence is n^{-1/2} in the number of steps of the algorithm, which coincides with the order observed for classical stochastic approximation algorithms. The covariance in the central limit theorem is given in terms of properties of the Adam algorithm in the state of the attractor.

[LG-240] Subsampling for supervised learning in reproducing kernel Hilbert spaces

Link: https://arxiv.org/abs/2606.21260
Authors: Eyal Vayness, Maxime Sangnier
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:In the era of big data, subsampling has become a common practice in statistical learning. By selecting a subgroup of individuals on which the learner is trained, subsampling aims at reducing the computational cost and time of the estimation step, and ideally leads to a decrease of its energy consumption and carbon footprint. This work focuses on a nonparametric setting, in which the hypothesis set lies in a reproducing kernel Hilbert space, and the estimator is a minimizer of an empirical risk reweighted à la Horvitz-Thompson. By studying the asymptotic properties of this estimator, we reveal an optimal subsampling scheme (regarding the trace of the covariance operator) and show that it can be used via plug-in. A numerical study on synthetic and real-world datasets shows the practicability and the benefit of the proposed approach.
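
A Horvitz-Thompson reweighted empirical risk, as described in this abstract, weights each retained sample by the inverse of its inclusion probability. The sketch below applies that idea to kernel ridge regression with a squared loss. It is an illustration under stated assumptions, not the paper's estimator or its optimal scheme: the RBF kernel, bandwidth, regularization value, toy data, and the uniform inclusion probabilities `pi` are all hypothetical choices.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-sq.sum(-1) / (2.0 * bandwidth ** 2))

def weighted_krr(X, y, weights, n_total, lam=1e-3):
    """Minimise (1/n_total) * sum_i w_i (y_i - f(x_i))^2 + lam * ||f||^2 over the RKHS."""
    K = rbf_kernel(X, X)
    W = np.diag(weights)
    # First-order condition of the weighted objective: (W K + n_total * lam * I) alpha = W y.
    alpha = np.linalg.solve(W @ K + n_total * lam * np.eye(len(y)), W @ y)
    return lambda X_new: rbf_kernel(X_new, X) @ alpha

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Hypothetical Poisson subsampling: point i is kept independently with probability pi[i].
pi = np.full(n, 0.25)
keep = rng.random(n) < pi
f_sub = weighted_krr(X[keep], y[keep], 1.0 / pi[keep], n_total=n)  # weights 1 / pi_i
f_full = weighted_krr(X, y, np.ones(n), n_total=n)                 # no subsampling
```

With all inclusion probabilities equal to one, the weighted solver reduces to ordinary kernel ridge regression on the full sample, which gives a simple sanity check. Non-uniform `pi` can be passed in the same way.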

[LG-241] Orthogonal Discrepancy Kernels for Learning with Partial Physics

Link: https://arxiv.org/abs/2606.21199
Authors: Swapnil Manna, Timothy J. Rogers, Lawrence Bull
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:We introduce a semi-parametric framework for nonlinear system identification, which decouples discrepancy functions from physics-based components. Orthogonal Gaussian process regression balances sparse parameter selection (the white box) with discrepancy learning (the black box) to produce interpretable models from incomplete physics.

[LG-242] Two Layers of Instability in Causal Estimation

Link: https://arxiv.org/abs/2606.21185
Authors: Alexis Bellot
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:There is a precise sense in which drawing causal inferences from observational data is hard, even when identifiability is assumed. In particular, Robins and Ritov (1997) and Robins et al. (2003) showed that causal effects can be discontinuous as a function of the data distribution: two arbitrarily close data distributions might correspond to different causal effects. This is a fact independent of the choice of estimator; however, not all estimators are equally unstable. Our contribution is to surface a second layer of instability that depends on the choice of estimator. We show that many standard point estimates can be read as point summaries of multimodal distributions over the space of structural causal models. As such, estimators can jump discontinuously in the data distribution. This defines a taxonomy of estimators that admits a decision-theoretic reading: stability depends on whether the implicit loss function an estimator optimizes is aligned with the causal effect itself. Specifically, inverse propensity weighted estimators and regression estimators are examples of discontinuous summaries, while explicit posterior means and medians are shown to be continuous.

[LG-243] DUET: Decentralized Bilevel Optimization without Lower-Level Strong Convexity ICLR2025

Link: https://arxiv.org/abs/2606.21153
Authors: Zhen Qin, Zhuqing Liu, Songtao Lu, Yingbin Liang, Jia Liu
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: Published as a conference paper at ICLR 2025

Abstract:Decentralized bilevel optimization (DBO) provides a powerful framework for multi-agent systems to solve local bilevel tasks in a decentralized fashion without the need for a central server. However, most existing DBO methods rely on lower-level strong convexity (LLSC) to guarantee unique solutions and a well-defined hypergradient for stationarity measure, hindering their applicability in many practical scenarios not satisfying LLSC. To overcome this limitation, we introduce a new single-loop DBO algorithm called diminishing quadratically-regularized bilevel decentralized optimization (DUET), which eliminates the need for LLSC by introducing a diminishing quadratic regularization to the lower-level (LL) objective. We show that DUET achieves an iteration complexity of O(1/T^{1-5p-\frac{11}{4}\tau}) for approximate KKT-stationary point convergence under relaxed assumptions, where p and \tau are control parameters for LL learning rate and averaging, respectively. In addition, our DUET algorithm incorporates gradient tracking to address data heterogeneity, a key challenge in DBO settings. To the best of our knowledge, this is the first work to tackle DBO without LLSC under decentralized settings with data heterogeneity. Numerical experiments validate the theoretical findings and demonstrate the practical effectiveness of our proposed algorithms.

[LG-244] Bayesian Model Averaging under Predictor Redundancy via Density-Ratio Posterior Compression

Link: https://arxiv.org/abs/2606.21080
Authors: Hanqing Li, Xuewen Lu, Yuting Chen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 48 pages, 6 figures. The manuscript uses the JMLR style file. Source code and reproducibility materials are available at this https URL

Abstract:Bayesian model averaging in support-indexed regression induces a posterior distribution over active predictor supports. Under predictor redundancy, posterior mass can spread across many nearly interchangeable supports, making exact-support summaries unstable or hard to interpret even when prediction is stable. We study how to report an already fitted Bayesian model averaging posterior without changing the Bayesian target. A report uses hard or soft regions of support space, and its compressed reporting law is compared with the reference posterior through an explicit density ratio. This ratio gives computable total-variation and Kullback–Leibler distortion, bounds for bounded predictive summaries, retained-mass diagnostics, and fallback-weight diagnostics. The framework covers fixed hard regions, metric-ball regions, posterior-cluster regions, and pooled-pruned region dictionaries. We prove exact error formulas and validation bounds for these region reports, and give conditions under which a few regions can replace a long list of individual supports. In simulations, our region reports often give shorter and clearer summaries while preserving the main posterior information, and the density-ratio diagnostics show when too much information has been lost.

[LG-245] Diffusion-Driven State Space Models

Link: https://arxiv.org/abs/2606.21036
Authors: Jack Ruder, Michael Wojnowicz
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted to ProbML 2026 Workshop Track

Abstract:In many domains, practitioners seek models that produce accurate forecasts while faithfully capturing latent system dynamics. Existing approaches typically sacrifice one of these goals: deep state space models often assume Gaussian latent transitions, limiting fit and forecasting, while diffusion models are highly expressive but lack principled inference for the underlying dynamics. To combine the strengths of both, we introduce the Diffusion-Driven State Space Model (DDSSM), which replaces the conventional Gaussian transition distribution with a diffusion model. Our DDSSM resolves the open problem of how to jointly train an autoencoder and a diffusion model on sequential data, thereby extending the literature on latent diffusion models for time series. Moreover, we find that the DDSSM empirically outperforms a state-of-the-art deep SSM at fitting and forecasting a simulated time series with multimodal transitions.

[LG-246] Adversarial observations in probabilistic State-Space Models for robust Reinforcement Learning

Link: https://arxiv.org/abs/2606.20880
Authors: M. Santos-Pascual, D. Ríos Insua
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 29 pages, 7 figures, PREPRINT ONGOING

Abstract:Decision-making under partial or adversarial observability requires accurate inference of the environment’s latent state and its associated uncertainty. This work analyses adversarial attacks on linear probabilistic state-space models, commonly integrated within reinforcement learning architectures, where the attacker alters observations under likelihood constraints that ensure the perturbations remain consistent. We analyse how such adversarial yet realistic observation shifts influence the latent state and, in turn, policy decisions. This perspective provides a principled pathway toward building more robust reinforcement learning systems, with direct relevance to safety-critical domains such as robotics, where reliable operation under sensor noise, partial failures, and adversarial conditions is essential.

[LG-247] Betting on Moments: Legendre Jumper Martingales for Online Exchangeability Testing

Link: https://arxiv.org/abs/2606.20859
Authors: Johan Hallberg Szabadváry
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 18 pages, 9 figures

Abstract:We present a family of conformal test martingales based on shifted Legendre polynomials, which extends the Simple Jumper martingale. The Simple Legendre Jumper substitutes the linear betting function with a polynomial of arbitrary degree, thereby facilitating the detection of variance, skewness, and higher-order deviations from uniformity; the standard Simple Jumper is a specific instance of degree one. The Product Legendre Jumper integrates multiple polynomial degrees into a unified betting function, although its state space expands exponentially-a cost we refer to as the jumping tax. To address this issue, we introduce the Variational Legendre Jumper, which factorises the joint adaptation through a mean-field approximation, thereby reducing exponential scaling to linear time with minimal loss in power. Lastly, the Composite Legendre Jumper incorporates several jumping rates, ensuring a wealth floor under exchangeability and automatic adaptation to the shift’s timescale. Empirical results from a real-world classification task demonstrate that the combined methods consistently surpass any single-degree martingale under distributional shift, and the composite variant is recommended as the default when the shift timescale is unknown.

[LG-248] ReLaTS: a Reinforcement Learning-based method for dynamically determining the coupling Time Step in multi-scale simulations of self-gravitating systems

Link: https://arxiv.org/abs/2606.20832
Authors: Veronica Saz Ulibarrena, Simon Portegies Zwart
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: Accepted for publication in RASTI

Abstract:Astrophysical simulations frequently address multi-scale, multi-physics problems through subsystem decomposition, problem-tailored integration schemes, and coupling on fixed manually set timescales. Here we introduce ReLaTS, a reinforcement learning framework that dynamically selects the coupling time step to optimize the trade-off between accuracy and computational cost. We validate ReLaTS on star clusters containing a planetary system, and test the method by varying the number of stars N_\star in the cluster and the number of planets (N_{\rm planet}) orbiting one of them. The method finds the optimal coupling time step that balances speed and accuracy without requiring expert knowledge. In addition, the trained network operates independently of the coupled N-body algorithms, displaying stable performance across a range of setups. We observe that the method is less reliable for cases with infinitesimal masses, as their contribution to the total energy is negligible compared to that of the massive bodies, and the network is not capable of recognizing potential errors generated while integrating them. For long-time integration of large N systems, the error accumulates. The reinforcement learning algorithm, however, manages to keep the energy error below a pre-set threshold. This approach substantially reduces energy errors relative to fixed-time step baselines without substantial additional computational overhead. Once trained, ReLaTS requires no expert tuning and generalizes across diverse astrophysical domains, enabling adaptive multi-scale simulations.

[LG-249] Dataset-Aware Cold-Start Active Learning for Annotation-Efficient 3D Medical Image Segmentation

Link: https://arxiv.org/abs/2606.20765
Authors: Rémi Hattat,Marine Beaumont,Charline Bertholdt,Gabriela Hossu,Olivier Morel,Bailiang Chen
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 20 pages, 3 figures, 4 tables. Supplementary material available as ancillary file

Abstract:Deep learning for 3D medical image segmentation requires extensive manual annotations, a major bottleneck in volumetric medical imaging. Active learning aims to reduce this burden by selecting informative samples for annotation, but most methods assume that an initial labeled set is already available. This leaves the cold-start problem largely unresolved: how to select the first volumes from a fully unlabeled pool before any task-specific model is trained. We propose CSCS, a Curriculum-Stratified Cold-Start framework that adapts initial sample selection to the structure of the unlabeled dataset. CSCS combines two self-supervised, label-free signals: local typicality, measuring representativeness in the embedding space, and reconstruction-based uncertainty, used as a proxy for sample difficulty. These signals are combined through a weighted geometric score, where the weighting is determined by a closed-form pacing rule based on the effective annotation budget and the Difficulty-Coverage Ratio, a pool-level statistic measuring the alignment between difficulty and representativeness. We evaluate CSCS on four 3D medical image segmentation benchmarks: BraTS, FeTA, Spleen, and an in-house fetal MRI dataset. Using nnU-Net as downstream segmentation model, CSCS shows consistently competitive performance across datasets and annotation budgets, with the strongest gains in low-to-mid annotation regimes. These results suggest that dataset-aware cold-start initialization can improve the robustness of active learning for 3D medical image segmentation by adapting sample selection to the geometry of the unlabeled pool.
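
The abstract above says the two label-free signals are combined through a weighted geometric score. The sketch below illustrates that combination only. It assumes both signals are already normalised to (0, 1] and that `weight` is the exponent on uncertainty; the paper's closed-form pacing rule for the weight is not given in the abstract, so `weight` is passed in by hand, and `geometric_score` and `select_cold_start` are hypothetical names.

```python
def geometric_score(typicality, uncertainty, weight):
    """Weighted geometric mean: typicality^(1 - weight) * uncertainty^weight."""
    return (typicality ** (1.0 - weight)) * (uncertainty ** weight)


def select_cold_start(typicality, uncertainty, weight, budget):
    """Return indices of the `budget` highest-scoring volumes (illustrative only)."""
    scores = [
        geometric_score(t, u, weight)
        for t, u in zip(typicality, uncertainty)
    ]
    # Rank volume indices from highest to lowest combined score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

With `weight=0` the selection follows typicality alone, and with `weight=1` it follows uncertainty alone, so the weight acts as a dial between representativeness and difficulty.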

[LG-250] Beyond Importance: Interchange-Sobol Sensitivity Reveals Task-Specific Content Channels in Transformer Components

Link: https://arxiv.org/abs/2606.20678
Authors: Yifeng Guo,Jin-Hong Du,Xiang Chen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Mechanistic interpretability methods summarize a transformer component by a single importance score, conflating two distinct roles: a component may matter because it transports task-relevant content, or because the forward computation degrades when its contribution is removed. We introduce Interchange-Group Sobol Decomposition (IGSD), a paired-intervention framework that compares matched activation replacement with zero ablation on the same component, estimates two Sobol-style variance indices, and uses their signed difference to separate the two roles, with intervention validity monitored by a symmetric off-manifold diagnostic $\widehat{\mathrm{ST1}}$. In factual recall, IGSD identifies an early-layer content channel in both GPT-2 small and Qwen2.5-1.5B that standard importance methods underestimate. A controlled subject and relation donor design shows that the early channel transports relation-frame content while late attention transports subject-retrieval content, refining at head granularity to the known $\mathrm{Attn}_{L9H8}$ head. Late-layer clamping confirms that the early signal is expressed through downstream transformations rather than residual pass-through. These results show that replacement and deletion are not interchangeable controls and their divergence provides a practical statistical diagnostic for content transport in transformer components.

[LG-251] Input-schema identifiability limits in physics-informed surrogates for mechanics-governed flow

Link: https://arxiv.org/abs/2606.20655
Authors: Daniel Cieslak,Andrzej Czyzewski
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: Main text with Supplementary Information included; 11 figures

Abstract:Physics-informed and data-driven surrogates are increasingly used to approximate mechanics-governed flow fields, but the target quantities assigned to such models are not always identifiable from the input variables available at prediction time. We introduce an input-schema identifiability certificate for computational surrogates. Starting from a reduced physical model, the certificate decomposes a target field into components that are measurable from geometry, components that require boundary-condition information, and components identifiable only up to a symmetry quotient. This yields a pre-training audit: it predicts which oracle-channel interventions should reduce error, which should fail, and which ambiguity cannot be removed by changing the architecture, loss, optimizer, or sample size. We instantiate the framework for incompressible tubular flow using a Cosserat-rod reduction, where lumen velocity separates into a mesh-measurable tangent direction, a boundary-condition-dependent magnitude, and a signed-orientation ambiguity. Controlled experiments on patient-specific aortic CFD geometries, analytic Womersley flows, and an advection-diffusion transfer problem confirm the predicted pattern: supplying signed direction collapses angular error to the oracle regime, whereas supplying magnitude without orientation leaves the predicted sign ambiguity and yields 16-33 percent per-node sign flips. The results provide a mechanics-based diagnostic for deciding whether a surrogate modelling task is physically identifiable before training, and expose failure modes that aggregate error metrics can hide.
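
The abstract above separates lumen velocity into a tangent direction, a magnitude, and a signed orientation, and reports per-node sign flips. The sketch below shows one plausible way to measure those two quantities, under the assumption that a sign flip means a negative dot product between predicted and reference velocity at a node. The abstract does not define the metric, so `sign_flip_rate` and `unsigned_angle_deg` are hypothetical helpers, not the authors' code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sign_flip_rate(pred, true):
    """Fraction of nodes whose predicted velocity points against the reference."""
    flips = sum(1 for p, t in zip(pred, true) if dot(p, t) < 0.0)
    return flips / len(pred)


def unsigned_angle_deg(p, t):
    """Angle between the two lines of direction, ignoring orientation (0 to 90 degrees)."""
    cos = abs(dot(p, t)) / (math.sqrt(dot(p, p)) * math.sqrt(dot(t, t)))
    # Clamp to guard against rounding slightly above 1.
    return math.degrees(math.acos(min(1.0, cos)))
```

A prediction that is exactly reversed scores an unsigned angle of 0 but still counts as a sign flip, which is the kind of failure an aggregate magnitude error can hide.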

Attachment Download

Click to download today's full paper list