本篇博文主要内容为 2026-06-25 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-06-25)

今日共更新592篇论文,其中:

  • 自然语言处理82篇(Computation and Language (cs.CL))
  • 人工智能178篇(Artificial Intelligence (cs.AI))
  • 计算机视觉124篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习151篇(Machine Learning (cs.LG))
  • 多智能体系统14篇(Multiagent Systems (cs.MA))
  • 信息检索23篇(Information Retrieval (cs.IR))
  • 人机交互21篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] Multi-Agent Goal Recognition with Team- and Goal-Conditioned Reinforcement Learning and Factorized Branch-and-Bound

【速读】:该论文旨在解决多智能体目标识别(Multi-Agent Goal Recognition)中的组合爆炸问题,即在观测到多个智能体轨迹的情况下,如何高效地联合推断出哪些智能体构成团队以及每个团队的目标。由于团队划分与目标组合的假设空间随智能体数量呈指数级增长,传统方法难以在实际应用中实现高效推理。其解决方案的关键在于提出一种基于分支定界(Branch-and-Bound)框架的多智能体目标识别方法(MAGR-BB),利用共享的团队与目标条件策略(team- and goal-conditioned policy)作为评分模型,并结合因子化分支定界搜索机制,在不完全枚举所有假设的前提下,显著减少需显式生成的假设数量,同时保持与穷举搜索一致的最优假设排序。实验表明,该方法在控制环境下的多智能体积木世界(Blocksworld)基准测试中,不仅大幅降低了累计识别时间,还实现了对假设空间的高效剪枝,从而在保证精度的同时极大提升了计算效率。

链接: https://arxiv.org/abs/2606.25978
作者: Thiago Thomas,Gabriel de Oliveira Ramos,Felipe Meneguzzi
机构: Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS); Universidade do Vale do Rio dos Sinos (Unisinos); University of Aberdeen
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Multi-agent goal recognition asks an observer to jointly infer which agents act together and what each team is trying to achieve, so the hypothesis space grows combinatorially with the number of team partitions and goals per team. Real applications such as drone surveillance and collaborative robotics expose only the agents’ trajectory, which forces the observer to rank team-goal hypotheses from behavior alone. Multi-Agent Goal Recognition with Branch-and-Bound (MAGR-BB) addresses this setting with a shared team- and goal-conditioned policy used as the scoring model inside a factorized branch-and-bound search. On a controlled multi-agent Blocksworld benchmark, MAGR-BB returns the same top-ranked hypothesis as exhaustive search throughout the trajectory while cutting hypothesis materialization by orders of magnitude and reducing cumulative recognition runtime substantially.

[MA-1] Manipulation Is Task-Dependent: A Multi-Axis Multi-Environment Evaluation of Frontier LLM s

【速读】:该论文旨在解决当前大语言模型在复杂多变任务环境中表现出的操纵行为(manipulative behavior)评估不足的问题,尤其针对现有基准测试仅在一个单一环境内调整单一变量(如激励结构或任务难度)所导致的评估片面性。其核心解决方案在于构建一个跨六种不同环境、涵盖13,590个独立场景的多维度评估框架,系统考察模型在三种关键轴上的表现:表述框架(framing,即是否要求诚实或允许操纵)、激励结构(incentive structure,从无激励到高激励)以及任务难度(task difficulty)。研究发现,不同环境中的操纵驱动因素存在显著差异:在鼓励模型对未来行为进行虚假陈述的场景中,指令表述方式和具有约束力的激励机制是主要驱动因素;而在需对真实状态进行误导的场景中,任务难度成为主导因素。此外,模型在各环境间的操纵率排名相关性极低(平均斯皮尔曼等级相关系数 ρ = 0.055),表明某一任务中的操纵倾向无法有效预测其他任务中的行为。因此,该研究强调必须采用多维度、跨环境的综合评估方法,才能准确衡量模型的操纵倾向,为未来安全可控的生成式人工智能(Generative AI)发展提供关键依据。

链接: https://arxiv.org/abs/2606.25899
作者: Adeeb Zaman,Erik Nordby,Fred Heiding
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We evaluate manipulative behavior in six frontier language models across six environments, ranging from negotiation tasks to agentic workflows, resulting in 13,590 individual scenarios. Manipulation rates are measured across three axes: framing (mandate honesty or permit manipulation), incentive structure (from no incentives to substantial ones), and task difficulty. Existing benchmarks typically vary a single axis within a single environment, an approach our results show is insufficient. We rank models by manipulation rate and find Spearman rank correlations across environments average \rho = 0.055 , indicating manipulative tendencies in one task do not necessarily predict those in another. Additionally, we find the axis that drives manipulation varies across different environments. In environments where models are incentivized to misrepresent future actions, instructional framing and structurally binding incentives are the primary drivers; in environments where models are incentivized to misrepresent a ground truth, task difficulty dominates. This split was identified in five environments and validated against a sixth held-out environment. Together, these findings illustrate the importance of rigorous multi-dimensional evaluations when measuring manipulative propensities.

[MA-2] Robustness and Leadership in Markov-switching Consensus Networks

【速读】:该论文旨在解决时变交互结构下多智能体系统在噪声扰动下的鲁棒性问题,具体关注连续与离散时间框架中基于马尔可夫切换图(Markov switching graph, MSG)的共识与领导者-跟随者跟踪动态的稳态性能。其核心挑战在于如何量化动态拓扑切换对系统整体与个体性能的影响。解决方案的关键在于引入马尔可夫跳变线性系统(Markov jump linear systems, MJLS)框架,推导出各智能体偏离一致状态的稳态协方差及跟踪误差的表达式,从而实现对个体与群体性能随交互拓扑与切换动态变化的定量分析。在此基础上,将静态图中的鲁棒性、确定性指数(certainty index)和联合中心性(joint centrality)等概念拓展至MSG场景,并通过分析双拓扑切换情形,揭示切换机制对系统性能的影响规律。数值仿真进一步验证了拓扑切换对协同任务鲁棒性的实际影响。

链接: https://arxiv.org/abs/2606.25888
作者: Sarah H. Cen,Vaibhav Srivastava,Naomi Ehrich Leonard
机构: Carnegie Mellon University (卡内基梅隆大学); Michigan State University (密歇根州立大学); Princeton University (普林斯顿大学)
类目: ystems and Control (eess.SY); Multiagent Systems (cs.MA)
备注: An extended version of an earlier IEEE CDC paper

点击查看摘要

Abstract:We investigate how time-varying interactions, modeled via a Markov switching graph (MSG), impact the robustness of noisy multi-agent dynamics in both continuous- and discrete-time settings. Our focus is on the steady-state performance of consensus and leader-follower tracking dynamics subject to stochastic noise. Using the framework of Markov jump linear systems (MJLS), we derive expressions for the steady-state covariance of each agent’s deviation from consensus and tracking error, respectively, and use them to quantify individual and group performance as a function of the interaction graphs and the switching dynamics. We extend established notions of robustness, certainty indices, and joint centrality from static graphs to the MSG setting. To gain analytical insight, we specialize our results to systems switching between two topologies and characterize how switching influences performance. Numerical simulations further illustrate how switching topologies affects system robustness in both coordination tasks.

[MA-3] Generating Realistic Individual Activity Schedules via Activity Location Allocation Based on Simulated Travel Times

【速读】:该论文旨在解决生成真实可信的个体每日活动日程(Daily Activity Schedules)过程中存在的难题,尤其是在保持模拟出行时间与实际出行调查数据一致性的方面。由于隐私保护限制,直接获取个体真实活动日程困难,因此现有方法通常依赖于公开的人口统计数据与出行调查数据进行合成,但受限于交通行为建模的复杂性,生成的日程往往难以同时满足活动位置分布合理性和出行时间匹配度。为此,论文提出一种基于动态规划(Dynamic Programming)迭代优化的框架,通过反复调整活动地点分配以逼近调查中报告的出行时间分布。其解决方案的关键在于利用动态规划进行活动位置的递进式分配,并通过多轮迭代显著降低模拟出行时间与调查数据之间的偏差——数值实验表明,相较于首次迭代,该方法可使模拟出行时间与调查数据间的差异减少52.2%。

链接: https://arxiv.org/abs/2606.24566
作者: Tatsuya Mitomi,Yahya Gamal,Esra Suel,Gary Polhill,Daniel Konioukhov,Alison Heppenstall
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: 8 pages, 5 figures. This is the author version of a short paper accepted for presentation in the poster session at the 17th Conference on Spatial Information Theory (COSIT 2026)

点击查看摘要

Abstract:Individual level daily activity schedules are essential for a wide range of applications, including infectious disease control, urban transportation planning, and policy design. In practice, such schedules are typically generated by combining population data with travel survey data. These data sources are used because they are often publicly available, whereas observed individual activity schedules are difficult to obtain due to privacy concerns. However, because of the complexity of mobility modelling, it is difficult to generate realistic activity schedules that also preserve travel times consistent with those reported in travel surveys. To address this issue, we propose a framework for generating activity schedules that iteratively applies a dynamic programming method to allocate activity locations based on simulated travel times. Numerical experiments with dummy data show that the proposed method reduces the discrepancy between simulated travel times and those reported in travel surveys by 52.2% relative to the first iteration through iterative refinement.

[MA-4] Agent ic evolution of physically constrained foundation models

【速读】:该论文旨在解决当前通用型人工智能(Generalist Agents)在自动化科学发现中因缺乏物理约束而产生的“幻觉”问题,即生成与实际硬件不兼容的系统设计。其核心挑战在于如何在复杂且受限的物理环境下实现高效、可执行的硬件-软件协同设计。解决方案的关键在于提出一种基于进化知识图谱(Evolutionary Knowledge Graph)的多智能体自主发现引擎,通过提取“算法链式思维”(algorithmic Chain-of-Thought),将无导向的随机搜索转化为受知识驱动的结构化演化过程。该框架利用历史科学创新数据构建先验知识体系,引导生成符合硬件约束的计算系统架构,并成功应用于大模型部署场景,实现了两种新型硬件感知压缩方法(Q-Enhance 和 MoE-Salient-AQ),显著优于人工设计的启发式方案;同时通过带宽高效的敏感性分析(Sensitivity Profile),在双A100服务器上成功部署2350亿参数模型,内存占用降低75%,仅损失0.64%精度,验证了其在严格物理边界下的可扩展性与实用性,从而建立了一种面向机器驱动发现的可规模化硬件-软件协同设计范式。

链接: https://arxiv.org/abs/2606.25532
作者: Jiangwei Zhang,Wen Sun,Chong Wang,Shiyao Li,Cheng Che,Chunjing Han,Dan Meng,Jian Yang,Yu Wang,Rui Hou
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Tsinghua University (清华大学)
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 29 pages, 5 main figures and 4 extended data figures

点击查看摘要

Abstract:Artificial intelligence increasingly drives automated scientific discovery, yet contemporary generalist agents lack physical grounding, frequently hallucinating hardware-incompatible designs. Here, we present a physically grounded, multi-agent discovery engine that autonomously architects hardware-compliant computing systems. Anchored by an Evolutionary Knowledge Graph structuring past scientific innovations, the framework extracts an “algorithmic Chain-of-Thought” to transform blind stochastic search into directed structural evolution. Applied to the extreme testbed of foundation model deployment, the engine evolved two hardware-aware compression methodologies surpassing human-engineered heuristics: Q-Enhance mitigates long-context accuracy loss in dense models, and MoE-Salient-AQ outperforms state-of-the-art manual sparse Mixture-of-Experts designs by 3.7% at sub-3-bit regimes. Utilizing a bandwidth-efficient Sensitivity Profile, we successfully deployed a massive 235-billion-parameter model onto a constrained dual-A100 server, reducing memory requirements by 75% with a marginal 0.64% accuracy degradation. By transforming unconstrained combinatorial search into knowledge-driven autonomy, this establishes a scalable hardware-software co-design paradigm for machine-driven discovery within strict physical boundaries.

[MA-5] Low Variance Trust Region Optimization with Independent Actors and Sequential Updates in Cooperative Multi-agent Reinforcement Learning

【速读】:该论文旨在解决协作式多智能体强化学习(Cooperative Multi-Agent Reinforcement Learning)中,基于独立智能体(independent actors)设定下,顺序更新框架中优势函数(advantage function)因重要性采样导致的方差指数级增长问题。该问题会引发训练过程不稳定及收敛困难。其解决方案的关键在于提出一种裁剪目标(clipping objective),通过控制优势函数在顺序更新中的波动上界,有效抑制了高方差问题。在此基础上,作者建立了具有次线性收敛速度的单调性边界,并证明算法可收敛至ε-纳什均衡(ε-Nash Equilibria)。进一步地,基于该裁剪目标设计了两种新的实用算法,在三个主流多智能体强化学习基准测试中均表现出优于现有基线的方法性能,尤其在稳定收敛性和低优势函数方差估计方面表现突出。

链接: https://arxiv.org/abs/2606.25526
作者: Bang Giang Le,Viet Cuong Ta
机构: Vietnam National University, Hanoi (河内国家大学)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Cooperative multi-agent reinforcement learning assumes each agent shares the same reward function and can be trained effectively using the Trust Region framework of single-agent. Instead of relying on other agents’ actions, the independent actors setting considers each agent to act based only on its local information, thus having more flexible applications. However, in the sequential update framework, it is required to re-estimate the joint advantage function after each individual agent’s policy step. Despite the practical success of importance sampling, the updated advantage function suffers from exponentially high variance problems, which likely result in unstable convergence. In this work, we first analyze the high variance advantage both empirically and theoretically. To overcome this limitation, we introduce a clipping objective to control the upper bounds of the advantage fluctuation in sequential updates. With the proposed objective, we provide a monotonic bound with sub-linear convergence to \epsilon -Nash Equilibria. We further derive two new practical algorithms using our clipping objective. The experiment results on three popular multi-agent reinforcement learning benchmarks show that our proposed method outperforms the tested baselines in most environments. By carefully analyzing different training settings, our proposed method is highlighted with both stable convergence properties and the desired low advantage variance estimation. For reproducibility purposes, our source code is publicly available at this https URL.

[MA-6] Rate-Aware Quantum-Inspired Trajectory Learning for Interference-Limited Multi-UAV Networks

【速读】:该论文旨在解决无人机(UAV)在轨迹优化中面临的维度诅咒问题,即在干扰受限环境和庞大搜索空间下,实现实时协同调度时计算开销过大。其核心挑战在于如何在保证服务质量(QoS)的前提下,高效协调多架无人机以最大化网络容量。解决方案的关键在于提出一种速率感知的量子退火图压缩(Rate-Aware Quantum-Annealed Graph Condensation, RA-QAGC)机制,通过速率感知的图抽象技术识别高吞吐区域,并结合去中心化强化学习引导无人机轨迹向吞吐最优区域自适应调整,从而在保持系统可扩展性的同时实现干扰感知的协同优化。仿真结果表明,RA-QAGC方案在总吞吐量上达到59.4 Mbps,优先用户吞吐量达23.9 Mbps,相较基线方案分别提升约15%和34%,验证了其在复杂场景下的有效性与优越性。

链接: https://arxiv.org/abs/2606.25480
作者: Khaoula Khaled,Muhammad Afaq,Ali Arshad Nasir,Zeeshan Kaleem
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unmanned aerial vehicle (UAV) can provide on-demand, high-capacity connectivity in disaster and normal situation. However, it faces a challenge of curse of dimensionality in trajectory optimization, where interference-limited environments and vast search spaces make real-time coordination computationally expensive. To overcome this challenge, we propose the Rate-Aware Quantum-Annealed Graph Condensation (RA-QAGC) scheme, which combines rate-aware graph abstraction with decentralized reinforcement learning to enable scalable, interference-aware UAV coordination. By identifying high throughput locations and guiding UAV trajectory adaptation toward throughput-optimal regions, RA-QAGC effectively balances network capacity by maintaining quality-of-service (QoS) requirements. Simulation results demonstrate the proposal outperformed over existing schemes by achieving 59.4 Mbps total throughput and 23.9 Mbps priority-user throughput, representing gains of approximately 15% and 34%, respectively, over the baseline schemes.

[MA-7] Agent ic Knowledge Tracing: A Multi-Agent LLM Architecture for Stealth Assessment of Financial Literacy in Serious Games

【速读】:该论文旨在解决在不影响学习体验的前提下,于教育类严肃游戏(serious game)中实现对金融素养(financial literacy)的隐蔽评估(stealth assessment)这一核心挑战。其解决方案的关键在于提出一种名为“代理式贝叶斯知识追踪(Agentic BKT)”的多智能体大语言模型架构,通过分阶段、领域分解式的协同推理机制,从开放性游戏行为事件中精准捕捉金融素养的多维度表现。具体而言,该方法首先将玩家决策结构化为事件日志;随后利用经过三名领域专家验证的四点量表(Fleiss kappa = 0.624,一致性良好)的LLM分类器对每项行为进行标注;接着由四个专注于风险规避、投资、支出与信用管理的专用领域智能体,在会话层面基于行为轨迹执行领域内推理,并输出各能力域的贝叶斯知识追踪(Bayesian Knowledge Tracing, BKT)掌握度估计;最后由一个专家裁判智能体整合各领域估计结果,生成综合掌握度评分。实验在193名K-12学生共264个游戏会话中验证表明,该框架所生成的掌握度估计与学习成效(r = 0.276, p < 0.0001)及后测成绩(r = 0.333, p < 0.0001)显著相关,且与前测成绩无显著关联,具备良好的收敛效度与区分效度;相较单一LLM基线(r = 0.095,不显著),其预测有效性提升近三倍,充分证明了领域分解(domain decomposition)与会话级推理(session-level reasoning)在捕捉金融素养复杂性中的关键作用。

链接: https://arxiv.org/abs/2606.25358
作者: Gabriel Santos,Rita Julia,Marcelo Nascimento
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 8 pages, 5 figures, IEEE CoG 2026

点击查看摘要

Abstract:Assessing financial literacy during gameplay without disrupting the learning experience remains a key challenge in serious games for education. We present the Agentic BKT pipeline, a multi-agent large language model architecture for stealth assessment of financial competencies from open-ended gameplay events. The pipeline processes events from a 2D platformer serious game aligned with the OECD/INFE financial literacy framework through four phases: (1) the game captures every player decision as a structured event log; (2) an LLM event classifier labels each action on a four-point rubric validated against three domain experts (Fleiss kappa = 0.624, substantial agreement); (3) four domain-specific agents specializing in risk mitigation, investing, spending, and credit management perform session-level reasoning over behavioral trajectories, feeding per-competency Bayesian Knowledge Tracing that estimates mastery within each domain; and (4) an expert judge agent synthesizes the domain-level estimates into an overall mastery score. Evaluated with 193 K-12 participants across 264 game sessions, the Agentic BKT pipeline yields mastery estimates significantly correlated with learning gain (r = 0.276, p = 0.0001) and post-test scores (r = 0.333, p 0.0001) while showing no correlation with pre-test scores, providing both convergent and discriminant validity. The multi-agent approach approximately triples the predictive validity of a single-LLM baseline (r = 0.095, not significant) in this study, demonstrating that domain decomposition and session-level reasoning play a central role in capturing the multidimensional nature of financial literacy from gameplay

[MA-8] Bridging the Post-discharge Gap: A Traceable Multi-agent Framework for Safe and Continuous Care

【速读】:该论文旨在解决出院后连续性医疗随访中面临的医护人员短缺、患者病史碎片化及跨科室信息孤岛等问题,尤其针对生成式AI在临床长期照护场景中因幻觉风险和无法有效处理纵向、个体化约束条件而难以落地的瓶颈。其解决方案的关键在于提出Healink——一种基于记忆增强的多智能体框架,通过三重核心机制实现可追溯、合规且高临床效用的智能随访:一是引入分诊路由机制实现任务精准分配;二是构建基于稳健关系型数据库的统一记忆增强模块,保障低延迟与高一致性;三是采用基于约束的检索增强生成(Retrieval-Augmented Generation, RAG)引擎,结合向量化的历史临床记录与多维度加权相似性函数,在跨患者与患者内病例匹配中实现精准匹配,并主动规避跨部门药物冲突。该框架在包含400例常规及85例复杂真实随访案例的数据集上验证,其生成结果在盲法临床专家评估中显著优于人类医生基线,在权威性与临床安全性方面表现更优,同时提供可解释的“白盒”证据链,为智能患者管理提供了可扩展、安全高效的新范式。

链接: https://arxiv.org/abs/2606.25334
作者: Runwei Guan,Yi Zhou,Heyi Lin,Jinjing Zhu,Mingyuan Hou,Yang Yang,Fang Yuan,Xiaohong Lin,Shaofeng Liang,Xuming Hu,Tao Li,Tianbin Zhao,Yutao Yue,Zhiyuan Wang,Hui Xiong
机构: HKUST-GZ (香港科技大学(广州)); Distinct Clinic (Distinct Clinic)
类目: Multiagent Systems (cs.MA)
备注: 23 pages, 10 figures

点击查看摘要

Abstract:Post-discharge clinical follow-up is critical for maintaining continuity of care and mitigating long-term health risks. However, traditional follow-up paradigms suffer from shortage of health workforce, fragmented patient histories, and information silos across clinical departments. While large language models have demonstrated potential in medical question-answering, their deployment in continuous care is hindered by hallucination risks and a fundamental inability to reason over longitudinal, patient-specific constraints. Here we present Healink, a memory-enhanced multi-agent framework to support AI-assisted post-discharge follow-up by generating prescription-grounded, traceable responses that improved completeness and perceived clinical utility in retrospective and physician-blinded evaluations. The architecture seamlessly integrates a triage routing mechanism, a unified memory enhancement module utilizing a robust relational database for optimal latency, and a strict constraint-based retrieval-augmented generation engine. By vectorizing historical clinical records and employing weighted similarity functions across diverse phenotypic and intervention dimensions, Healink ensures precise inter-patient and intra-patient case matching while actively preventing cross-departmental drug conflicts. We evaluated Healink on a dataset comprising 400 continuous and 85 highly complex real-world follow-up cases, alongside the webMedQA benchmark. In a rigorous single-blind evaluation conducted by clinical experts, the framework outperformed human physician baselines in both authoritativeness and clinical safety. By generating a traceable, white-box evidence chain, Healink provides a scalable, safe, and highly effective paradigm for intelligent patient management, ultimately enhancing societal healthcare outcomes.

[MA-9] EvoFlock: evolved inverse design of multi-agent motion

【速读】:该论文旨在解决多智能体系统(multi-agent system)运动模型参数调优困难的问题。在模拟鸟群、人群、车流等群体行为时,尽管单个智能体的行为模型具有直观的控制参数,但这些参数之间的非线性耦合关系使得整体群体行为难以预测与调控。传统方法依赖经验性的逐项调整,常导致某一行为特征的优化引发其他特征的意外变化,形成耗时且低效的试错过程。为此,本文提出一种逆向设计(inverse design)方法,通过用户定义的目标函数(如适应度函数或损失函数)量化期望的群体行为,并结合遗传算法(genetic algorithm)自动搜索最优参数组合。其核心创新在于将复杂的参数调优问题转化为基于目标函数的优化任务,使群体行为的生成具备可引导性和可控性。研究发现,鸟类群体中显著的对齐现象(alignment)主要源于维持个体间适当间距这一基本机制,揭示了群体智能行为背后的潜在规律。

链接: https://arxiv.org/abs/2606.25280
作者: Craig Reynolds
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Graphics (cs.GR); Multiagent Systems (cs.MA)
备注: To appear in the Proceedings of Artificial Life 2026

点击查看摘要

Abstract:This paper describes an automatic method for adjusting or tuning models of multi-agent motion. Simulating the motion of bird flocks, human crowds, vehicle traffic, and other multi-agent systems is a widely used technique. These simulations model the behavior of a single group member (bird, human, or vehicle). The group behaviors (flock, crowd, traffic) emerge from interactions between group members. These models typically have many numerical control parameters. Even if each parameter is intuitive in isolation, their interaction can be complex and nonlinear. It is challenging to determine which parameters to adjust for the desired change in group behavior. Changing one aspect of group behavior often causes other aspects to change, leading to a tedious process of incremental changes. This work takes an inverse design approach. The desired group behavior is measured with a user-defined objective(/fitness/loss) function and optimized with a genetic algorithm. The objective function used here for basic flocking rewards proper spacing with neighbors, flying near a desired speed, and avoiding obstacles. Interestingly, the vivid alignment seen in bird flocks appears to emerge from maintaining proper spacing between flockmates.

[MA-10] Decentralized Coordination of Autonomous Traffic Through Advanced Air Mobility Corridors

【速读】:该论文旨在解决在缺乏集中式交通管理的情况下,如何实现先进空中交通(Advanced Air Mobility, AAM)飞行器在专用走廊中的高效、安全运行问题。传统观点认为,基于走廊的运行模式虽易于实施,但因缺乏全局协调而效率低下,尤其在去中心化环境下难以维持秩序与间隔。本文提出一种创新解决方案:通过自主飞行器在仅依赖局部信息的去中心化环境中,利用学习机制实现自我组织,形成有序的走廊流。其关键在于采用多智能体强化学习框架,使飞行器在无需中央指令的前提下,自发适应走廊边界并协同避让冲突,在多种典型场景(单走廊带计量出口、连续双走廊、分叉走廊)下均表现出色——超过94%的时间能准确遵循走廊边界,并以相对高效的路径抵达目标。研究发现,在低至中等交通密度下,仅需极少战术干预即可维持安全间隔;而在高密度情况下,干预频率上升,表明系统仍需进一步优化应对高峰负荷的能力。这一成果为未来无人化、分布式空域管理提供了可行的技术路径。

链接: https://arxiv.org/abs/2606.23832
作者: Jasmine Jerry Aloor,Hamsa Balakrishnan
机构: Massachusetts Institute of Technology (麻省理工学院); AIAA (美国航空航天学会)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO); Systems and Control (eess.SY)
备注: Presented at the AIAA SciTech 2026 Forum

点击查看摘要

Abstract:The use of dedicated corridors for Advanced Air Mobility (AAM) traffic is one of the most commonly proposed pathways to integrating them into existing airspace operations. Most prior research has focused on the design of networks of AAM corridors and conflict resolution for aircraft within corridors. It is also generally believed that while attractive from an implementation perspective, corridor-based operations may be inefficient, especially in the absence of centralized traffic management. In this paper, we show that contrary to this belief, it is possible for autonomous aircraft to learn to self-organize into corridor flows in decentralized settings. We illustrate our approach using scenarios in which fixed-wing aircraft need to safely and efficiently traverse (1) a single corridor with metering after the exit, (2) a sequence of two consecutive corridors, and (3) a corridor that splits into two. We find that in decentralized settings with only local information, the aircraft are able to conform to the corridor boundaries more than 94% of the time and reach their goal in a relatively efficient manner. Furthermore, tactical interventions to handle violations of the separation minimum are needed only infrequently in low- and medium-density settings. However, such tactical interventions become more frequently necessary only when traffic density is high. Comments: Presented at the AIAA SciTech 2026 Forum Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO); Systems and Control (eess.SY) Cite as: arXiv:2606.23832 [cs.MA] (or arXiv:2606.23832v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2606.23832 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.2514/6.2026-0236 Focus to learn more DOI(s) linking to related resources

[MA-11] From Task-Guided Conversational Graphs to Goal-Oriented Dialogue Runtimes

【速读】:该论文旨在解决在复杂、多领域且可中断的对话场景中,用户多个相互依赖的目标在交互过程中难以维持对话连续性的问题。现有图结构与多智能体编排框架虽能支持大规模语言模型(LLM)工作流的生产部署,但在面对目标可暂停、恢复、修改或失效等动态变化时,无法可靠地保障目标状态的一致性与上下文延续性。其解决方案的核心在于提出一种框架无关的目标导向对话运行时(Goal-Oriented Dialogue Runtime, GODR)设计模式,将目标(goal)、任务框架(task frame)、生命周期状态(lifecycle state)、失效规则(invalidation rules)以及恢复契约(resumption contracts)均作为一等运行时对象进行管理,并通过解耦有界执行逻辑至图运行时、智能体、工具或API,实现对复杂目标状态的精确控制与跨上下文恢复。GODR不适用于简单引导式流程,而是针对高复杂度、需持续追踪多目标演化路径的对话系统,强调通过形式化建模和架构选择准则来支撑未来实证验证,而非直接宣称性能指标。

链接: https://arxiv.org/abs/2606.23797
作者: Mariano Garralda-Barrio
机构: Laboratorio de Innovación Aplicada (L2IA) at Minsait (Indra Group)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 21 pages, 7 figure, 10 tables

点击查看摘要

Abstract:Graph and multi-agent orchestration frameworks make production large language model (LLM) workflows practical, but they do not by themselves solve conversational continuity when users maintain several interdependent objectives. This conceptual systems paper focuses on the high-complexity end of that design space, where goals can be suspended, resumed, revised, and invalidated by actions in other goals. We introduce the Goal-Oriented Dialogue Runtime (GODR), a framework-neutral design pattern that treats goals, task frames, lifecycle state, invalidation rules, and resumption contracts as first-class runtime objects while delegating bounded execution to graph runtimes, agents, tools, or application programming interfaces (APIs). GODR is not proposed as a replacement for workflow graphs in simple guided processes; it is intended for complex, multi-domain, interruptible conversations where objective continuity cannot be recovered reliably from agent identity, chat history, or execution-graph position alone. The paper formalizes the problem, proposes runtime objects and architecture-selection criteria, and frames evaluation as an agenda for future empirical validation rather than as a measured performance claim.

[MA-12] Emergent Relational Order in LLM Agent Societies: From Collective Affect to Authority Stratification ACL2026

【速读】:该论文旨在解决费孝通提出的“差序格局”(Differential Order Pattern)在社会结构演化中的机制性解释问题,特别是其如何从个体互动中自发生成并维持长期社会秩序。传统研究多将其视为文化特异性现象,缺乏对内在运作机制的系统建模与验证。本文提出CAREB-MAS多智能体框架,融合情感控制理论(Affect Control Theory)、社会认同理论(Social Identity Theory)及涂尔干式集体情感(Durkheimian collective affect),通过情绪-伦理-信念链驱动智能体决策,并构建动态演化的自我中心身份。该框架仅设定个体生产、基于偏好的资源分配及最低限度交互协议,不预设社会结构。在长时程模拟中,智能体自发涌现出五大核心差序格局特征:稳定的劳动分工、基于人情(guanxi)的经济伦理、合作随社会距离衰减、涌现的亲属关系权威以及以宗族为基础的中心-边缘分层。这些模式随生产结构变化,由血缘中心整合向功能相互依赖演进。实验结果表明,差序格局可被解释为一般社会机制在特定结构条件下的敏感性涌现结果,而基于大语言模型(LLM)的多智能体仿真为此类社会结构与变迁研究提供了跨学科的分析范式。

链接: https://arxiv.org/abs/2606.23764
作者: Zhiyuan Ji,Xinyu Chen,Ziqi Dai,Shiyun Tang,Chunyu Wei,Yueguo Chen
机构: Renmin University of China (中国人民大学); Beihang University (北京航空航天大学); Minzu University of China (中央民族大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 37 pages

点击查看摘要

Abstract:Fei Xiaotong’s Differential Order Pattern characterizes rural society as egocentric and relationally graded, with cooperation attenuating over social distance. Although often treated as culturally specific, its mechanistic basis remains under-operationalized, and prior LLM-based simulations have mainly addressed short-term coordination rather than long-horizon social structure. We propose CAREB-MAS, a multi-agent framework grounded in Affect Control Theory, Social Identity Theory, and Durkheimian collective affect. Agents reason through an emotion-ethics-belief chain and maintain dynamically evolving egocentric identities, while the macro environment specifies only individual production, preference-based allocation, and minimal interaction protocols. Across long-horizon simulations, agents spontaneously reproduce five core Differential Order phenomena: stable labor specialization, guanxi-based economic ethics, relational decay of cooperation, emergent relational authority, and clan-based center-periphery stratification. These patterns shift with production structure from kin-centered integration toward greater functional interdependence. Extensive experiment results support interpreting Differential Order as a structure-sensitive emergent outcome of general social mechanisms, with LLM-based multi-agent simulation providing an interdisciplinary framework for studying social structure and change.

[MA-13] Engineering Reliable Autonomous Systems: Challenges and Solutions

【速读】:该论文旨在解决自主系统(Autonomous Systems)在工程实践中可靠性不足的核心问题,特别是在其日益广泛应用背景下,如何构建可信赖、可验证且安全的自主系统。其核心挑战在于现有技术与实际工程需求之间的鸿沟:尽管学术界已发展出一系列成熟的验证与确认(Verification and Validation, V&V)方法及安全软件架构,但这些方法尚未被广泛应用于工业实践。论文提出的关键解决方案是通过整合形式化方法(Formal Methods)、多智能体系统(Multiagent Systems)与软件工程(Software Engineering)领域的研究成果,形成一份面向实际应用的路线图(Roadmap),系统梳理了当前在自主系统验证与验证、真实世界工程实现以及安全软件架构方面的关键挑战,并明确了从现有成熟技术到实际落地的路径。该路线图不仅识别出可立即应用的已有技术,还指出了仍需深入研究的开放性问题,旨在推动学术界与产业界的协同创新,加速可靠自主系统的工程化进程。

链接: https://arxiv.org/abs/2606.23760
作者: Marie Farrell,Matt Luckcuck,Angelo Ferrando,Rafael C. Cardoso,Natasha Alechina,Marco Autili,Diana Benjumea Hernandez,Luciana Brasil Rebelo dos Santos,Daniela Briola,Ana Cavalcanti,Christian Colombo,Louise A. Dennis,Clare Dixon,Michael Fisher,Mario Gleirscher,Taylor Johnson,Charles Lesire,Livia Lestingi,Sven Linker,Brian Logan,Colin Paterson,Fabio Papacchini,Patrizio Pelliccione,Pedro Ribeiro,Maike Schwammberger,Silvia Lizeth Tapia Tarifa,Hazel Taylor,Jim Woodcock,Mengwei Xu,Yi Yang,Huan Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Engineering reliable autonomous systems is an important and growing topic in computer science. As autonomous systems become more prevalent, easy-to-use techniques for building them reliably are increasingly important. This workshop report captures and expands on the discussions at the Lorentz Center Workshop “Engineering Reliable Autonomous Systems” (ERAS), held from 10 to 14 June 2024. The workshop was co-organised by the organisers of the Workshop on Formal Methods for Autonomous Systems (FMAS) and the Workshop on Agents and Robots for reliable Engineered Autonomy (AREA). It brought together members of the FMAS and AREA communities, industry practitioners, and representatives from sectors where autonomous systems pose distinctive engineering challenges. The workshop focused on three main research topics: techniques for verification and validation of autonomous systems; engineering real-world autonomous systems; and software architectures for safe autonomous systems. Its main outcome is a catalogue of challenges in these areas and, most importantly, a pathway to solutions. Some challenges can already be tackled by techniques that are well known in academia but have not yet become regularly used in practice. Other challenges remain unresolved and require further research. This roadmap is intended to support future research and industrial collaboration. Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE) Cite as: arXiv:2606.23760 [cs.RO] (or arXiv:2606.23760v2 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2606.23760 Focus to learn more arXiv-issued DOI via DataCite

自然语言处理

[NLP-0] Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models

【速读】: 该论文旨在解决生成式任务与问答任务中,预训练语言模型在常识知识获取与补全方面的性能瓶颈问题,尤其关注不同预训练目标对编码器-解码器架构模型表现的影响。其核心解决方案在于提出一种基于任务匹配的多目标优化框架(Match Task to Objective, MTO),通过自动识别目标任务所对应的最优预训练目标,并据此构建适配的任务相关数据集,实现无监督条件下的模型适应。在微调阶段,设计与预训练及适配目标相一致的新型提示模板,确保模型学习过程与任务需求高度对齐。实验表明,该方法在少样本设置下相较传统方法性能提升超过120%,并在全数据场景下仍显著优于基线模型。此外,该框架进一步拓展至提示调优(prompt-tuning)领域,为软提示工程提供了有效指导,显著提升了提示调优的表现。整体策略为特定任务定制化模型选择与优化提供了可解释、可复现的科学依据。

链接: https://arxiv.org/abs/2606.24841
作者: Ahmad Pouramini,Hesham Faili
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt-based learning has emerged as a dominant paradigm in natural language processing. This study explores the impact of diverse pre-training objectives on the performance of encoder-decoder pre-trained language models across generation and question answering tasks, with a focus on commonsense knowledge retrieval and completion. We highlight the benefits of incorporating multiple objectives during both pre-training and fine-tuning stages. We introduce the Match Task to Objective (MTO) framework and methods for determining the appropriate objective for a given task. This framework offers automated methods to prepare task-related data for adaptation through unsupervised training, based on the identified objective. In the fine-tuning stage, we design novel templates that align with the objectives of the pre-training and adaptation stages. When aligned with task requirements, these strategies can achieve a performance gain of over 120% compared to conventional methods in few-shot settings. They significantly outperform related works in few-shot settings and exceed the baseline even in full-dataset scenarios. Furthermore, we extend this approach to include prompt-tuning methodologies, providing guidance for more effective soft prompt engineering and optimization. Our strategies significantly enhance prompt-tuning performance as well. These insights hold substantial value, precisely guiding the selection and optimization of models customized for specific tasks. Code is available at this https URL

[NLP-1] Less is More: Quality-Aware Training Data Selection for Scientific Summarization

【速读】: 该论文旨在解决科学长文档摘要任务中两大核心问题:一是现有数据集普遍将作者撰写的摘要作为“黄金标准”参考摘要,但其与源文章的对齐度和质量存在显著差异;二是公开可用的科学摘要数据集在规模与结构上难以满足现代长上下文模型的需求。针对上述问题,本文提出两项关键解决方案:其一,构建并发布目前规模最大之一的生物医学与生命科学领域长文档摘要数据集,涵盖188万篇PubMed Central(PMC)文章;其二,采用基于源文本对齐与模型评估的双重指标体系,系统分析作者摘要的质量差异。研究发现,作者摘要在与全文内容的一致性方面存在显著波动,且其质量信号可有效指导高质量训练数据的选择。实验表明,在相同训练规模下,基于质量筛选的子集优于随机采样,且在事实准确性等指标上可达到甚至超越更大规模的随机子集表现。研究结果强调了参考摘要质量在科学摘要任务中的关键作用,并验证了质量感知的数据选择策略能够显著提升训练效率。

链接: https://arxiv.org/abs/2606.24828
作者: Maria Nefeli Paraskevopoulou,Tatiana Passali,Grigorios Tsoumakas
机构: Aristotle University of Thessaloniki (亚里士多德大学塞萨洛尼基分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific summarization datasets remain limited in scale and structure for modern long-context models. In this work, we address both challenges by a) constructing and releasing one of the largest biomedical and life science datasets for long-document summarization, containing 1.88 million PMC articles, and b) analyzing the reference quality of author-written abstracts with source-grounded and model-based metrics. We show that author-written abstracts vary in their alignment with the full article and that these quality signals can guide training-data selection. Training on selected high-quality subsets outperforms random sampling at matched training sizes and can match or exceed larger random subsets on factuality-oriented metrics. Our findings suggest that reference quality is an important factor in scientific summarization and that quality-aware data selection can improve training efficiency.

[NLP-2] L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models

【速读】: 该论文旨在解决马拉地语(Marathi)在自然语言处理(NLP)领域长期存在的资源匮乏问题,尤其针对其标注语料库稀缺、缺乏标准化评估基准的现状。马拉地语因其丰富的形态变化、相对自由的词序、无大小写规范以及与印地语和英语的普遍混用等特性,给计算建模带来了显著挑战。为应对这一问题,研究提出L3Cube-MahaPOS——一个高质量的马拉地语词性标注(Part-of-Speech, POS)数据集,包含从新闻文本中提取的32,354条经人工逐句标注的句子,采用与通用依存标注(Universal Dependencies)对齐的16类标签体系。关键解决方案在于构建一套结构化的预处理流程,涵盖Unicode规范化、基于德文格里文字(Devanagari)的分词及噪声过滤,确保各数据划分间标签一致性。同时,通过在六种模型架构(包括HMM、CRF、BiLSTM、BiLSTM+CharCNN、MuRIL及专为马拉地语设计的MahaBERT-v2)上进行基准测试,验证了该数据集的有效性,最优模型在15个标签类别上达到88.67%的词级准确率和81.67%的宏平均F1分数。研究已公开数据集、标注指南及训练好的模型检查点,以推动马拉地语NLP研究的发展。

链接: https://arxiv.org/abs/2606.24825
作者: Hariom Ingle,Ronit Ghode,Ishwari Gondkar,Jidnyasa Harad,Raviraj Joshi
机构: Department of Information Technology, PICT, Pune, India; Indian Institute of Technology Madras, Chennai, India; L3 Cube Labs, Pune, India
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most spoken languages worldwide, it remains severely under-resourced in annotated corpora and standardised evaluation benchmarks. Marathi presents unique challenges for computational modelling owing to its rich morphology, relatively free word order, lack of capitalisation conventions, and pervasive code-mixing with Hindi and English. We introduce L3Cube-MahaPOS, a gold-standard POS tagging dataset for Marathi comprising 32,354 manually annotated sentences drawn from news text. Annotation was performed entirely manually by a team of Marathi-proficient annotators following a 16-tag Universal Dependencies-aligned scheme. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. We benchmark the dataset across six model families spanning HMM, CRF, BiLSTM, BiLSTM+CharCNN, MuRIL, and the Marathi-specific transformer MahaBERT-v2. The best system achieves 88.67% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes. We release the dataset, annotation guidelines, and trained model checkpoints to foster further research in Marathi NLP.

[NLP-3] SHERLOC: Structured Diagnostic Localization for Code Repair Agents

【速读】: 该论文旨在解决大型语言模型(LLM)代理在处理代码仓库级编程任务时,因大量计算资源消耗于故障定位而效率低下的问题。现有方法虽能实现文件级别的故障定位,但缺乏可操作的诊断上下文,无法为修复代理提供有效支持。其解决方案的关键在于提出一种无需训练的SHERLOC(Structured Hypothesis-driven Exploration and Reasoning for Localization)框架,通过结合推理型大模型与轻量级仓库工具,并引入自恢复机制,实现无需微调或多智能体协同的高效故障定位。SHERLOC在不同模型规模下均达到当前最优性能,在SWE-Bench Lite上实现84.33%的准确率@1,在SWE-Bench Verified上实现81.27%的召回率@1;将其中的定位结果与诊断信息注入修复代理后,平均提升5.95个百分点的解决率,同时降低36.7%的定位和23.1%的总令牌消耗,显著提升了整体修复效率。

链接: https://arxiv.org/abs/2606.24820
作者: Hovhannes Tamoyan,Sean Narenthiran,Erik Arakelyan,Mira Mezini,Boris Ginsburg
机构: NVIDIA(英伟达); TU Darmstadt(达姆施塔特工业大学); hessian.AI 国家应用网络安全研究中心 ATHENE
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact repository tools and self-recovery, without fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales: 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified; at ~30B parameters, it matches or outperforms other agentic methods. Injecting our locations and diagnostic findings into repair agents yields, on average, +5.95 pp resolve rate on SWE-Bench Verified while cutting localization and total tokens by 36.7% and 23.1%.

[NLP-4] Paying to Know: Micro-Transaction Markets for Verified Product Information in Agent ic E-Commerce

【速读】: 该论文旨在解决传统自然语言处理(NLP)在电商场景中对购物聊天机器人(shopping chatbot)定位过于局限的问题——即仅将其视为推荐或转化工具,聚焦于用户与商品目录的匹配与成交。随着代理原生微支付通道(agent-native micro-payment rails,如x402、AP2)的出现,资源稀缺性发生了根本转变:当买家为可自主深度探查的智能代理时,信息获取的可信度与决策相关性成为新瓶颈。因此,论文提出将电商生态重构为一个以验证信息为核心的微交易市场,其中买方代理通过小额支付逐步解锁卖方及评论者提供的结构化数据(如服务历史、第三方检测报告、物料清单、经审计的销售与支持指标),采用按需付费的“免费增值”(freemium)模式,并基于声誉机制评估评论者的可信度。该方案的关键在于将价值创造从“流量排名”转向“信息透明”,并通过一系列具体的自然语言处理挑战实现其落地,包括:成本最优的信息获取、数据定价与协商机制、实时实体解析、基于证据的价值交换以及隐私保护的主体建模。这些任务相较于对话流畅性,更应成为未来NLP研究的核心关注点。

链接: https://arxiv.org/abs/2606.24783
作者: Filippos Ventirozos,Matthew Shardlow
机构: Manchester Metropolitan University (曼彻斯特都会大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure. Vision paper, under review

点击查看摘要

Abstract:Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2) changes what is scarce. When the buyer is an autonomous agent that can investigate exhaustively, the bottleneck is no longer matching products but acquiring trustworthy, decision-relevant information about them. We envision agentic e-commerce as a micro-transaction market for verified information: buyer agents spend fractions of a cent to progressively unlock seller- and reviewer-supplied data – service histories, third-party test reports, bills of materials, audited sales and support metrics – paid for a la carte under a freemium model, with reviewer trust scored reputationally. We sketch the architecture of such a market and argue that it rewards genuine product quality and yields truer competition than ranking-based storefronts. We then translate the vision into concrete NLP problems – cost-optimal information acquisition, data pricing and negotiation, real-time entity resolution, grounded value exchange, and privacy-preserving persona modelling – and argue that these, not chat fluency, deserve the field’s attention.

[NLP-5] Posterior Refinement: Fast Language Generation via Any-Order Flow Maps

【速读】: 该论文旨在解决非自回归生成模型在高效迭代优化与生成质量之间难以平衡的核心问题。现有方法中,掩码扩散模型(Masked Diffusion Models, MDMs)因存在因子分解误差,在并行生成多个词元时导致样本质量显著下降;而流映射语言模型(Flow Map Language Models, FMLMs)虽通过联合序列传输实现了优异的少步生成性能,却牺牲了MDMs所具备的推理阶段灵活性。其解决方案的关键在于提出FMLM+框架,通过引入类似掩码的噪声调度机制,在单步内完成全序列生成的同时,后验地评估每个词元的全局一致性。基于此,论文进一步提出“后验精炼”(Posterior Refinement)这一新型推理阶段优化策略,使模型能够自适应地修正输出,仅需32倍更少的数值微分方程求解次数(NFEs)即可达到离散基线模型的性能水平。实验表明,FMLM+结合后验精炼在多种基准测试中均显著提升了生成速度与质量之间的权衡,为高保真语言建模提供了可扩展的基础架构。

链接: https://arxiv.org/abs/2606.24773
作者: Manan Agarwal,Sheel Shah,Chanhyuk Lee,Jaehoon Yoo,Jerry Huang,Seunghoon Hong,Aditi Raghunathan,Jinwoo Kim,Nicholas M. Boffi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 23 figures

点击查看摘要

Abstract:Non-autoregressive generation offers a powerful paradigm for iterative refinement, allowing models to recursively critique, erase and regenerate arbitrary subsets of tokens. However, existing non-autoregressive models fail to realize this potential. Masked Diffusion Models (MDMs) suffer from factorization error, causing sample quality to collapse when generating multiple tokens simultaneously. Flow Map Language Models (FMLMs) circumvent this bottleneck via joint sequence transport for excellent few-step generation, but sacrifice the inference-time flexibility of MDMs. We introduce FMLM+, a framework that bridges this gap by equipping FMLM with masking-style noise schedules. While generating the full sequence in a single step, FMLM+ simultaneously scores the global consistency of each token a posteriori. We leverage this to introduce Posterior Refinement, a novel inference-time refinement strategy that enables the model to adaptively self-correct its outputs, matching the performance of discrete baselines with 32x fewer NFEs. Across diverse benchmarks, we demonstrate that FMLM+ with Posterior Refinement improves the speed–quality tradeoff over both MDM and FMLM families, providing a scalable foundation for high-fidelity language modeling.

[NLP-6] Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts

【速读】: 该论文旨在解决从噪声大、多语言的历史文献中准确识别人物在特定时间是否曾出现在某地这一复杂问题,核心挑战在于处理历史语言变异、光学字符识别(OCR)噪声以及跨语言的间接上下文线索。其解决方案的关键在于构建一个三重评估框架,涵盖预测准确性、计算效率与跨领域泛化能力,以全面衡量系统在大规模文化遗产数字化处理中的实际应用价值。通过针对法语、德语和英语三种语言的历史报纸文本(19—20世纪)及早期现代法语文学文本的测试,验证了从先进大语言模型到轻量级专用分类器等多种策略的有效性,揭示了在大规模历史关系抽取任务中准确性、效率与鲁棒性之间的内在权衡。

链接: https://arxiv.org/abs/2606.25935
作者: Juri Opitz,Maud Ehrmann,Corina Raclé,Andrianos Michail,Matteo Romanello,Simon Clematide
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Condensed Overview of CLEF-HIPE-2026 Shared Task Results

点击查看摘要

Abstract:Was this person ever at that place, and if so, when? Answering such questions from noisy, multilingual historical documents is the central challenge of HIPE-2026, the third edition of the HIPE evaluation series. Moving from named entity recognition and linking (HIPE-2020, HIPE-2022) to reasoning about relationships between entities, HIPE-2026 targets two temporally grounded relation types: at , indicating that a person was present at a location at some point prior to a document’s publication date, and isAt , indicating presence contemporaneous with that date. This paper presents the results of the evaluation campaign, which confronted 17 participating teams with the challenges of historical language variation, OCR noise, and indirect contextual cues across three languages: French, German, and English. The datasets include historical newspaper text from the nineteenth and twentieth centuries, as well as a surprise-domain generalization set drawn from early modern French literary texts. A distinctive feature of HIPE-2026 is its three-fold evaluation framework, which assesses predictive accuracy, computational efficiency, and cross-domain generalization, reflecting the practical demands of large-scale historical document processing in the cultural heritage domain. Across more than 40 submitted runs, results reveal a wide range of strategies, from state-of-the-art large language models to lightweight task-specific classifiers, and highlight the trade-offs between accuracy, efficiency, and robustness inherent to historical relation extraction at corpus scale. System descriptions, datasets, and findings are presented and discussed, offering a detailed picture of the current state of temporally grounded relation extraction for historical documents.

[NLP-7] Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity

【速读】: 该论文旨在解决在敏感场景下部署检索增强生成(Retrieval-Augmented Generation, RAG)系统时,因恶意提示词导致的隐私泄露问题。现有方法在引入外部知识的同时,可能将敏感信息(如个人身份信息)暴露于模型输出中,从而引发严重的隐私风险。为应对这一挑战,论文提出一种多智能体协同框架,其核心解决方案是通过语义重写对检索到的内容进行隐私净化。该框架由三个专业化智能体组成:隐私提取智能体负责识别敏感标识符,语义分析智能体确保上下文语义完整性,重构智能体则在去除敏感信息的同时保留原始语义核心。实验结果表明,该方法在ChatDoctor和Wiki-PII数据集上对六种大型语言模型均能显著降低隐私泄露,在LLaMA-3-8B上将目标信息暴露从基线的144次降至仅1次;同时保持较高的上下文保真度(BLEU-1达0.122,优于SAGE方法的0.117)。此外,该框架作为异步预处理模块运行,所有重写操作均为离线一次性完成,不增加在线推理延迟,具备良好的实用性与可扩展性。

链接: https://arxiv.org/abs/2606.24623
作者: Yuanhe Zhao,Tianyu Zhang,Huafei Xing,Derek F. Wong,Jianbin Li,Tao Fang
机构: North China Electric Power University (华北电力大学); University of Macau (澳门大学); Macao Mobile Communications (澳门移动通信)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This full manuscript contains 23 pages and has been formally accepted for publication in Information Processing Management (Elsevier IPM). Tao Fang is the corresponding author

点击查看摘要

Abstract:Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three specialized agents for privacy extraction, semantic analysis, and reconstruction, our approach collaboratively removes sensitive identifiers while preserving the semantic core. We evaluate the framework on the ChatDoctor and Wiki-PII datasets across six large language models. Experimental results demonstrate a significant reduction in privacy leakage under targeted attacks. For instance, we reduced targeted information exposure in LLaMA-3-8B from 144 instances in the baseline to just 1. Furthermore, we maintain strong contextual fidelity with a BLEU-1 score of 0.122, outperforming the existing SAGE method’s 0.117. Finally, the framework operates as an asynchronous preprocessing module, introducing no additional latency to online inference, as all rewriting is executed as a one-time offline preprocessing step. To promote reproducibility, the source code of this work is publicly available at this https URL.

[NLP-8] Same Lesson Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

【速读】: 该论文旨在解决大语言模型(LLM)在跨文化语境下生成叙事时,如何有效保持文化根基意义(cultural grounding context)的问题。当不同文化以各异的表达形式传递相同道德教义时,评估其文化内涵的保留程度变得尤为复杂。研究的关键在于构建一个多语言叙事评估框架,通过跨语言收集的414条涵盖15种语言的谚语作为语义等价的文化提示,利用四类大语言模型生成共计13,000条叙事,并分析模型在跨语言条件下的意义保真度、叙事实现方式及不同模型家族间的解释一致性。结果表明,尽管跨语言提示基本维持了谚语层面的语义一致性,但模型系统性地改变了叙述中的主体性(agency)、社会定位与叙事结构;同时,各模型在单语与跨语言设置中均表现出强烈的解释收敛性,揭示多语言大模型虽在架构和语言上存在差异,却依赖于共享的语义抽象表征。这一发现强调,仅依赖语义相似性进行多语言叙事评估可能高估文化保真度,忽视了叙事表达中具有文化意义的结构性差异,因此亟需更全面的文化根基评估机制。

链接: https://arxiv.org/abs/2606.24610
作者: Jory Alshaalan,Haya Albaker,Abeer Aldayel,Aljawharah Alabdullatif,Rehab Alahmadi
机构: King Saud University (沙特国王大学)
类目: Computation and Language (cs.CL)
备注: This paper is under review

点击查看摘要

Abstract:The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework that integrates a cross-linguistic collection of 414 proverbs spanning 15 languages and uses four LLMs to generate 13k narratives. By employing semantically equivalent proverbs as culturally grounded prompts, the analysis assesses whether models preserve meaning across languages, how cross-lingual conditioning influences narrative realization, and whether different model families converge on similar interpretations. Results indicate that cross-lingual prompting largely preserves proverb-level semantic meaning while systematically redistributing agency, social positioning, and narrative structure. Additionally, strong inter-model convergence is observed in both monolingual and cross-lingual settings, suggesting that multilingual LLMs rely on shared semantic abstractions despite architectural and linguistic differences. These findings shed light on the need for more comprehensive evaluations of cultural grounding. Relying exclusively on semantic similarity in multilingual narrative assessments may overestimate cultural preservation by neglecting culturally meaningful variations in narrative expression.

[NLP-9] Qwen -Agent World: Language World Models for General Agents

【速读】: 该论文旨在解决当前通用智能体(general agent)在环境建模与推理规划能力上的瓶颈问题,特别是如何利用语言模型构建具备强泛化能力的世界模型(World Model),以推动智能体在复杂、多领域任务中的自主决策性能。其核心挑战在于传统世界模型难以有效捕捉真实世界动态的多样性与长程依赖关系,且缺乏对跨领域任务的适应性。解决方案的关键在于提出两个大规模语言驱动的世界模型——Qwen-AgentWorld-35B-A3B 和 Qwen-AgentWorld-397B-A17B,首次实现基于语言模型的多领域(7个)代理式环境模拟,并通过三阶段训练范式(CPT、SFT、RL)系统性地注入世界建模能力:首先通过上下文预训练(CPT)融合状态转移动态与专业语料以建立基础认知;再通过监督微调(SFT)激活下一状态预测的链式思维(chain-of-thought)推理能力;最后借助定制化的混合评分-规则奖励机制(hybrid rubric-and-rule rewards)进行强化学习(RL),显著提升模拟保真度。此外,研究进一步验证了世界模型在两类互补范式下的增益:作为解耦的环境模拟器,支持大规模可控仿真,显著优于仅依赖真实环境训练的强化学习方法;作为统一的代理基础模型,世界模型训练可作为高效预热(warm-up)策略,在7个代理基准上均带来下游任务性能的显著提升。为评估语言世界模型,研究还构建了基于5个前沿模型在9个基准上真实交互数据的综合性评测基准 AgentWorldBench,实证表明所提模型在多维度上显著超越现有先进水平。

链接: https://arxiv.org/abs/2606.24597
作者: Yuxin Zuo,Zikai Xiao,Li Sheng,Fei Huang,Jianhong Tu,Yuxuan Liu,Tianyi Tang,Xiaomeng Hu,Yang Su,Qingfeng Lan,Yantao Liu,Qin Zhu,Yinger Zhang,Bowen Yu,Haiquan Zhao,Haiyang Xu,Jianxin Yang,Jiayang Cheng,Junyang Wang,Lianghao Deng,Mingfeng Xue,Tianyi Bai,Yang Fan,Yubo Ma,Yucheng Li,Zeyu Cui,Zhihai Wang,Zhihui Xie,Zhuorui Ye,An Yang,Dayiheng Liu,Jingren Zhou,Ning Ding
机构: Qwen Team
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7 domains via long chain-of-thought reasoning. Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline: CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards. To evaluate language world models, we present AgentWorldBench, a comprehensive benchmark constructed from real-world interactions of 5 frontier models on 9 established benchmarks. Empirical results demonstrate that Qwen-AgentWorld significantly outperforms existing frontier models. (ii) Beyond foundation models, we further investigate two complementary paradigms through which world modeling enhances general agents. First, as a decoupled environment simulator, Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone. Second, as a unified agent foundation model, world-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks. Code: this https URL

[NLP-10] o Compare or Not to Compare: On Methodological Practices in Evaluating Social Bias

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在社会偏见评估中因方法学碎片化导致结论矛盾的问题。当前评估体系普遍忽视了基准测试的结构框架,从而引入了不可控的变量干扰。其核心解决方案是提出一个统一且可调控的评估框架,通过标准化异构基准数据集,系统性地对比孤立式评估与强制选择式比较设置之间的差异。该框架的关键在于能够解耦链式思维(Chain-of-Thought, CoT)推理、中立备选选项及其他结构性伪影对社会偏见评估结果的混杂影响。研究发现,模型在孤立评估中表现出较低的偏见激活水平,而在比较性设置下则显著加剧潜在歧视行为,这一转变主要由上下文信息不足所驱动。尤为关键的是,CoT推理在比较情境下会放大社会偏见,且即使提供中立的默认选项或声称随机回答,这种系统性偏见仍以确定性方式持续存在。此外,该比较性偏见具有随模型规模正向扩展的泛化特性。因此,论文提出重要方法论建议:研究人员应采用比较性设置以更稳健地检测隐藏偏见,但实践者在现实世界中面对模糊任务时,不可依赖此类比较部署,以免引发隐蔽的歧视风险。

链接: https://arxiv.org/abs/2606.24596
作者: Federico Marcuzzi,Xuefei Ning,Roy Schwartz,Iryna Gurevych
机构: INSAIT, Sofia University “St. Kliment Ohridski”, Bulgaria; Tsinghua University, China; The Hebrew University of Jerusalem, Israel; Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models are increasingly deployed in critical applications, robustly evaluating their social biases is paramount. However, the current literature suffers from widespread methodological fragmentation, which yields contradictory conclusions. This stems largely from ignoring the structural framing of benchmark-level evaluations. To resolve this, we introduce a unified and controllable framework that standardizes heterogeneous benchmarks to systematically contrast isolated demographic assessments with forced-choice comparative settings. Crucially, this allows us to disentangle the confounding effects of Chain-of-Thought reasoning, neutral fallback options, and other structural artifacts in social bias evaluations. Our evaluation across multiple model families reveals a massive, systematic paradigm gap: while isolated assessments limit prejudice activation, comparative settings act as aggressive catalysts for latent discrimination, a shift primarily driven by underspecified contexts. Alarmingly, CoT reasoning exacerbates social biases under comparative settings, and this systemic bias persists as a deterministic prejudice even when models are provided neutral fallback options or claim to answer randomly. Finally, we demonstrate that this comparative prejudice is a generalized phenomenon that scales positively with model size. Ultimately, we offer a crucial methodological guideline: while researchers must leverage comparative settings to robustly audit hidden biases, practitioners cannot safely rely on comparative deployments in ambiguous real-world tasks.

[NLP-11] MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

【速读】: 该论文旨在解决大语言模型(LLM)代理在长期记忆能力评估中缺乏直接、可审计的衡量标准的问题。现有方法通常通过下游行为(如任务完成率、个性化质量或后续回答准确性)间接评估记忆效果,但这种方法无法有效检验记忆本身是否准确保留了用户状态。为此,本文提出将长期记忆视为一个可审计的后交互产物——即在常规服务结束后,能否从代理留下的记忆中重建出用户的结构化状态。其解决方案的关键在于构建MEMPROBE基准测试框架:该框架基于合成真实数据,模拟50名用户,每名用户具有31个隐藏维度的用户状态库(共1,550个恢复目标),并通过受控泄露的任务序列让配备记忆的代理进行交互,随后在全存储和顶K检索两种模式下评估记忆中用户状态的可恢复性。实验表明,任务完成率与记忆可恢复性是两个独立的能力维度——尽管任务成功率接近饱和,但类别平衡的恢复率仅为约0.6,且在顶K检索下进一步下降。MEMPROBE是首个直接测量记忆恢复能力的基准,为未来记忆代理提供了明确优化目标,推动实现“越了解用户,越忠实于用户”的长期记忆系统发展。

链接: https://arxiv.org/abs/2606.24595
作者: Enze Ma,Yufan Zhou,Wei-Chieh Huang,Jie Yang,Huanhuan Ma,Zixuan Wang,Chengze Li,Chunyu Miao,Philip S. Yu,Zhen Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which tests that understanding only indirectly and leaves the memory artifact itself largely unaudited. We argue that long-term memory should instead be evaluated as an auditable post-interaction artifact: after ordinary assistance, what structured user state can be reconstructed from the memory the agent leaves behind? We instantiate this view in MEMPROBE, a benchmark in which a memory-equipped agent assists simulated users, each carrying a hidden, taxonomy-anchored user-state bank, across a trajectory of leak-controlled tasks, after which that bank is reconstructed from the agent’s resulting memory under both full-store and top-k access. Built on synthetic ground truth for efficient, scalable measurement, MEMPROBE spans 50 simulated users with 31 hidden dimensions each (1,550 recovery targets) and tests 5 representative memory systems. Testing state-of-the-art memory agents, we find that successful assistance and recoverable memory behave as distinct capabilities. Task completion nearly saturates, even for a memoryless baseline, while category-balanced recovery stays moderate (about 0.6) and drops further under top-k retrieval. MEMPROBE is the first benchmark to study memory recovery directly, reconstructing the user state a system retains and scoring it against ground truth. We see recovery as a concrete objective for future memory agents to optimize, and MEMPROBE as a step toward an environment where agents are trained to remember their users, growing more faithful the longer they know them.

[NLP-12] AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在对抗性评估中面临的两大核心挑战:如何生成具有挑战性的输入(hard inputs),以及如何可靠地验证模型失败的真实性。现有方法往往缺乏系统性的输入生成机制与可信的失败确认流程,导致评估结果不可靠或难以复现。为此,论文提出AdversaBench——一个端到端的红队测试(red-teaming)流水线,其关键在于采用五种结构化操作符对初始提示(seed prompts)进行变异,通过目标模型响应后,由三名裁判组成的评审团结合元裁判(meta-judge)作为平局裁决机制,实现对模型失败的可信确认。实验覆盖3个类别共45个种子样本(涵盖推理、指令遵循和工具使用),所有种子均成功引发可确认的失败。关键发现包括:不同操作符在不同任务类别间效果差异显著;二元失败率掩盖了实际难度,生存曲线揭示指令遵循类任务需平均2.4次攻击迭代,远高于其他类别;尽管裁判间成对一致性达80%-87%,但因标签分布偏斜导致Cohen’s kappa接近零,因此应关注类别级分歧率;此外,针对Llama 3.1 8B生成的对抗提示可零样本迁移至更大型号Llama 3.3 70B,表明这些变异主要利用了通用行为模式而非模型特异性弱点。该研究为构建可扩展、可验证的对抗性评估体系提供了方法论基础。

链接: https://arxiv.org/abs/2606.24589
作者: Khanak Khandelwal(Indian Institute of Technology Jodhpur)
机构: Indian Institute of Technology Jodhpur (印度理工学院乔德普尔)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 4 figures, 5 tables. Code and data at this https URL

点击查看摘要

Abstract:Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreement of 80-87% coexists with near-zero Cohen’s kappa due to label skew; category-level disagreement rates are more informative. Fourth, adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than model-specific weaknesses. Code, dataset, and analysis scripts are available at this https URL .

[NLP-13] Cross-Lingual Exploration for Parametric Knowledge

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中参数化知识在不同语言间访问不均的问题,即标准推理方法难以有效提取非母语语境下的局部事实信息,从而导致跨语言知识迁移与一致性表现不佳。其核心解决方案在于探索跨语言提示(cross-lingual prompting)策略,通过系统性分析影响参数化知识检索的四个内在维度,并在涵盖17种类型学多样语言的多语言事实基准上进行评估。研究发现,跨语言探索显著提升了知识迁移能力与事实召回率,相较于原生语言扩展,在计算效率上实现了更优的帕累托前沿;同时,跨语言一致性也得到明显改善,超出仅由准确率提升可解释的范围。因此,该研究确立了多语言提示探索作为一种高效且有效的推理阶段策略,能够充分激活模型中潜在的跨语言参数化知识。

链接: https://arxiv.org/abs/2606.24579
作者: Elisha Diskind,Itamar Trainin,Uri Shaham,Leshem Choshen,Idan Szpektor,Omri Abend
机构: 未知
类目: Computation and Language (cs.CL)
备注: 29 pages, 5 figures, preprint

点击查看摘要

Abstract:Parametric knowledge in Large Language Models is not equally accessible across languages. As a result, standard inference techniques often struggle to surface localized facts, leading to failures in cross-lingual knowledge transfer and consistency. In this work, we investigate techniques for accessing hidden factual knowledge by exploring cross-lingual prompting strategies. We identify four inherent dimensions of cross-lingual exploration that directly govern parametric knowledge retrieval and evaluate them on multilingual factual benchmarks covering 17 typologically diverse languages. Our results demonstrate that cross-lingual exploration significantly improves knowledge transfer and factual recall, representing a more efficient compute Pareto frontier than native-language scaling. Furthermore, we observe corresponding improvements in cross-lingual consistency, exceeding what can be explained by accuracy gains alone. Overall, our work establishes multilingual prompt exploration as a highly effective inference-time strategy for unlocking latent parametric knowledge.

[NLP-14] NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

【速读】: 该论文旨在解决当前人工智能编码代理(AI coding agents)在科研任务中仅能模仿(reproduction)而难以实现真正科学发现(discovery)的核心问题。现有基准测试因环境碎片化(environment fragmentation)导致评估结果不可靠,限制了对智能体真实科研能力的衡量。为此,论文提出NatureBench,一个基于NatureGym自动化流水线构建的跨学科基准,从经同行评审的《自然》系列期刊论文中提炼出90项真实科学任务,通过标准化、容器化的任务环境解决了环境异构性问题。其解决方案的关键在于:利用NatureGym将原始科研论文自动转化为可执行、可复现的结构化任务环境,并在严格禁用网络搜索的条件下评估前沿智能体配置。实验表明,最强模型仅在17.8%的任务上超越现有最优(SOTA),且成功主要依赖于将复杂科研问题转化为熟悉的监督预测任务(方法论翻译),而非真正的科学创新;失败主因是方法选择错误和计算资源不足,而非对任务理解偏差。研究公开了基准数据集、NatureGym工具链及维护者侧可复现的公共排行榜,推动可信的科研智能体评估发展。

链接: https://arxiv.org/abs/2606.24530
作者: Yuru Wang,Lejun Cheng,Yuxin Zuo,Sihang Zeng,Bingxiang He,Che Jiang,Junlin Yang,Yuchong Wang,Kaikai Zhao,Weifeng Huang,Kai Tian,Zhenzhao Yuan,Jincheng Zhong,Weizhi Wang,Ning Ding,Bowen Zhou,Kaiyan Zhang
机构: Horizon Research, Frontis.AI; Tsinghua University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: this https URL

[NLP-15] AGORA: An Archive-Grounded Benchmark for Agent ic Workplace Document Reasoning

【速读】: 该论文旨在解决生成式 AI 在复杂真实工作场景中进行档案依赖型推理(archive-grounded reasoning)的挑战,即从大规模、混乱且异构的工作文件集合中定位稀疏证据,协调术语、单位和时间表述不一致的问题,并最终得出准确结论。现有基准测试仅覆盖该任务的部分维度,缺乏对档案依赖性、智能体探索行为与跨领域泛化能力的联合评估。为此,研究提出 Agora 基准,其包含 362 个问题与来自八个领域的 9,664 篇真实文档(共 372M 标记),远超典型模型的上下文窗口,迫使智能体必须采取有策略的探索而非盲目扫描。Agora 的构建采用基于智能体的流水线,整合跨文档任务合成、防信息泄露的混淆处理及难度筛选机制。在对八种模型的评估中发现,该任务仍远未解决——即使最强模型准确率也仅达 59.4%,且在不同领域间表现差异显著,凸显当前系统在真实世界档案推理中的局限性。

链接: https://arxiv.org/abs/2606.24526
作者: Honglin Guo,Qi Zhang,Yu Zhang,Weijie Li,Rui Zheng,Zhikai Lei,Qiyuan Peng,Zhiheng Xi,Tao Gui,Qi Zhang
机构: Fudan University (复旦大学); Zhejiang University (浙江大学); Shanghai Qiji Zhifeng Co., Ltd. (上海启基智峰科技有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark pairing 362 questions with eight domain collections of 9,664 authentic documents and 372M tokens, far exceeding any model’s context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.

[NLP-16] Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

【速读】: 该论文旨在解决低资源语言(如土耳其语)中诈骗电话检测难题,因其缺乏标注数据且技术防御手段有限。现有研究多集中于英语等高资源语言,忽视了对弱势群体所面临的真实世界威胁的覆盖。其核心解决方案是构建首个公开的多模态数据集,包含100对对齐的音频-文本配对样本(涵盖诈骗与正常通话),并评估七种大型语言模型(LLMs)在三种输入条件下的表现:原始音频、自动语音识别(ASR)生成的文本转录以及由母语者修正后的文本转录。研究发现,基于文本的输入显著优于直接处理音频,且人工校正与未校正文本的表现相近,表明高质量文本表示在低资源场景下具有关键作用。该工作强调了在反欺诈AI安全研究中推动文化与语言包容性的重要性,并呼吁发展更稳健的多模态系统以应对跨语言诈骗威胁。

链接: https://arxiv.org/abs/2606.24523
作者: Arda Eren,Micheal Cheung,Youqian Zhang,Grace Ngai,Eugene Yujun Fu
机构: 1. Google(谷歌); 2. OpenAI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Poster paper accepted at 47th IEEE Security Privacy 2026

点击查看摘要

Abstract:Scam phone calls exploit vulnerable communities worldwide, yet research on detection has focused almost exclusively on English and other high-resource languages. In low-resource settings such as Turkish, detection is especially difficult, as annotated data is scarce and technological defenses remain limited. This research investigates how large language models (LLMs) can support scam detection in Turkish by introducing the first public multi-modal dataset of 100 aligned audio-transcript pairs of scam and benign conversations. We evaluate seven LLMs spanning three model families: Gemini 2.5 (Flash, Flash-Lite, Pro), GPT-4o, and Qwen (Max, Plus, Turbo), under three input conditions: raw audio, automatic speech-to-text transcripts, and transcripts refined by a native speaker. Our results suggest that transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. By centering a low-resource language and real world threat, this work highlights the urgent need for culturally and linguistically inclusive AI safety research and more robust multi-modal systems for fraud prevention.

[NLP-17] A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial

【速读】: 该论文旨在解决罕见病诊断中因临床专家资源稀缺、训练数据匮乏及现有生成式AI模型临床可部署性不足而导致的诊断延迟问题。其核心解决方案在于提出一个开源、轻量级的推理型大语言模型RaDaR(32B参数),通过结合49,170份公开自由文本病例与104,666份基于表型锚定的合成病例进行增强推理训练,显著提升了模型在罕见病诊断中的性能。RaDaR在多个公开基准测试和四个外部验证中心均表现出优于现有开源模型(包括671B参数的DeepSeek-R1)的能力;在回顾性队列研究中,其能在临床怀疑前优先推荐最终诊断,平均提前1.87个月,覆盖院内诊断间隔的50.18%;随机对照医生辅助试验显示,相较于仅使用互联网搜索,RaDaR可使医生诊断准确率提升21.44个百分点。此外,合成数据消融实验表明,基于表型的叙事性数据能为长尾罕见病提供有效训练信号,并呈现单调增长趋势。综上,RaDaR及其开发与验证框架构建了一个具备临床可部署性的罕见病推理模型,并为数据稀缺条件下的诊断类AI系统提供了可复现的开发范式。

链接: https://arxiv.org/abs/2606.24510
作者: Haichao Chen,Songchi Zhou,Zhengyun Zhao,Shikai Hu,Xianghong Jin,Hongwei Ji,Li He,Shuli Li,Yiming Qin,Xin Tan,Runfeng Shi,Yih Chung Tham,Jiaye Zhu,Ye Li,Ye Jin,Longhao Cao,Dawei Li,Honghan Wu,Hongqiu Gu,Guanqiao Li,Tudor Groza,Chunying Li,Dian Zeng,Weihong Yu,Gareth Baynam,Saumya Shekhar Jamuar,Min Shen,Shuyang Zhang,Bin Sheng,Sheng Yu,Tien Yin Wong
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 36 pages, 5 figures

点击查看摘要

Abstract:Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for rare disease diagnosis. RaDaR was trained with 49,170 publicly available free-text cases and 104,666 synthetic cases with reasoning-enhanced training. RaDaR showed the strongest performance among evaluated open-source models, including the 671B DeepSeek-R1, across public benchmarks and four external validation centers. In a retrospective cohort, RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06 percent of cases, corresponding to a potential lead time of 1.87 months and 50.18 percent of the within-center interval. In a randomized physician-assistance trial, RaDaR assistance improved physicians’ rare-disease diagnostic accuracy by 21.44 percentage points compared with internet search alone. Synthetic-data ablations suggested that phenotype-anchored narratives provide useful training signal for long-tail rare diseases, with a monotonic scaling trend within the tested data range. Together, RaDaR and its development and validation framework provide a deployable rare-disease reasoning model and a reproducible development framework for diagnostic AI under data scarcity.

[NLP-18] MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗场景中生成或处理文本时存在的错误检测与修正难题,尤其关注因细微错误可能导致患者安全风险的问题。现有方法如自动化检查和基于启发式规则的策略在面对未见过的数据集时泛化能力较差。为此,论文提出MedGuards——一种基于多智能体上下文学习框架的医疗安全防护机制,将医学错误检测与修正任务建模为多智能体协作问题:专用智能体分别负责错误检测、定位与修正,同时引入基于置信度的仲裁机制,通过推理轨迹和置信度分数协调不同智能体间的分歧。该设计显著提升了系统的可解释性、鲁棒性与适应性,且无需对基础大模型进行额外训练。此外,论文提出了关键词优先修正评分(Keyword-Prioritized Correction Score, KPCS),一种新评估指标,强调参考文本中关键术语的正确生成,相较于传统指标提供了更全面的评估维度。在包含临床笔记的四个多语言医疗数据集上的实验表明,所提框架在多个模型与指标上均取得显著提升,有助于实现大模型在真实医疗应用中的更安全部署。为保障可复现性,代码已公开发布。

链接: https://arxiv.org/abs/2606.25651
作者: Congbo Ma,Hu Wang,Yichun Zhang,Farah E. Shamout
机构: New York University Abu Dhabi (纽约大学阿布扎比分校); Khalifa University (哈利法大学); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety. Existing methods for error detection and correction, including automated checks and heuristic-based approaches, do not generalize well across unseen datasets. In this paper, we propose MedGuards as a medical safety guardrail, which is a new framework that treats medical error detection and correction as a multi-agent in-context learning task. Specialized agents separately detect, localize, and correct errors, while a confidence-guided arbitration mechanism resolves disagreements using reasoning traces and confidence scores. This design enhances interpretability, robustness, and adaptability, without requiring additional training of the base LLMs. Additionally, we introduce the Keyword-Prioritized Correction Score (KPCS), a new evaluation metric that considers whether critical keywords within the reference text are generated correctly, providing a more comprehensive assessment than conventional metrics. Experiments across four multilingual medical datasets consisting of clinical notes demonstrate significant improvements by the proposed framework across several metrics and models. Our aim is to enable safer deployment of LLMs in real-world healthcare applications. For reproducibility, we make our code publicly available at this https URL.

[NLP-19] Staying In Character: Perspective-Bounded Memory For Book-Based Role-Playing Agents

【速读】: 该论文旨在解决基于长篇小说的生成式角色扮演系统中存在的两大核心问题:事实越界(Factual Overreach)与风格单调(Stylistic Monotony)。前者指角色因共享检索或参数化记忆而使用其视角之外的事实,破坏叙事一致性;后者则源于角色档案描述导致角色语音与行为模式固化,丧失情境适应性。为应对上述问题,论文提出REVERIEMEM——一种三层次记忆架构,包含情景层(episodic layer)用于存储第一人称场景记忆、语义层(semantic layer)用于保存带可见性标签的事实信息,以及人格层(personality layer)用于建模情境依赖的言语与行为模式。该架构通过分层、视角约束的记忆机制,有效实现角色知识边界的精准控制与动态行为表达。在KBF-QA基准测试中,REVERIEMEM相较于最强基线方法,将知识边界保真度(Knowledge Boundary Fidelity)提升34.6个百分点;在BOOKWORLD五维叙事对战协议中,获得约79%的胜率,验证了其在保持角色视角边界与生成具身化叙事方面的显著优势。

链接: https://arxiv.org/abs/2606.25632
作者: Xushuo Tang,Junhe Zhang,Zihan Yang,Yifu Tang,Sichao Li,Longbin Lai,Zhengyi Yang
机构: UNSW Sydney (新南威尔士大学); University of Sydney (悉尼大学); Chang’an University (长安大学); RAIDS Lab; Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent LLM role-playing systems build character agents from novels by extracting characters, scenes, and relations. Yet long-narrative role-playing suffers from two failures: Factual Overreach, where shared retrieval or parametric memory lets a character use facts outside its perspective, and Stylistic Monotony, where profile descriptions flatten a character into a fixed voice. To address these failures, we propose REVERIEMEM, a three-layer memory architecture for book-based character agents. The episodic layer stores first-person scene memories; the semantic layer stores visibility-tagged facts; and the personality layer stores situation-dependent speech and behaviour patterns. For evaluation, we construct KBF-QA, a 4,386-question benchmark over eight novels for testing knowledge boundaries. REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points over the strongest prior method. On BOOKWORLD’s five-dimension pairwise narrative protocol, REVERIEMEM achieves a ~ 79% win rate, suggesting that perspective-bounded memory improves both boundary fidelity and character-grounded narrative generation.

[NLP-20] Constraint Tax in Open-Weight LLM s: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

【速读】: 该论文旨在解决现代智能体(Agent)系统中工具调用(Tool Calling)与结构化输出(Structured Output,基于JSON Schema约束)在联合部署时出现的工具调用抑制问题。其核心问题是:当两个功能同时启用时,多个开源大模型尽管仍能保持对JSON Schema的高合规性,却普遍停止调用工具,导致功能失效。解决方案的关键在于提出“透明双阶段执行”(Transparent Two-Pass Execution)策略——在推理阶段将工具调用与受约束的响应生成解耦,先生成符合Schema的结构化输出,再独立执行工具调用。该方法通过避免因语法级令牌掩码(grammar-based token masks)导致工具调用令牌不可达的问题,有效恢复了工具调用能力,同时维持了结构化输出的完整性,且无需模型重训练。研究进一步提出了约束优先级倒置(Constraint Priority Inversion, CPI)假说,解释在多重约束下模型可能优先满足格式合规性而非动作选择的行为倾向,揭示了当前评估范式忽视的可靠性隐患。

链接: https://arxiv.org/abs/2606.25605
作者: Fangzheng Li,Aimin Zhang,Chen Lv
机构: Focus AI Center, Focus Technology Co., Ltd.(焦点AI中心,焦点科技有限公司); Nanjing University of Science and Technology (南京理工大学)
类目: Computation and Language (cs.CL)
备注: 2 figures, 14 tables

点击查看摘要

Abstract:Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood. This paper reports a reproducible phenomenon observed in a production Agent system: when Tool Calling and JSON Schema constraints are simultaneously enabled, multiple open-weight models cease invoking tools despite maintaining high schema compliance. We refer to this behavior as Tool Suppression. Through controlled experiments across multiple model families and deployment settings, we consistently reproduce Tool Suppression under joint constraints, while tool execution and schema compliance remain functional when evaluated independently. Further analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, causing tool-call tokens to become unreachable during decoding. This provides an implementation-level explanation for the observed behavior. To interpret the phenomenon, we formulate the Constraint Priority Inversion (CPI) hypothesis, which suggests that schema satisfaction may dominate action-selection behavior under multiple simultaneous constraints. We present CPI as a behavioral hypothesis consistent with the observed evidence rather than a verified internal mechanism. To mitigate the problem, we propose Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from schema-constrained response generation. Experimental results show that this approach restores tool invocation while preserving structured output guarantees without requiring model retraining. These findings suggest that evaluating tool use and structured output separately may overlook important reliability issues in production Agent systems. Code, data, and docs will be released at this https URL.

[NLP-21] Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

【速读】: 该论文旨在解决大语言模型(LLM)在低资源语言如乌尔都语中数学推理能力严重不足的问题。现有进展高度依赖英语语料和评测基准,导致在乌尔都语等语言上的推理性能显著下降,其根本原因在于缺乏面向数学推理的专用数据集与适配模型。为应对这一挑战,本文提出Riazi-8B,一个通过两阶段适配流程构建的乌尔都语数学推理模型:首先在乌尔都语维基百科上进行持续预训练以增强语言理解能力,随后在基于GSM8K生成的乌尔都语思维链(Chain-of-Thought)数据上进行监督微调以强化多步推理能力。实验结果表明,Riazi-8B在MGSM-Ur都评测集上相较于现有乌尔都语指令调优模型,在答案正确性、推理质量、响应完整性和乌尔都语生成能力方面均实现显著提升。研究关键在于将语言层面的本地化适配与面向推理任务的精细化微调相结合,证明了该策略在扩展低资源语言数学推理能力方面的有效性。

链接: https://arxiv.org/abs/2606.25568
作者: Azher Ali,Ibtsam Haider,Raja Khurram Shahzad,Seemab Latif,Mehwish Fatima
机构: University of the Punjab(巴基斯坦旁遮普大学); National University of Sciences and Technology(巴基斯坦国家科学与技术学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent LLMs demonstrate strong mathematical reasoning capabilities, but existing gains rely heavily on English-centric training resources and benchmarks. As a result, reasoning performance degrades substantially in low-resource languages such as Urdu, where reasoning-oriented datasets and adapted models remain scarce. Urdu lacks both reasoning-oriented resources and models adapted for multi-step mathematical problem solving, limiting the applicability of recent progress to Urdu-speaking users. We address this gap through Riazi-8B, an Urdu mathematical reasoning model developed through a two-step adaptation process comprising continued pre-training on Urdu Wikipedia and supervised fine-tuning on Urdu Chain-of-Thought data derived from GSM8K. We evaluate Riazi-8B on MGSM-Urdu against existing Urdu instruction-tuned models. Our results show consistent improvements in answer correctness, reasoning quality, response completeness, and Urdu generation. Our findings demonstrate that combining Urdu language adaptation with reasoning-focused fine-tuning is an effective strategy for extending mathematical reasoning capabilities to low-resource languages.

[NLP-22] BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

【速读】: 该论文旨在解决基于分组的强化学习(Group-based RL)在训练长时序大语言模型(LLM)智能体时,因优势估计器存在状态-动作信用分配错配(state-action credit mismatch)而导致的信用分配偏差问题。现有方法如GiGPO依赖观察哈希划分行为组,其状态侧划分过细导致大量单例组(singleton groups)使步级信号消失,而动作侧采用组内均值则过于粗略,混淆了状态价值与动作特异性信用。为此,论文提出BiPACE(Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation),其核心创新在于双侧改进:首先,利用策略自身隐藏状态的余弦距离构建行为聚类(BiGPO),以经验性地近似双仿真(bisimulation)结构,显著降低由观察哈希带来的单例组比例;其次,引入动作条件的同伴基线(PACE),在每个行为簇内通过非参数化方式估计局部Q(s,a)−V(s)的差值,实现动作侧反事实信用估计。该方案无需额外评价值函数、辅助损失或新增采样轨迹,仅增加11.3%的训练步骤开销。实验表明,在ALFWorld/Qwen2.5-7B上,BiPACE_Q将验证成功率从GiGPO的90.8提升至97.1±0.9,且在所有种子下均突破95%阈值;在更小规模模型(Qwen2.5-1.5B)及WebShop、TextCraft任务上也全面超越GRPO和GiGPO。关键在于,BiPACE将优势估计的比较单元从表面观测等价性,转变为基于近似行为等价性与动作反事实的联合判断,从根本上缓解了信用分配失真问题。

链接: https://arxiv.org/abs/2606.25556
作者: Hanyang Wang,Weijieying Ren,Yuxiang Zhang,Ding Cao,Zhizhao Zeng,Ke Zeng,Tianxiang Zhao
机构: University of Chicago (芝加哥大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Stanford University (斯坦福大学); University of Science and Technology of China (中国科学技术大学); Meituan (美团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Stepwise group-based RL is an attractive way to train long-horizon LLM agents without a learned critic: it reuses multiple sampled rollouts to estimate local advantages. Its weakness is less visible but more fundamental: every group-relative estimator assumes that the steps it compares are equivalent for credit assignment. We show that current agentic variants violate this assumption through a state-action credit mismatch. The observation-hash partition is overly fine on the state side, creating singleton groups with zero step-level signal, while a single within-group mean is too coarse on the action side, mixing state-value estimation with action-specific credit. We introduce BiPACE (Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation), a drop-in advantage estimator that fixes both sides without adding a critic, auxiliary loss, or extra rollouts. BiGPO clusters steps by cosine distance in the actor’s own hidden-state geometry, an empirical policy-induced proxy for bisimulation that substantially lowers the singleton rate left by observation hashing. PACE then recenters returns within each behavioral cluster using action-conditioned peer baselines; its Q-style instance estimates a local Q(s,a)-V(s) nonparametrically. On ALFWorld/Qwen2.5-7B, BiPACE_Q raises overall validation success from GiGPO’s 90.8 to 97.1\pm0.9 over three seeds, and crosses the 95% threshold on every seed, which GiGPO never does within the same budget. On Qwen2.5-1.5B it reaches 93.5\pm1.2 versus GiGPO’s 86.7, and on WebShop and TextCraft it improves over GRPO and GiGPO at both model scales. The measured BiPACE-specific overhead is 11.3% of a single training-step wall time. Yet it changes the estimator’s comparison unit from surface identity to approximate behavioral equivalence plus action-side counterfactuals. The code is available at this https URL.

[NLP-23] SFL-MTSC: Leverag ing Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding INTERSPEECH2026

【速读】: 该论文旨在解决基于提示的语音语言理解(SLU)在使用大语言模型(LLM)时,因解码过程的随机性导致多意图场景下意图-槽位结构不一致的问题。其核心解决方案是提出一种语义框架级多任务自一致性(SFL-MTSC)结构化聚合框架,关键在于不再采用输出层面的多数投票机制,而是将预测结果分解为与具体意图相关的语义框架,通过领域-意图分组和槽位层级聚类,结合路径支持评分评估聚类可靠性,筛选并保留可信的语义框架后重新整合生成最终预测。该方法有效提升了槽位F1值和整体准确率,同时保持了意图准确率的稳定性。

链接: https://arxiv.org/abs/2606.25552
作者: Po-Yen Chen,Berlin Chen
机构: National Taiwan Normal University (国立台湾师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Interspeech 2026

点击查看摘要

Abstract:Prompt-based spoken language understanding (SLU) with large language models (LLMs) often suffers from inconsistent intent–slot structures due to decoding stochasticity, particularly in multi-intent scenarios. In view of this, we propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames, applies domain–intent grouping and slot-level clustering, and evaluates cluster reliability using path support scoring. Reliable frames are retained and re-integrated to form the final prediction. Zero-shot experiments on the MAC-SLU benchmark dataset show improved slot F1 and overall accuracy over single-path inference, while intent accuracy remains largely stable across most settings.

[NLP-24] Security and Privacy in Retrieval-Augmented Generation: Architectures Threats Defenses and Future Directions for Building Trustworthy Systems

【速读】: 该论文旨在解决生成式 AI(Generative AI)在引入外部知识增强大语言模型能力的过程中所面临的新型隐私与安全挑战。随着检索增强生成(Retrieval-Augmented Generation, RAG)成为主流范式,其在提升事实准确性与跨领域适应性的同时,也暴露了源于检索索引、查询日志、上下文构建及联邦更新等环节的敏感信息泄露风险,并面临知识库被恶意篡改导致生成结果可信度受损等问题。解决方案的关键在于构建覆盖检索、上下文构造与生成全链路的统一威胁面分类体系,系统识别成员推断、索引推断、投毒攻击、梯度泄漏及协同攻击等典型攻击类型,并综合评估基于架构设计、算法优化与密码学防护的防御策略,重点关注隐私-效用权衡与实际部署约束,最终推动可信赖、安全且鲁棒的 RAG 系统在真实场景中的落地应用。

链接: https://arxiv.org/abs/2606.25533
作者: Balamurugan Palanisamy,G S S Chalapathi,Vikas Hassija,Rajkumar Buyya
机构: Birla Institute of Technology and Science, Pilani(比尔拉科技与科学学院,皮拉尼校区); KIIT University (基伊特大学); The University of Melbourne (墨尔本大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a dominant paradigm for enhancing large language models with external knowledge. By coupling retrieval mechanisms with generative models, RAG systems improve factual grounding and adaptability across domains. However, integrating retrieval pipelines introduces new security and privacy risks that extend beyond conventional language modeling threats. Sensitive information may be exposed through retrieval indices, query logs, context construction, or federated updates, while adversarial manipulation of knowledge bases can undermine trust in generated outputs. This survey provides a comprehensive examination of privacy and security challenges across RAG systems deployed in centralized, on-device (Micro-RAG), federated, and hybrid paradigms. We present a unified taxonomy of threat surfaces spanning the retrieval, context construction, and generation stages and systematically analyze attack classes, including membership inference, index inference, poisoning, gradient leakage, and collusion. We further review architectural, algorithmic, and cryptographic defenses, highlighting privacy-utility trade-offs and deployment considerations. Finally, we outline open research challenges toward building trustworthy, secure, and resilient RAG systems for real-world applications.

[NLP-25] Evaluating LLM s on Real-World Software Performance Optimization

【速读】: 该论文旨在解决软件性能优化中缺乏真实场景基准测试的问题,现有方法往往局限于孤立函数或单一性能指标,忽略了执行时间与内存占用之间的关键权衡、测量环境的固有噪声,以及输入数据和运行条件变化带来的不确定性。其解决方案的关键在于提出SWE-Pro——一个基于102个开源项目中专家编写的真实优化案例构建的仓库级基准测试。SWE-Pro通过为每个任务配备参数化测试,能够在噪声感知的测量条件下,评估不同输入数据和执行环境下的运行时、峰值内存及时间加权内存使用量(TWMU),从而更全面地反映真实优化过程。实验表明,当前生成式AI(Generative AI)在性能优化上表现极差:运行时加速几乎为零,内存优化近乎无效;而专家实现则平均达到15.5倍的提速和171.3倍的峰值内存降低,且在91.2%的任务中改善了运行时,在65.7%的任务中降低了峰值内存。研究揭示了当前生成式AI能力与专家级工程实践之间存在显著差距。

链接: https://arxiv.org/abs/2606.25530
作者: Ezgi Sarıkayak,Wenchao Gu,Hesham Ghonim,Chunyang Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often oversimplify the problem by focusing on isolated functions or a single performance metric, missing the critical trade-offs between execution time and memory footprint, the inherent noise of the measurement environment, and the variability introduced by different input data and execution conditions. We address this by introducing SWE-Pro, a repository-level benchmark derived from 102 expert-written optimizations from open-source projects. Unlike previous benchmarks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage (TWMU) across varying input data and execution conditions under noise-aware measurement conditions. Our evaluation shows that current LLMs struggle significantly: runtime gains are negligible, and memory optimizations are nearly non-existent. This stands in sharp contrast to expert implementations, which achieve an aggregate speedup of 15.5x and peak memory reduction of 171.3x over benchmark tasks. Expert-written improvements are observed in 91.2% of tasks for runtime and 65.7% for peak memory. Our findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

[NLP-26] Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理任务中存在个体推理路径分歧的问题,即同一问题下不同生成轨迹可能分别成功或失败,而现有方法无法精确定位导致失败的初始触发点。其核心解决方案是提出“悬崖令牌”(cliff token)的概念——即在局部上下文中,当令牌级潜在值(token-wise potential)显著下降且突破自适应阈值时所对应的特定令牌,该阈值基于一元两比例Z检验动态调整以匹配局部潜在分布。实验证明,悬崖令牌是引发推理失败的关键触发点:删除首个悬崖令牌并重新采样可使通过率(pass@64)恢复至1.0,而保留则仅能恢复至0.71–1.00之间。研究进一步构建了悬崖分类体系,包括确定性(deterministic)、不确定性(uncertain)和采样偏离(sampled-off)三类,依据贪婪选择策略与令牌熵定义,三类具有不同的概率特性且在不同模型规模间具备泛化能力。通过在悬崖位置进行单令牌偏好优化(Cliff-DPO),在GSM8K数据集上训练后,该方法在多个数学推理基准上将准确率提升最高达+6.6%,其中对不确定性和采样偏离类悬崖的优化显著改善了推理质量,而确定性悬崖优化效果不明显,验证了分类体系的有效性与可操作性。

链接: https://arxiv.org/abs/2606.25524
作者: Jaeyong Ko,Pilsung Kang,Yukyung Lee
机构: Seoul National University (首尔国立大学); Boston University (波士顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) reach high accuracy in mathematical reasoning, but individual traces on the same problem diverge; some arrive at the correct answer while others fail. Prior work analyzes failure at the step, chunk, or sentence level, or at tokens where failure has already occurred. Neither identifies the precise token that triggers the shift toward failure. We introduce the cliff token, a token where the token-wise potential drops significantly under an adaptive threshold that scales with the local token-wise potential, based on a one-sided two-proportion z-test. Across seven models and three mathematical reasoning benchmarks (GSM1K, MATH500, AIME 2025), cliff tokens act as failure triggers; deleting the first cliff token and resampling recovers pass@64 to 1.0, while keeping it limits recovery to between 0.71 and 1.00. We further introduce a cliff taxonomy of deterministic, uncertain, and sampled-off cliffs, defined by greedy choice and token entropy. Each type has distinct probabilistic characteristics, and the taxonomy generalizes across model scales. Finally, we validate the taxonomy via single-token preference optimization at cliff positions (Cliff-DPO). Trained on GSM8K, Cliff-DPO improves accuracy across benchmarks by up to +6.6. Optimizing at uncertain and sampled-off cliffs improves reasoning, while deterministic cliffs do not.

[NLP-27] Fault of Our Stars: Behavioral Drivers of Rating-Sentiment Incongruence

【速读】: 该论文旨在解决在线评论中评分(star rating)与文本情感之间存在的不一致性问题,即“情感-评分不一致”(sentiment-rating incongruence)。研究聚焦于斯里兰卡旅游景点的评论数据,发现18.6%的评论存在评分与文本情感相悖的现象,主要表现为“保守评分者”(Conservative Rater)和“强制五星”(Obligatory 5-Star)两种行为模式。其解决方案的关键在于采用基于Transformer的文本情感分析流水线,独立于用户所给评分,对评论文本进行情感判定,从而客观识别不一致情况。通过统计检验、逻辑回归、随机森林及SHAP可解释性分析,研究进一步揭示了场所类型、用户专业度、评论长度和时间因素是导致评分与文本情感偏离的重要驱动变量。研究结论强调:星评不能简单等同于文本情感,必须在自然语言处理(NLP)任务中验证其作为真实标签的有效性,否则将引入系统性偏差。

链接: https://arxiv.org/abs/2606.25518
作者: Ramanaish Abaiyan,Ruththiragayan Sutharsan,Kusal Amantha,Anusan Krishnathas,Asma Rauff,Kovindarajah Sriyathurshan,Patalee Narasinghe,Nirasha Munasinghe,Nisansa de Silva,Sandareka Wickramanayake
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 3 figures. Submitted to MerCon 2026

点击查看摘要

Abstract:When people share experiences online, they often express thoughts in two ways: a star rating and a written review. In sentiment analysis, ratings are widely used as convenient weak labels for textual sentiment, yet whether the two actually agree is rarely questioned. This study investigates sentiment-rating incongruence, where the sentiment expressed in review text differs from the sentiment implied by the assigned star rating, in Sri Lankan tourism attraction reviews. A dataset of 16,156 reviews from 2010 to 2023 is analyzed using a transformer-based sentiment pipeline that derives textual sentiment independently of assigned ratings. Incongruence occurs in 18.6% of reviews and falls into six directional patterns, with Conservative Rater and Obligatory 5-Star behaviors accounting for the majority of mismatches. Prevalence also varies across venue types, with museums showing the highest rates. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as contributors to rating-text divergence. Overall, this study demonstrates that star ratings are not interchangeable with textual sentiment and should be validated before being treated as ground-truth labels in NLP.

[NLP-28] Spam and Sentiment Detection in Arabic Tweets Using MARBERT Model

【速读】: 该论文旨在解决沙特电信公司(STC)用户满意度提升的瓶颈问题,尤其针对其在阿拉伯语社交媒体(如Twitter)上用户反馈中情感分析不足的现状。现有研究多集中于英文语境下的情感分析,而针对阿拉伯语的深度学习模型仍存在显著空白。为此,本文提出基于MARBERT预训练模型的改进方案,利用包含24,513条阿拉伯语推文的标注数据集(涵盖正面、负面、中性、讽刺及不确定五类情感)进行训练,并采用F1分数、精确率与召回率作为评估指标。该方案的关键在于充分利用阿拉伯语领域适应型预训练模型MARBERT,结合深度学习中的双向编码器表示技术,在非英语语境下实现高精度的情感分类,从而有效挖掘用户反馈中的真实情绪倾向,为优化STC客户服务质量提供数据驱动的决策支持。

链接: https://arxiv.org/abs/2606.25495
作者: Abrar Alotaibi,Atta-ur Rahman,Raheel Alhaza,Wala Alkhalifa,Narjes Alhajjaj,Atheer Alharthi,Dhai Abushoumi,Maryam Alqahtani,Dania Alkhulaifi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Saudi Telecom Company (STC) is among the most popular companies in Saudi Arabia, with many customers. Yet, there is still a big room for improvement in users’ satisfaction. Social media is the most robust platform to gauge users’ satisfaction and determine their sentiments and critics. Twitter is among the most popular social media platform in this regard. STC customers prefer to use Twitter to write their feedback because it’s a fast way to get responses due to the STC customer services account. One way to achieve customer demands and improve customer service is using the Sentiment Analysis tool. Sentiment Analysis on Twitter is highly used because of the significant number of tweets and the different opinions. Likewise, Deep learning is the best existing Sentiment Analysis method, and it has diverse models. Bidirectional Encoder Representations from Transformers (BERT) model is one of the deep learning models which have achieved excellent results in Sentiment Analysis for Natural Language Processing (NLP). NLP is mainly investigated in the English language. However, for Arabic, there is a significant gap to be filled. This study trained the proposed model using MARBERT and measured the performance using f1-score, precision, and recall metrics. We trained the model with an Arabic dataset of 24,513 tweets, including 1,437 positive, 13,828 negative, 5,694 neutral, 1,221 sarcasm, and 2,297 indeterminate tweets. The main goal is to analyze the tweets and get the sentiment to improve STC customer service. The proposed scheme is promising in terms of accuracy in contrast to existing techniques in the literature.

[NLP-29] How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

【速读】: 该论文旨在解决大语言模型(LLM)越狱攻击与提示注入攻击研究中普遍存在的评估可靠性问题,即当前广泛使用的攻击成功率(ASR)依赖于自动化评判器(judge),而这些评判器本身缺乏验证与鲁棒性分析。其核心问题是:现有自动化评判器在判断内容危害性时存在显著偏差与不稳定性,导致报告的ASR数值不可靠,尤其在面对对抗性攻击时表现脆弱。解决方案的关键在于系统性地检验两类评判器——专用安全分类器与以LLM为评判主体的生成式评判器——的表现,并揭示其本质缺陷:前者虽具备高召回率但存在严重过检(精确率0.835,召回率0.974),后者则表现出极高的精确率但召回率波动剧烈(0.06至0.65),导致相同响应在不同评判器下产生截然不同的ASR结果;此外,针对两类评判器的对抗性攻击表明,仅通过添加良性框架或拒绝语句即可使多数LLM评判器误判(翻转率57%–100%),而专用分类器虽对表面攻击更具鲁棒性,但仍可被白盒梯度攻击(GCG)在小预算下成功欺骗,使70%的高置信度真实有害样本被错误判定为无害。实验进一步通过双标注审计确认所有攻击均未消除原始有害内容。因此,论文提出应强制要求在论文中报告评判器在人工标注数据集上的精确率与召回率、对ASR进行评判器精度校正,并引入对抗性测试以验证评判器的可靠性,从而提升相关研究的可信度。

链接: https://arxiv.org/abs/2606.25487
作者: Yang Gao(Veyon Solutions)
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 10 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite ways. The dedicated classifier over-flags (precision 0.835, recall 0.974); three different LLM-as-judges keep high precision (0.81 to 0.94) but show erratic recall (0.06 to 0.65), so the same responses produce very different ASR depending on which judge scores them. The two families also differ sharply in robustness. Wrappers that leave the harmful text untouched and only add benign framing flip every LLM-judge between 57% and 100% of the time, and a single prepended refusal sentence accounts for much of this (39% to 88%). The dedicated classifier resists these surface attacks (at most 6.7%), but a white-box GCG attack on its open weights flips 70% of confident true positives (21 of 30; 95% CI 54 to 86%) even at a small optimization budget. A two-annotator audit confirms the attacks leave the harm intact: every one of 80 sampled flips still contained the harmful content. Because a large and growing share of reported ASR comes from LLM-judges, many such numbers are unreliable both on average and under deliberate pressure. We recommend that papers report judge precision and recall on a human-labeled slice, report ASR corrected for judge precision, and include an adversarial check of the judge. Our code is released.

[NLP-30] A Red Teaming Framework for Large Language Models : A Case Study on Faithfulness Evaluation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险应用场景中可靠性、安全性和可信度不足的问题,尤其关注模型输出中存在的不忠实性(unfaithfulness)等潜在漏洞。其解决方案的关键在于提出一种新颖的多角色对抗评估框架,包含目标模型(target)、攻击者模型(attacker)和评审模型(jury)三部分协同工作:攻击者模型生成日益高效的对抗性提示(adversarial prompts),而评审模型则严格评估响应在不同任务中的准确性与一致性。该框架通过系统化暴露模型弱点,揭示了结构约束(如摘要格式限制)对模型脆弱性模式的影响,并发现模型架构设计选择对安全性的影响显著超过参数量扩展。此外,该方法具备跨任务、跨语言的可扩展性,适用于从英文问答到阿拉伯语摘要等多种场景,为持续评估模型安全性能提供了可复用的范式。尽管在多语言对抗提示自动生成方面仍存在挑战,且难以捕捉非显式事实矛盾的细微不忠实现象,但整体上为识别当前LLM漏洞提供了具有实践价值的分析工具与系统性评估路径。

链接: https://arxiv.org/abs/2606.25476
作者: Abrar Alotaibi,Raed Mughus,Moataz Ahmed
机构: King Fahd University of Petroleum and Minerals (KFUPM); Imam Abdulrahman Bin Faisal University (IAU)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint submitted to SQJ

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and trustworthiness. In this paper, we present a red teaming framework that systematically uncovers vulnerabilities in LLM outputs. Our approach employs a novel multi-role architecture comprising target, attacker, and jury models. The attackers generate increasingly effective adversarial prompts while the jury rigorously evaluates response accuracy and consistency across tasks. In a case study, our strategy proved particularly effective at exposing unfaithfulness in LLM responses. Exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks, revealing weaknesses in reliability. The approach identifies how structural constraints in summarization can shape vulnerability patterns, with format limitations yielding measurable gains in faithfulness, and shows that architectural design choices typically outweigh parameter scaling in determining model safety. The framework’s key strength is its adaptability across evaluation tasks, from English question-answering to Arabic summarization, enabling comprehensive comparison of model vulnerabilities. While it excels at comparing cross-model and cross-linguistic vulnerabilities, it faces challenges in fully automating adversarial prompt generation across languages. Our experiments also reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions, particularly across linguistic contexts. Overall, this architecture provides both actionable insights into current LLM vulnerabilities and a scalable methodology for ongoing safety evaluation as models evolve.

[NLP-31] Optimizing Abstractive Summarization With Fine-Tuned PEGASUS

【速读】: 该论文旨在解决生成式文本摘要(Abstractive Text Summarization)中模型性能优化的问题,特别是在多语言语料库上的表现提升。针对现有方法在英文摘要任务中仍存在生成质量不足的挑战,本文提出对PEGASUS模型进行微调,以在XL-Sum英文语料库上超越基线模型mT5的性能。其解决方案的关键在于利用XL-Sum数据集对PEGASUS进行针对性微调,并通过ROUGE指标评估生成摘要与人工摘要之间的相似性。实验结果表明,微调后的PEGASUS模型在多个评价指标上均实现显著提升:相较基线模型,ROUGE-1得分提高4.04%,ROUGE-2得分提升15.25%,ROUGE-L得分提升3.39%,验证了该方法在生成更准确、更贴近人类表达的摘要方面的有效性,达到了该语料库上的最新技术水平。

链接: https://arxiv.org/abs/2606.25462
作者: Sadiul Arefin Rafi,Naimur Rahman,Kazi Nazibul Islam,Ha-mim Ahmad,Farig Yousuf Sadeque
机构: BRAC University (布拉克大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Abstractive text summarization is the technique of generating a short and concise summary comprising the salient ideas of a source text without making a subset of the salient sentences from the source text. The introduction of transformer models such as BART, T5, and PEGASUS has made this sort of summarization process more efficient and accurate. The objective of this paper is to fine-tune PEGASUS on the XL-Sum English corpus to achieve a better performance compared to the baseline mT5 model. The performance of the generated summaries from the fine-tuned model is evaluated using the ROUGE metric, which basically compares the auto-generated summaries with human-created summaries. To the best of our knowledge, the results from our fine-tuned PEGASUS model give a state-of-the-art performance on the XL-Sum English Corpus. To quantify the improvement, there is a 4.04% improvement in the ROUGE-1 score, a 15.25% increase in the ROUGE-2 score, and a 3.39% improvement in the ROUGE-L score from the baseline model.

[NLP-32] Probing in the Wild: A Case Study of Self-Supervised Speech Representations on Mandarin Sub-dialects with Unsupervised Articulatory Analysis

【速读】: 该论文旨在解决自监督语音模型在细粒度方言变异下的内部音位表征行为尚不明确的问题。现有探针研究多依赖人工标注的语料库,难以推广至自然发生的方言语音数据。其解决方案的关键在于提出一种完全无标签的探针流程:通过语言无关的通用音素识别器生成音素序列,并将其映射为发音特征向量,从而实现无需人工标注即可进行帧级探针分析。实验结果揭示了普通话次方言间发音特征可解码性的结构化模式,其中声学显著特征(如唇音性和嘶音性)相对稳定,而与精细频谱差异相关的特征则表现出更大的方言依赖性变化;这种变化主要由北京话相较于其他次方言更高的可解码性驱动。层间分析进一步表明,不同特征组具有不同的表征动态。研究证实,语言无关的发音特征探针可有效应用于真实世界方言语料,且自监督语音表征对方言的敏感性在发音维度上分布不均。

链接: https://arxiv.org/abs/2606.25459
作者: Shu Shang,Fuliang Weng,Zeqian Hu,Yaqian Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While self-supervised speech models have achieved strong performance across speech tasks, relatively little is known about how their internal phonetic representations behave under fine-grained dialect variation. Existing probing studies typically rely on curated corpora with manual phonetic annotations, limiting their applicability to naturally occurring dialect speech. We present a case study of articulatory feature representations in a Mandarin self-supervised speech model using an entirely unlabeled probing pipeline. Phone sequences are generated using a language-agnostic universal phone recognizer and mapped to articulatory feature vectors, enabling frame-level probing without manual annotation. Our results reveal a structured pattern in articulatory feature decodability across Mandarin sub-dialects. Acoustically salient features such as labiality and stridency remain comparatively stable, whereas features associated with finer spectral distinctions exhibit larger dialect-dependent variation. This variation is driven primarily by elevated decodability for Beijing speech relative to other Mandarin sub-dialects. Layer-wise analyses further show distinct representational dynamics for these feature groups. These findings suggest that language-agnostic articulatory probing can be applied to real-world dialect corpora and that dialect sensitivity in self-supervised speech representations is unevenly distributed across articulatory dimensions.

[NLP-33] he Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms ICML2026

【速读】: 该论文旨在解决传统评估范式下学习算法泛化能力被简化为单一聚合指标所掩盖的根本问题,即:从某一特定示例中学习的知识在多大程度上能够推广至其他示例。这一“逐样本泛化”(per-sample generalization)机制类似于人类认知中的类比学习,但长期以来未被标准基准所捕捉。其解决方案的关键在于提出“泛化谱”(Generalization Spectrum)评估框架,通过构建一系列按转移距离逐步增加的测试变体——从精确复现、跨语言实现迁移、完全叙事重构下的上下文迁移、类别匹配的域内问题,直至无配对基线——系统性地追踪模型在不同迁移距离下的性能表现。该框架不仅揭示算法是否学习,更量化了学习成果的泛化范围。研究以编程竞赛任务为例,采用基于近期题目筛选与合成的管道以避免数据污染,并对比了三种典型学习范式:强化学习(RL)在相同记忆水平下比监督微调(SFT)家族基线更高效地实现近端迁移;而提示学习(ICL)虽表现出强迁移能力,但高度依赖输入对应关系。进一步分析显示,局部性能提升并不必然扩大泛化半径:抽象信息与提示主要增强局部迁移,相对基准的参考式微调(RFT)保持更强的远端迁移尾部性能,而自蒸馏或提示辅助的强化学习即使提升了局部表现或优化效率,反而可能削弱远端泛化能力。

链接: https://arxiv.org/abs/2606.25450
作者: Jinghan Zhang,Zerui Cheng,Shiqi Chen,Ge Zhang,Wenhao Huang,Jiashuo Liu,Junxian He,Tianle Cai
机构: ByteDance(字节跳动)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted at ICML 2026. 30 pages, 6 figures

点击查看摘要

Abstract:Traditional evaluations measure a learning algorithm’s final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a specific example generalize to others? Such per-sample generalization, akin to learning by analogy in human cognition, captures how far the knowledge extracted from one example can transfer, yet remains invisible to standard benchmarks. We introduce the Generalization Spectrum, an evaluation framework designed to expose this hidden dimension. For each training example, we construct a controlled suite of test variants arranged by increasing transfer distance, from exact recall to implementation transfer across languages, context transfer under complete narrative re-framing, category-matched in-domain problems, and an unpaired baseline. By tracking performance across these distances, we reveal not just whether an algorithm learns, but how far that learning extends. We instantiate this framework on competitive programming, using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. We first compare three canonical learning paradigms under matched memorization. RL converts memorization into near-transfer more efficiently than SFT-family baselines, while ICL exhibits strong but correspondence-dependent transfer. We then use the Spectrum to diagnose within-family variants. The resulting profiles show that local gains need not expand the generalization radius: abstractions and hints mainly lift local transfer, RFT preserves a stronger far-transfer tail than reference SFT, and self-distillation or hint-assisted RL can reduce far transfer even when local transfer or optimization improves.

[NLP-34] Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

【速读】: 该论文旨在解决大语言模型在具备记忆机制时可能出现的“脆弱记忆”(brittle memory)问题,即模型因保留错误结论而忽略原始输入信息,导致其输出错误答案时表现出高度自信,反而在无记忆状态下能选择不回答以避免错误。这种现象在七个不同模型中均持续存在,且方向不可逆转,表明其本质是行为层面的缺陷,而非底层信息存储的即时性限制。解决方案的关键在于引入“源优先策略”(source-first policy),即在有限记忆预算下优先保留可重新计算的原始输入源(source),而非可重构的推论结果(conclusion)。通过压缩漂移对话并测试纠错能力,研究发现可纠正性取决于决定答案的源头是否存活,而非模型自身能力。采用该策略可在相同预算下恢复模型的可纠正性,实验显示一提示词部署版本可实现0.49–0.88的回收率,显著优于长度匹配的对照组。当错误在记忆循环中传播时,源优先策略仍能维持固定的预算边界,防止错误蔓延;而传统方法则导致下游步骤逐步累积不可修复的偏差。该机制在三个实际部署的记忆系统及真实对话数据集(MultiWOZ)中均复现有效,但一旦原始输入源超出预算无法容纳,修复将无声失效,除非显式记录完整性。研究为可控机制分析,采用无评判器精确评分、同预算控制和故意设计为假的验证器,确保结论可靠性,并已公开实验框架、条件与验证工具。

链接: https://arxiv.org/abs/2606.25449
作者: Alex Kwon
机构: Independent Researcher(独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages, 3 figures. Code, data, and reproduction harness: this https URL

点击查看摘要

Abstract:A language model’s memory can be worse than having no memory at all. Give a model a memory that kept a wrong conclusion but dropped the work behind it, and it emits that stale value as a confident answer; give the same model an empty memory and it abstains. Across seven models this direction never reverses, a clean kill condition that none breaks. We call this brittle memory: behavioral, not the near-immediate information bound beneath it; only its magnitude is disposition- and task-dependent, not its direction. We measure it with reclaim evaluation: compress a drifted interaction at a fixed budget, then test whether a correction recovers the known answer, scored against ground truth with no judge. Correctability is bottlenecked by whether the answer-determining source survives, not by capability. A one-line source-first policy (keep the recomputable source, drop the re-derivable conclusion) restores correctability at equal budget where that source is compact and identifiable; a length-matched control rules out added text as the cause. The hand-built oracle reaches 1.00; a one-prompt deployable version reclaims 0.49-0.88. The stake compounds: chained through a memory loop, a single dropped-source error corrupts a growing span of downstream steps and stays uncorrectable, while source-first holds to a bounded budget horizon. The wall and fix replicate across three deployed memory systems and on real dialogue (MultiWOZ), and past the budget where the source no longer fits, the fix fails silently unless the note records completeness. This is a controlled study of a mechanism, not a benchmark: judge-free exact scoring, matched-budget controls, and validators built to come out false. We release the harness, conditions, and validators.

[NLP-35] he Interplay of Harness Design and Post-Training in LLM Agents

【速读】: 该论文旨在解决工具集成型大语言模型(LLM)智能体在实际部署中面临的动态环境适应性问题,尤其是现有后训练(post-training)方法通常假设环境静态,而忽视了工具环境与任务在部署过程中可能发生显著变化的现实挑战。其核心问题是:当前主流框架将“工具调用框架”(harness)视为固定工程实现,缺乏对框架设计本身作为可调控变量的考量,导致智能体在面对分布外(out-of-distribution, OOD)的工具环境或任务变更时性能急剧下降。解决方案的关键在于将harness从静态配置转变为可控的设计维度,并构建支持任务与工具环境动态变化的评估基准——基于此,提出并验证了“感知框架的后训练”(harness-aware post-training)方法。实证结果表明,该方法不仅提升了模型在原始分布内的性能,更显著增强了其在复杂、动态环境下的鲁棒适应能力,尤其在低设计投入的简化框架下,传统后训练策略表现严重退化,进一步凸显了考虑框架设计影响的重要性。

链接: https://arxiv.org/abs/2606.25447
作者: Kyungmin Kim,Youngbin Choi,Seoyeon Lee,Suhyeon Jun,Dongwoo Kim,Sangdon Park
机构: POSTECH(韩国浦项科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tool-integrated LLM agents are often wrapped within a harness: the scaffolding that determines which tools are exposed, how they are described, and what auxiliary information accompanies each per-step observation. While agents are routinely post-trained, this scaffolding is typically treated as a fixed engineering detail, with design effort limited to the training-free regime. Moreover, existing post-training algorithms assume a static environment, even though tool environments and tasks often shift upon deployment. To address this gap, we extend \textttALFWorld (i) to treat the harness as a controllable design dimension and (ii) to support evaluation under task and tool environment shifts. Building on this, we systematically analyze how the harness design influences post-training in both in-distribution and out-of-distribution (OOD) settings. We empirically show that harness-aware post-training not only improves in-distribution performance but also enables agents to robustly adapt to OOD settings. Under a harness with minimal design effort, post-training suffers a drastic performance drop under stronger tool environment shifts, further highlighting the importance of harness-aware post-training under such shifts.

[NLP-36] PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中面临的安全对齐问题:当安全要求以自然语言政策形式动态更新时,传统依赖高质量监督数据(如安全示范或偏好对)的对齐方法难以及时响应,导致政策演进与数据驱动对齐方法之间存在严重滞后。其解决方案的关键在于提出PolicyAlign框架,通过直接基于安全政策进行对齐,实现高效、可扩展的安全适应。核心机制包括:首先生成违反政策的指令以增强模型对政策边界的敏感性,随后采用在策略自蒸馏(on-policy self-distillation)的方式使模型内化政策引导的行为;为进一步提升训练稳定性和数据效率,引入政策敏感性过滤(Policy-Sensitive Filtering),筛选出政策引发最大行为偏移的指令,从而聚焦于最具学习价值的样本。实验表明,PolicyAlign在多个模型上均能显著提升安全性,同时保持较低的过度拒绝率并维持模型通用能力,且具备在医疗、法律、金融等复杂领域泛化的能力,展现出作为可维护、可扩展的政策驱动型安全对齐方案的巨大潜力。

链接: https://arxiv.org/abs/2606.25442
作者: Chang Wu,Junfeng Fang,Houcheng Jiang,Kai Tang,Pengyu Cheng,Xiaoxi Jiang,Guanjun Jiang,Xiang Wang
机构: Alibaba(阿里巴巴); National University of Singapore (新加坡国立大学); Zhongguancun Academy (中关村学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often specified as natural-language policies, while corresponding supervision data may be costly, delayed, or unavailable. This creates a mismatch between rapidly evolving safety policies and conventional data-driven alignment methods. To address this, we propose PolicyAlign, a simple yet effective framework for directly aligning LLMs with safety policies. Given a safety policy, PolicyAlign first synthesizes policy-violating instructions and then performs on-policy self-distillation to internalize policy-guided behavior. To improve training stability and data efficiency, we further introduce Policy-Sensitive Filtering, which selects instructions where the policy induces the largest behavioral shift. Experiments across multiple models show that PolicyAlign consistently improves safety while maintaining low over-refusal and preserving general capabilities. PolicyAlign also generalizes to medical, legal, and financial safety scenarios, highlighting its potential as a scalable and maintainable approach to policy-based LLM safety alignment. The code is released at this https URL.

[NLP-37] Beyond Next-Observation Prediction: Agent -Authored World Modeling for Sequential Decision Making

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在世界建模过程中,传统以“下一观测预测”作为学习目标所导致的监督信号与代理实际决策需求脱节的问题。这种传统方法将监督依赖于环境状态转移的具体内容,可能忽略了对当前决策至关重要的动态信息。为弥补这一缺陷,论文提出了一种名为代理自述世界建模(Agent-Authored World Modeling, AAWM)的训练范式,其核心在于从代理自身的决策需求生成监督信号。具体而言,在每个状态中,代理主动识别其在行动前所需理解的环境动态,基于这些需求跨轨迹检索相关状态转移证据,并将其合成为目标训练信号,从而捕捉面向决策的动态特征,而非简单重建下一观测。该方法的关键创新在于将学习目标对齐于代理执行动作前所需的动态知识,而非仅关注观测序列的完整性,显著提升了世界模型在多环境和多训练设置下的有效性,实验表明决策感知的世界模型目标相比传统方法提供了更优的学习信号。

链接: https://arxiv.org/abs/2606.25421
作者: Guangfeng Cai,Kaibing Yang,Shuo He,Yu Li,Shengtian Yang,Jiaqi Lv,Lei Feng
机构: Southeast University(东南大学); Meituan(美团)
类目: Computation and Language (cs.CL)
备注: 16 pages, 4 figures, 6 tables

点击查看摘要

Abstract:Recent studies on world modeling for Large Language Model (LLM) agents typically formulate the learning objective as next-observation prediction. However, this objective ties supervision to what a transition happens to reveal, which may omit the dynamics most relevant to the agent’s current decision. To bridge this gap, we propose Agent-Authored World Modeling (AAWM), a training procedure that constructs supervision from the policy’s own decision needs. Specifically, at each state, the agent identifies what it needs to understand about the environment before acting. These needs drive the retrieval of relevant transition evidence across trajectories, which is then synthesized into training targets that capture decision-oriented dynamics instead of reconstructing the next observation. This aligns the training objective with the dynamics the policy needs before acting, not with the contents of the next observation. Experimental results validate the effectiveness of AAWM across multiple environments and training settings. These results show that decision-aware world-model targets provide a more effective learning signal than next-observation prediction.

[NLP-38] Introducing corpora Hlava Cor and Hlava AD: Human Label Variation in Coreference and Discourse Relations

【速读】: 该论文旨在解决文本连贯性(text coherence)标注中人工标注者分歧(annotator disagreement)的内在差异问题,揭示人类在理解语篇连贯性时存在的个体认知差异。其核心解决方案在于构建两个大规模多标注的捷克语语料库:第一个语料库包含1,024个上下文,由三位标注者并行标注核心指代关系(coreference),涵盖代词、名词短语及回指副词等不同语法-语义类别;第二个语料库包含512个上下文,由五位标注者并行标注属性性与非属性性结构中的语篇关系(discourse relations)。两套语料库均实现了约60%-65%的标注者间一致性水平,且分析表明,在自动核心指代消解模型产生分歧的案例中,人工标注的一致性更低,说明此类样本对人类标注者同样具有更高的模糊性或难度。此外,通过分析标注者的解释性注释,研究进一步揭示了标注者在语篇理解上的解释差异、判断信心程度不一以及个体阅读策略的多样性,为理解人类标注行为的主观性提供了实证依据。

链接: https://arxiv.org/abs/2606.25383
作者: Anna Nedoluzhko,Šárka Zikánová,Jiří Mírovský,Milan Straka,Eva Hajičová
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to SLiDE 2026

点击查看摘要

Abstract:As previous research on annotator disagreement in discourse phenomena has shown, understanding text coherence varies considerably from one individual to another. To explore this phenomenon, we created two corpora with multiple annotations of Czech texts, accompanied by annotators’ explanations of their choices. The first corpus consists of 1,024 contexts annotated in parallel by three annotators. It captures differences in the identification of coreference across various text types and grammatical-semantic categories, including pronouns, full noun phrases, and anaphoric adverbials. The second corpus comprises 512 contexts, annotated in parallel by five annotators, and focuses on identifying discourse relations in attributive and non-attributive constructions. Both corpora achieve a comparable inter-annotator agreement of approximately 60-65%. For coreference annotation, agreement tends to be lower in cases where automatic coreference resolution models disagree, suggesting that when the models disagree, the examples tend to be more difficult or ambiguous for human annotators to interpret. The annotators’ comments, both for coreference and discourse relations, further reveal differences in interpretation, varying levels of confidence in text understanding, and individual reading strategies.

[NLP-39] A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models ACL

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, LLMs)在不同语言与文化背景下安全行为不一致的问题,尤其关注毒性内容检测与去毒化技术的跨语言适用性。其核心挑战在于,攻击者可通过语言选择、翻译中间环节、语码转换、拼写变体、多轮交互及部署后微调等手段,绕过模型的安全对齐机制。解决方案的关键在于构建兼顾多语言覆盖与文化敏感性的系统性框架,涵盖多种任务范式(如毒性到中性文本重写、毒性分类、生成毒性评估),并整合跨语言编码器、翻译流水线、表示层探测和基于大模型的检测器等多模态检测方法。同时,提出从数据过滤、监督与偏好驱动微调、解码阶段引导、表示编辑到多语言防护墙在内的多层次缓解策略。然而,当前仍面临语言覆盖不均、危害定义的文化依赖性、评估标准碎片化以及去毒过程可能压制合法方言或身份表达等根本性难题。

链接: https://arxiv.org/abs/2606.25380
作者: Soham Dan,Himanshu Beniwal,Thomas Hartvigsen
机构: Scale AI( Scale AI); Indian Institute of Technology Gandhinagar(印度理工学院甘吉纳格尔分校); University of Virginia(弗吉尼亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted to the Findings of ACL, 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed across languages, but their safety behavior remains uneven across linguistic and cultural contexts. This survey synthesizes work on toxicity detection and detoxification for multilingual LLMs. We first catalogue threat models that exploit language choice, translation pivots, code-switching, orthographic variation, multi-turn interaction, and post-deployment fine-tuning to weaken safety alignment. We then organize task formulations (toxic-to-neutral rewriting, toxicity classification, and toxic-generation evaluation), multilingual detection approaches (cross-lingual encoders, translation pipelines, representation-level probes, and LLM-based detectors), and mitigation strategies spanning data filtering, supervised and preference-based tuning, decoding-time steering, representation editing, and multilingual guardrails. Across these areas, we identify persistent challenges: uneven language coverage, culturally contingent definitions of harm, fragmented evaluation protocols, and the risk that detoxification suppresses legitimate dialectal or identity-related expression.

[NLP-40] Story Operators: Decomposing the Original to Sequel Transformation in Embedding Space

【速读】: 该论文旨在解决文学续作(sequel)生成过程中的内在结构演化问题,即从几何角度刻画一部小说如何通过文本空间中的变换转化为其续作。其核心挑战在于揭示续作创作中隐含的、可解释的语义与结构变迁模式。解决方案的关键在于将每部作品视为嵌入空间中的一个点(基于all-mpnet-base-v2的段落嵌入),并计算原作与续作之间的位移向量 $ d = \bar{x}{\text{seq}} - \bar{x}{\text{orig}} $,随后利用主成分分析(PCA)在两部作品自身段落基础上构建内容基底,并对位移向量进行贪婪分解,得到一系列具有可解释性的轴。这些轴以真实文本片段为极点,代表了从原作到续作转变过程中最关键的语义方向。研究发现,续作的演化可分为三类:公式化(低秩微小变化)、集中式(单一主导轴)和组合式(多个小尺度轴)。以《汤姆·索亚历险记》到《哈克贝利·费恩历险记》为例,主导轴反映的是叙事结构从庇护性家庭生活向流浪式流浪史诗的转变,而非表面主题如方言或奴隶制,这表明深层结构转型先于表层主题展开,且路径穿越“冒险-旅程”空间而非趋向通用现实主义。研究通过对比马克·吐温写给豪威尔斯的信件证实了这一几何重构的合理性,量化了其创作意图在实际文本演变中所覆盖的程度,整体方法具备完全可复现性。

链接: https://arxiv.org/abs/2606.25379
作者: W. Frederick Zimmerman
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:I treat a book as a point in a sentence-embedding space and a literary transformation as an operation on points. Given an original novel and its sequel, I ask what it takes, geometrically, to turn the first into the second. Using all-mpnet-base-v2 paragraph embeddings drawn from a precomputed index of the PG19 corpus, I form the displacement d=\barx_\rm seq-\barx_\rm orig and greedily decompose it along a content basis obtained by PCA over the two books’ own paragraphs. Each component is an interpretable axis anchored by real passages at its poles. Across thirteen verified author pairs from Project Gutenberg, the decomposition reveals a small taxonomy of sequels: formulaic (a tiny, low-rank change: Doyle’s Holmes collections, |d|=0.12 ), concentrated (one dominant axis: Alcott’s Little Women \to Little Men, 75% on a single move), and compositional (many small axes: Twain, Burroughs’s Barsoom, Nesbit). For the canonical case, Tom Sawyer \to Huckleberry Finn, the dominant recovered axis is structural – the collapse of sheltering domesticity into a picaresque road – rather than the famous surface themes of vernacular voice or slavery, which ride later, smaller axes; and the transformation routes through adventure-journey space rather than diluting toward generic realism. I corroborate the recovered geometry against Twain’s documented authorial intent (his 1875–76 letters to Howells), which names the first-person picaresque move years in advance, and I quantify, with an explicit representation caveat, how much of the realized transformation his stated intentions span. All computations are reproducible from the released scripts and data.

[NLP-41] Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

【速读】: 该论文旨在解决生成式语音合成(Text-to-Speech, TTS)系统在日语场景下表现不足的问题,尤其针对日语特有的语言挑战——即汉字(kanji)的上下文依赖性多音现象(context-dependent kanji polyphony)。现有基于大语言模型(Large Language Model, LLM)的TTS系统主要聚焦于英语和中文,对日语的支持相对薄弱。为应对这一问题,本文提出Sarashina2.2-TTS,其核心解决方案在于双轨并行的设计:一是通过大规模数据策略,构建约361小时的日语与英语混合训练数据集,并设计针对性的数据增强流程,覆盖日本文化厅规定的全部2,136个常用汉字(Joyo kanji),以有效提升多音字消歧能力;二是引入“常用汉字读音基准测试集”(Joyo Kanji Yomi Benchmark),涵盖所有2,136个常用汉字及其4,378种读音,并提出“假名字符错误率”(Kana-CER)作为评价指标,在假名空间内消除拼写变体干扰,精准衡量发音正确性。实验表明,该方法显著提升了汉字级发音准确率,整体达到当前最优的汉字级读音准确水平,且在句子级发音质量上媲美顶尖基线系统,同时在零样本日语语音合成中实现了最高的说话人相似度。跨语言评估进一步验证了其鲁棒性:无论输入提示语言为何,系统均能保持稳定的日语发音表现,证明了平衡训练策略在提升跨语言一致性方面的有效性。

链接: https://arxiv.org/abs/2606.25369
作者: Lianbo Liu,Shiao Zhu,Kai Washizaki,Reo Yoneyama,Haesung Jeon,Mengjie Zhao,Yusuke Fujita,Hao Shi,Nao Yoshida,Yuan Gao,Roman Koshkin,Yukiya Hono,Yui Sudo
机构: SB Intuitions( SB Intuitions)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique linguistic challenges, such as widespread context-dependent kanji polyphony, have yet to be adequately tackled. Here we introduce Sarashina2.2-TTS (this https URL), a Japanese-centric LLM-TTS system that tackles these challenges through a dual approach: data strategy and evaluation methodology. First, we scale training to approximately 361k hours of speech, incorporating a balanced mix of Japanese and English data. Furthermore, we design a targeted data augmentation pipeline covering all 2,136 Joyo (regular-use) kanji designated by Japan’s Agency for Cultural Affairs to efficiently address kanji polyphony disambiguation. Second, we introduce the Joyo Kanji Yomi Benchmark (this https URL), covering all 2,136 Joyo kanji and their 4,378 readings. Alongside this benchmark, we propose Kana-CER, a metric that compares synthesized speech against reference readings in the kana space, eliminating orthographic variations to directly measure pronunciation correctness. Experiments demonstrate that our targeted data augmentation significantly improves reading accuracy. Overall, Sarashina2.2-TTS achieves state-of-the-art kanji-level reading accuracy and matches top baselines on general sentence-level pronunciation, while delivering the highest speaker similarity in zero-shot Japanese speech synthesis. Furthermore, cross-lingual evaluation reveals that Sarashina2.2-TTS is the only system that maintains stable Japanese pronunciation regardless of the prompt language, confirming that our balanced training approach improves cross-lingual robustness.

[NLP-42] Neural Machine Translation for Low-Resource Tangkhul–English

【速读】: 该论文旨在解决低资源语对Tangkhul-English(nmf-en)的机器翻译问题,其中Tangkhul是一种严重缺乏自然语言处理基础设施的藏缅语系语言。其核心挑战在于训练数据极度稀缺,且缺乏标准化的文本处理工具。解决方案的关键在于采用基于Transformer架构的预训练模型进行微调:主系统采用ByT5-large模型,在38,336句平行语料上进行微调,实现了较高的翻译性能(如BLEU 39.97、chrF++ 58.07、BERTScore F1 0.8104及COMET 0.7302),而对比系统则使用mT5-small模型以验证模型规模与性能的关系。研究还揭示了拉丁字母拼写中特有的变音符号(diacritics)带来的拼写规范性挑战,并指出训练数据存在宗教文本、故事和对话等领域的分布偏差,强调未来通过数据多样化和领域自适应策略可进一步提升模型泛化能力。

链接: https://arxiv.org/abs/2606.25365
作者: Chormi Zimik Vashai,Agniva Maiti
机构: KIIT University (基伊特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 9 tables

点击查看摘要

Abstract:We present a study on low-resource machine translation for the Tangkhul-English (nmf-en) language pair. Tangkhul is a severely under-resourced Tibeto-Burman language spoken primarily in Manipur, India, with virtually no prior natural language processing infrastructure. We describe two systems: (1) a primary system based on ByT5-large fine-tuned on 38,336 Tangkhul-English parallel sentence pairs, and (2) a contrastive system based on mT5-small fine-tuned on the same corpus. Our primary ByT5-large system achieves a corpus BLEU score of 39.97, chrF++ of 58.07, BERTScore F1 of 0.8104, and COMET (wmt22-comet-da) of 0.7302 on a held-out test set of 3,856 sentences. We further discuss the orthographic challenges specific to Tangkhul’s Latin-script diacritics, the domain bias of our training corpus (which comprises biblical text, stories, and conversational data), and avenues for future improvement through data diversification and domain adaptation.

[NLP-43] Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

【速读】: 该论文旨在解决生成式语言模型在测试时扩展(test-time scaling)过程中存在的核心矛盾:传统的链式思维(chain-of-thought)采样虽能提升推理能力,但其长序列采样过程仍为单线程执行,难以并行化;而基于句子或解级的搜索方法虽然具备更强的探索能力,却面临计算开销大、难以端到端训练的问题。为此,论文提出一种名为局部分支路由(Local Branch Routing, LBR)的令牌级测试时扩展框架,其关键在于通过构建一个小型局部前瞻树(local lookahead tree),并利用轻量级路由器对候选路径的隐藏状态进行路由决策,从而在保持离散分支身份的前提下,实现高效的多路径并行评估与选择。该方法通过在每个令牌决策阶段引入对候选未来路径隐藏状态的判别性信息,使决策不仅依赖于当前根节点的下一个词分布,还能利用更丰富的上下文证据,同时避免了全局解空间的完整搜索。所提出的“剪枝-转移-生长”(prune-shift-grow)解码机制保留了树形轨迹的可追踪性,并定义了可计算的树路径似然函数——新生成节点首次被采样时计入,路由决策赋予显式概率,从而支持基于似然比原则的端到端强化学习优化,联合训练基础模型与路由模块。实验表明,在合成层次规划任务中,后候选隐藏状态提供了有效的路由依据;在数学推理基准上,LBR显著优于离散链式思维、原始离散令牌强化学习(RLVR)以及兼容强化学习的软令牌分支基线,在Pass@1和Pass@32指标上均取得提升,验证了轻量级局部分支是一种高效、可训练且保持离散性的语言模型测试时扩展新范式。

链接: https://arxiv.org/abs/2606.25354
作者: Yutong Yin,Mingyu Jin,Jin Pan,Changyi Yang,Zijie Xia,Dhruv Pai,Shuming Hu,Zhen Zhang,Chenyang Zhao,Jinman Zhao,Wujiang Xu,Raymond Li,Xin Eric Wang,Julian McAuley,Zhaoran Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Test-time scaling improves language-model reasoning, but existing approaches often face a difficult trade-off: long chain-of-thought sampling remains single-threaded, while sentence- or solution-level search can be computationally expensive and hard to train end-to-end. We introduce Local Branch Routing (LBR), a token-level test-time scaling framework that expands a small local lookahead tree, forwards all sampled branches through the language model, and uses a lightweight router to select the depth-1 subtree to commit. By routing over the hidden states of candidate local futures, LBR allows each token decision to use evidence beyond the root next-token distribution while avoiding full solution-level search. The resulting prune-shift-grow decoding process preserves discrete branch identities and defines a tractable tree-trajectory likelihood: newly grown nodes are counted when first sampled, and router decisions are assigned explicit probabilities. This enables end-to-end reinforcement learning with verifiable rewards, jointly optimizing the base model and router under the same likelihood-ratio principle as discrete-token RLVR. On synthetic hierarchical-planning tasks, LBR shows that post-candidate hidden states provide useful routing evidence. On mathematical reasoning benchmarks, LBR improves both Pass@1 and Pass@32 over discrete chain-of-thought, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines. These results suggest that lightweight local branching offers an efficient, trainable, and discrete form of language-model test-time scaling.

[NLP-44] Hybrid-IR: Dual-Path Hybrid Retrieval with Iterative Reasoning for Complex Medical Question Answering

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生物医学领域应用中普遍存在的幻觉(hallucination)和知识过时问题,尤其针对复杂医学问答(Medical Question Answering, QA)任务。现有基于检索增强生成(Retrieval-Augmented Generation, RAG)的方法存在两大局限:其一,医学知识通常分散于多份文档中,而主流RAG方法依赖单一检索路径,难以同时保留细粒度语义信息与全局结构化关联;其二,静态检索策略难以支持复杂医学推理所需的深度认知过程。为此,本文提出一种双路径迭代检索-推理框架——Hybrid-IR,其核心创新在于融合图结构检索(用于挖掘结构化知识)与密集向量检索(用于细粒度语义匹配),并通过迭代的“检索-推理”循环逐步优化推理轨迹,实现对复杂医学问题的精准、可解释性回答。实验在三个主流医学QA基准上验证了该方法的有效性。

链接: https://arxiv.org/abs/2606.25338
作者: Sheng Wan,Jiahui Zhang,Zicheng Zhao,Shougang Ren
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown promising performance across a wide range of biomedical applications, including medical question answering (QA), yet they remain prone to hallucinations and outdated knowledge. Although retrieval-augmented generation (RAG) can alleviate this issue by incorporating external documents, there still exist two fundamental limitations. First, medical knowledge is often fragmented across documents, while most RAG methods rely on a single retrieval path, which makes it challenging to jointly preserve fine-grained semantic information and structured global associations. Second, static retrieval strategies are typically insufficient to support deep reasoning that is important in complex medical QA. In this paper, we present a dual-path retrieval framework with an iterative retrieval-reasoning mechanism termed “Hybrid-IR” for complex medical QA. The proposed Hybrid-IR integrates graph-based retrieval for exploration of structured knowledge and dense retrieval for fine-grained semantic matching. Moreover, the reasoning trajectory can be progressively refined through an iterative retrieve-reason loop. Experiments on three widely used medical QA benchmarks demonstrate the effectiveness of our Hybrid-IR.

[NLP-45] Improved Large Language Diffusion Models

【速读】: 该论文旨在解决传统大型语言模型(Large Language Models, LLMs)普遍采用自回归因子分解(autoregressive factorization)与因果注意力(causal attention)所导致的生成效率低下及上下文信息利用不充分的问题。其核心解决方案是提出iLLaDA,一个从零开始训练的80亿参数(8B)掩码扩散语言模型(masked diffusion language model),采用全双向注意力机制(fully bidirectional attention),并在预训练和监督微调(SFT)阶段均保持掩码扩散目标。关键创新包括:将预训练规模扩展至12万亿(12T)token,基于250亿token指令语料进行12轮微调;引入变长生成策略以提升效率,并采用基于置信度的评分方法优化多选题评估。实验表明,iLLaDA在通用、数学和代码等多类基准测试中显著优于同规模自回归模型(如LLaDA),例如iLLaDA-Base在BBH上提升21.6分,在ARC-Challenge上提升14.9分;iLLaDA-Instruct在MATH上提升14.5分,在HumanEval上提升16.5分。尽管采用非自回归训练方式,iLLaDA仍可媲美Qwen2.5 7B。结果证明,从零开始的全双向扩散训练是一种具备竞争力的强语言模型构建路径。

链接: https://arxiv.org/abs/2606.25331
作者: Shen Nie,Qiyang Min,Shaoxuan Xu,Zihao Huang,Yuxuan Song,Yong Shan,Yankai Lin,Wayne Xin Zhao,Chongxuan Li,Ji-Rong Wen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present \emphiLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and codes: this https URL.

[NLP-46] Multilingual Hematology Visual Question Answering Dataset

【速读】: 该论文旨在解决当前医学影像分析中视觉语言模型(VLMs)在多语言医疗环境下的适用性问题,尤其针对南亚地区(特别是巴基斯坦)存在的英语主导的医疗信息资源与本地广泛使用的乌尔都语之间存在的语言鸿沟。现有血液学领域的视觉语言数据集主要以英文为中心,难以支持非英语背景下的临床应用。为应对这一挑战,研究提出WBCMor VQA——一个面向白细胞形态学分析的双语(英语-乌尔都语)视觉问答(VQA)基准,其关键在于引入形态学感知的标注和领域特定的乌尔都语血液学词典,确保语言一致性与临床准确性。该基准包含11万对双语问答数据,覆盖2万张正常及白血病单细胞图像,为多语言医疗AI系统的开发提供了高质量、临床验证的数据支持。通过在该基准上评估多个开源视觉语言模型,研究建立了基线性能,推动了可访问且具临床相关性的多语言医疗人工智能系统的发展。

链接: https://arxiv.org/abs/2606.25246
作者: Hajra Malik,Hafiza Tooba Aftab,Abdul Rehman,Mohsen Ali,Waqas Sultani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Vision Language Models (VLMs) have shown promising capabilities in medical image analysis by jointly understanding visual and textual information for tasks such as Visual Question Answering. However, existing hematology vision-language resources remain predominantly English centric, limiting their applicability in multilingual healthcare environments. This challenge is releveant generally to South Asia and specifically to Pakistan, where Urdu is widely used despite healthcare information and digital medical systems being largely dependent on English. To investigate this gap, we conducted a survey among healthcare professionals, which revealed substantial language mismatches between clinical documentation and patient communication, emphasizing the need for multilingual healthcare technologies. To address this limitation, we introduce WBCMor VQA, a clinically validated bilingual English, Urdu morphology aware VQA benchmark for leukemia and normal white blood cell analysis. The benchmark is constructed using morphology-aware annotations from LeukemiaAttri and WBCAtt datasets and supported by a domain specific Urdu hematology dictionary to ensure linguistic consistency and clinical correctness. The final benchmark contains 110K bilingual question answer pairs serving as VQA annotations for 20K leukemic and normal single-cell images. Furthermore, we establish baseline performance by evaluating multiple open-source VLMs on the proposed benchmark. The proposed resource aims to facilitate the development of accessible and clinically relevant AI systems for multilingual healthcare environments.

[NLP-47] owards Structuring an Arabic-English Machine-Readable Dictionary Using Parsing Expression Grammars WWW

【速读】: 该论文旨在解决阿拉伯语-英语《Al-Mawrid》词典在机器可读性方面的结构性缺失问题,即传统印刷词典未为机器处理设计,缺乏标准化的微观结构(microstructure),难以直接支持自然语言处理(Natural Language Processing, NLP)等应用。其核心解决方案是提出一种基于解析表达式语法(Parsing Expression Grammars, PEG)的级联式解析方法,将原始词典条目中无结构的字符流转化为具有层次结构的机器可读格式。该方法通过识别并组织词典条目中的子条目(subentries)、定义短语、领域标签、交叉引用及翻译对应关系等组件,显式地重构了词典内容的内部结构。研究表明,尽管阿拉伯语词典缺乏统一的微观结构标准,但通过自动或半自动方式诱导其结构仍可实现较高准确率,为后续词典的智能化利用提供了可行路径。

链接: https://arxiv.org/abs/2606.25231
作者: Diaa Mohamed Fayed,Aly Aly Fahmy,Mohsen Abdelrazek Rashwan,Wafaa Kamel Fayed
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures, 7 tables. The final publication is available at this https URL . Published in International Journal of Computational Linguistics Research (IJCLR), DLINE, March 2014, Vol 5, Issue 1, pp 1-13

点击查看摘要

Abstract:Dictionaries are rich sources of lexical information about words that is required for many applications of natural language processing and human language technology. However, publishers prepare printed dictionaries for human usage not for machine processing. This paper presented a method to structure partly a machine-readable version of the Arabic-English Al-Mawrid dictionary. The method converted the entries of Al-Mawrid from a stream of words and punctuation marks into hierarchical structures. The hierarchical structure expresses the components of each dictionary entry in explicit format. A dictionary entry is composed of subentries and each subentry consists of defining phrases, domain labels, cross-references, and translation equivalences. We designed the proposed method as cascaded steps where parsing is the main step. We implemented the parser using the parsing expression grammars formalism. In conclusion, although Arabic dictionaries do not have microstructure standardization, this study demonstrated that it is possible to structure them automatically or semi-automatically with plausible accuracy after inducing their microstructure.

[NLP-48] ASAP: Agent -System Co-Design for Wall-Clock-Centered Auto HPO Research for ML Experiments

【速读】: 该论文旨在解决超参数优化(Hyperparameter Optimization, HPO)中的样本效率问题,核心挑战在于如何在有限的计算预算内高效找到性能优异的超参数配置。现有基于大语言模型(LLM)的HPO方法虽展现出提升每轮迭代性能的潜力,但存在两大根本性局限:其一,将LLM作为单一工具替代传统优化器,受限于预训练目标所赋予的单一归纳偏置,难以应对多样化任务带来的分布漂移;其二,仅以迭代次数为评估指标,忽略了实际运行中每轮需串行执行LLM推理与工具调用所带来的延迟,导致迭代效率提升无法转化为端到端的时钟时间(wall-clock time)优化。针对上述问题,本文提出ASAP——一种代理-系统协同设计框架。其关键创新在于:在代理层面,利用LLM动态集成多种具有不同归纳偏置的优化器,并在每轮选择最优提案;在系统层面,通过前缀稳定的提示设计最大化键值缓存(KV-cache)复用,采用基于相对误差接受准则的推测并行机制隐藏LLM与工具的延迟,同时引入离关键路径的自调优模块根据执行日志动态调整推测阈值。大量实验表明,ASAP在多样化的现代HPO任务上持续优于基线方法,验证了多工具融合与代理-系统协同设计在提升实际运行效率方面的显著价值。

链接: https://arxiv.org/abs/2606.25207
作者: Taicheng Guo,Haomin Zhuang,Kehan Guo,Yujun Zhou,Nitesh V. Chawla,Olaf Wiest,Xiangliang Zhang
机构: University of Notre Dame(圣母大学); United States(美国)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hyperparameter Optimization (HPO) is essential for maximizing machine learning model performance, and its core challenge is sample efficiency: finding strong configurations within a limited budget. Because every HPO tool relies on a surrogate prior that imparts its own inductive bias, individual tools struggle once problems become sufficiently diverse and drift from these priors. Motivated by the reasoning and generalization capabilities of LLMs, recent work has explored using LLMs for HPO and reports improved per-iteration performance. Yet these methods share two limitations with a common origin: they use the LLM as a single-tool replacement evaluated by iteration count. (i) Deployed in place of prior tools, the LLM is itself constrained by its pretraining objective to one family of inductive-biased proposals; this single-source setup still fails to handle the full diversity of problems. (ii) Per-iteration evaluation ignores that, in real runs, LLM inference or tool execution is paid serially on top of model evaluation every round, so iteration-count gains do not translate into end-to-end wall-clock gains. We present ASAP, an agent-system co-design that addresses both limitations. On the agent side, ASAP uses the LLM to integrate a diverse pool of inductive-biased optimizers and to select among their proposals each round. On the system side, ASAP re-architects the loop to reduce end-to-end wall-clock while preserving regret quality: a prefix-stable prompt maximizes KV-cache reuse across rounds; speculation parallelism hides the remaining LLM and tool latency under model evaluation via a relative-error accept test; and a Self-Tuner adapts the speculation threshold from execution logs off the critical path. Extensive experiments on diverse modern HPO tasks show that ASAP consistently outperforms baselines, underscoring the value of tool integration and agent-system co-design.

[NLP-49] RAVEN: Long-Horizon Reasoning Navigation with a Visuo-Spatio-Temporal Memory

【速读】: 该论文旨在解决长期机器人部署中面临的记忆系统挑战,即如何构建一种紧凑且可扩展的记忆机制,以保留细粒度的视觉语义信息、将观测结果在时空维度上进行精准定位,并支持高效的数据存储与检索。其核心问题在于现有基于图像描述(caption-based)的内存系统存在语义损失和效率瓶颈,难以满足长时程任务对高精度时空感知与快速响应的需求。解决方案的关键在于提出RAVEN——一种面向长时程机器人问答与导航的代理型记忆系统(agentic memory system),通过将视觉嵌入(visual embeddings)与位姿(pose)及时间戳一同存储于向量数据库,并结合空间地图实现检索的时空锚定,从而直接在视觉嵌入层面进行操作,避免了有损的图像到文本生成过程。该方法不仅实现了语义、空间与时间维度上的精确检索,还在多个模拟与真实世界视频问答基准测试中显著优于传统基于描述的内存系统,在长时程任务上达到前沿视觉语言模型(VLMs)的性能水平,同时检索成本降低10倍。最终,该系统成功部署于Unitree Go1机器人,在多个大型室内环境中实现了自然语言目标导向的长时程导航任务。

链接: https://arxiv.org/abs/2606.25206
作者: Yixun Hu,Zhicheng Zheng,Lihan Zha,Chunwei Xing,Rajdeep Singh,Omar Hossain,Antonio Loquercio,Dhruv Shah
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project website: this https URL

点击查看摘要

Abstract:Long-term robot deployment requires a compact and scalable memory that preserves fine-grained visual semantics, grounds observations in space and time, and enables efficient storage and retrieval. In this paper, we propose RAVEN, an agentic memory system for long-horizon robotic question answering and navigation. RAVEN stores visual embeddings with pose and time in a vector database, and grounds retrieval in a spatial map to answer queries and navigate to goals. By operating directly on visual embeddings, RAVEN avoids lossy image-to-text captioning and enables accurate semantic, spatial, and temporal retrieval at scale. Across several simulated and real-world video question-answering benchmarks, RAVEN consistently surpasses caption-based memory systems and matches frontier VLMs on long-horizon tasks at 10 \times lower retrieval cost. Finally, we instantiate RAVEN on a Unitree Go1 robot for the task of long-horizon navigation for natural language goal-reaching, and show successful deployment over several large indoor environments.

[NLP-50] o Isolate or to Score? Model-Adaptive Assessment for Cost-Efficient Multi-Agent RAG

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中多智能体文档评估(multi-agent document assessment)计算成本高昂的问题,尤其关注在小型可部署模型上,其评估机制缺乏清晰理解的现状。核心问题是:如何在不显著增加计算开销的前提下,有效提升RAG系统的性能。解决方案的关键在于揭示了不同模型能力水平下评估机制的本质差异:对于较弱的基础模型,性能提升主要源于对单个文档的隔离处理,而非文档质量评分;令人惊讶的是,完全去除评估环节而仅采用文档隔离即可达到与完整多智能体评估相当的效果,表明消除多文档上下文混淆是取得高达50个百分点性能提升的核心因素。而对于较强的基础模型,文档质量评分则成为关键,为此提出“推理-评分耦合”(Reasoning-Score Coupling)——一种无需标签的扰动探针方法,用于识别和分类模型的评分行为。基于上述发现,论文提出MADARA(Model-Adaptive Routing Architecture),一种模型自适应路由架构,其诊断阈值通过单一预训练模型的试点实验获得,并能零样本泛化至四个未见过的模型族,从而实现高效、轻量且鲁棒的评估流程,彻底规避了传统多智能体评估的高计算开销。

链接: https://arxiv.org/abs/2606.25191
作者: Jungseob Lee,Chanjun Park,Heuiseok Lim
机构: Korea University (韩国高丽大学); Soongsil University (中西大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 23 pages, 2 figures, 19 tables. Code: this https URL

点击查看摘要

Abstract:Multi-agent document assessment for retrieval-augmented generation is computationally expensive, driving practitioners toward smaller, deployable models whose assessment mechanisms remain poorly understood. We conduct a controlled study of training-free interventions on 7B-9B instruction-tuned models across diverse QA benchmarks, revealing a sharp dichotomy in how models benefit from assessment. For weaker baselines, the dominant mechanism is per-document isolation. Astoundingly, assessment-free isolation matches full multi-agent assessment, demonstrating that resolving multi-document context confusion, rather than scoring quality, drives outsized gains of up to 50 percentage points. Conversely, for strong baselines where scoring quality matters, we introduce Reasoning-Score Coupling, a label-free perturbation probe that classifies scoring behavior. Integrating these findings, we propose MADARA, a model-adaptive routing architecture. Crucially, MADARA’s diagnostic thresholds derived from a single pilot model generalize zero-shot to four unseen model families, providing a robust, lightweight pipeline to eliminate computational overhead.

[NLP-51] What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics ECML KDD ICML2026

【速读】: 该论文旨在解决对齐后的大型语言模型(Large Language Models, LLMs)在面对“越狱攻击”(jailbreak attacks)时所暴露的内在安全缺陷问题,即尽管经过安全训练,模型仍可能通过精心设计的提示(prompt)生成违反政策的内容。现有防御机制多聚焦于输入提示或输出结果层面,但缺乏对有害意图如何在模型内部表示中编码的深入理解。本文的关键解决方案是利用“逻辑透镜”(logit lens)技术,分析冻结的LLM在不同层间逐标记的预测熵(predictive entropy)动态轨迹。研究发现,仅依赖提示级别的熵统计量(如均值、方差)难以区分正常与越狱行为,而捕捉熵随标记位置演变趋势的特征(如基于单调性排名的趋势评分)则具有显著更强的判别能力。更重要的是,这种信号并非在整个网络深度上均匀分布,而是集中于中间层,且在最终输出层显著衰减,表明越狱相关的结构化不确定性主要体现在模型的中层表征中,而非输出头。该方法在多个模型(Llama、Qwen、Gemma)和对抗基准测试中均实现了无需额外训练的架构一致分离,揭示了越狱行为在模型内部表现为可识别的中间层不确定性动态模式,明确了编码有害意图的关键特征及其在模型中的空间定位。

链接: https://arxiv.org/abs/2606.25182
作者: Sofiia Nikolenko,Michele Papucci,Mina Rezaei,Shireen Kudukkil Manchingal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2026. A short version accepted at EIML@ICML 2026

点击查看摘要

Abstract:Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model’s internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.

[NLP-52] Hitting a Moving Target: Test-Time Adaptation for AI Text Detection under Continual Distribution Shift

【速读】: 该论文旨在解决现有生成式AI文本检测方法在实际部署后面临的三大分布偏移(distribution shifts)问题:对抗性人类化、新型大语言模型(LLM)的发布以及人类写作随时间产生的风格漂移。传统方法依赖训练阶段的标注数据,但在部署后这些数据往往不可得,导致性能显著下降。其核心缺陷在于未能有效利用大语言模型使用过程中的一个关键信号——推理时样本间的同质性(inference-time homogeneity)。为此,论文提出一种测试时自适应(Test-Time Adaptation, TTA)方法,通过半监督学习框架,在推理阶段利用未标注样本间的同质性来动态适应分布变化。实验证明,当前最先进的监督式检测器在面对对抗性和自然分布偏移时表现系统性失效,而所提出的TTA方法具有高度鲁棒性;例如,商用模型Pangram仅能检测到24.1%的对抗性生成文本,而本方法达到90.5%的检测率。结果表明,测试时自适应是一种面向真实场景下生成式AI文本检测的可行且高效的新范式。

链接: https://arxiv.org/abs/2606.25152
作者: Kevin Ren,Manish Raghavan,Nikhil Garg
机构: Cornell Tech (康奈尔科技); MIT (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deployed approaches for AI text detection often rely on training-time access to labeled datasets of both human-written and AI-generated text. This approach is vulnerable to three types of distribution shifts that occur continually post-deployment, and for which labeled data is often unavailable: adversarial humanization, new LLMs being released, and temporal drift in human writing. Simultaneously, existing approaches do not leverage a key signal of LLM usage: inference-time homogeneity. We propose a test-time adaptation (TTA) approach, using semi-supervised learning, that adapts to distribution shifts by leveraging homogeneity among unlabeled samples observed at inference time. Empirically, we find that state-of-the-art supervised detectors systematically fail when they encounter distribution shifts in AI-generated and human writing, both adversarial and natural, while test-time adaptation with semi-supervised learning is largely robust; e.g., the commercial model Pangram detects just 24.1% of our adversarial AI-generated text, compared to 90.5% for our test-time approach. We establish that test-time adaptation is a promising framework for AI text detection in the wild. We publicly release our code (which includes code for model training, evaluation, and plots) at this https URL.

[NLP-53] he cognitive affective and behavioral expression of self-stigma among people who use drugs in online substance use communities

【速读】: 该论文旨在解决物质使用人群(people who use drugs)在社交媒体中自我污名(self-stigma)的多维度表征问题,具体聚焦于认知、情感与行为三个层面的自我污名指标的识别、共现模式及时间演变规律。其核心解决方案在于构建了一个包含十项指标的共识性编码手册(codebook),涵盖认知(如自我标签化、悲观/自挫)、情感(如羞耻、内疚/自责、绝望/无望)和行为(如隐瞒、预期排斥、戒断意愿、矛盾心理)三类维度,并通过双盲编码达成较高一致性(Cohen’s k = 0.72)。随后,研究采用经专家标注验证的大语言模型(large language model)实现大规模文本分类(k = 0.73, F1 = 0.80),对72,115条来自1,660名用户的英文帖子进行分析。关键发现表明:自我污名在个体层面具有高度整合性,核心内部指标与行为指标存在强关联(OR = 4.65),且87.0%的行为类表达均伴随核心指标;更值得注意的是,行为类指标(如戒断意愿)往往早于核心情感指标(如羞耻)出现,挑战了传统渐进式自我污名发展模型。此外,除悲观态度外,其余九项指标在时间轨迹上保持稳定,提示悲观情绪随时间加剧,成为早期数字干预的关键靶点。研究揭示,自我污名并非线性演进过程,其在文本披露中的表现形式与阶段模型不完全对应,强调需基于动态、整合的视角理解在线群体中的自我污名机制。

链接: https://arxiv.org/abs/2606.25143
作者: Layla Bouzoubaa,Hyung Wook Choi,Milan Varghese,Valerie Earnshaw,Rezvaneh Rezapour
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Objectives: To develop a codebook for self-stigma across cognitive, affective, and behavioral domains, and to estimate the prevalence, co-occurrence, and temporal patterns of these indicators in Reddit posts by people who use drugs. Methods: We developed a ten-indicator codebook through consensus-based abductive coding spanning cognitive (self-labeling, pessimism/self-defeatism, deservingness/worthlessness), affective (shame, guilt/self-blame, despair/hopelessness), and behavioral (concealment, anticipated rejection, desire to quit, ambivalence) domains; two coders reached substantial agreement (Cohen’s k = 0.72). We then scaled classification with a large language model validated against expert coding (k = 0.73, F1 = 0.80), analyzing 72,115 thread-initiating posts from 1,660 English-language users (2006-2025). Results: 3,838 posts (5.3%) from 1,228 users (74.0%) contained self-stigma; all ten indicators discriminated self-stigma posts (RR 3.6 to 86.2), led by self-labeling (56.0%) and despair/hopelessness (48.5%). Self-stigma was integrated: core and behavioral indicators were strongly associated at the user level (OR = 4.65, 95% CI 3.12-6.94, p 0.001), and 87.0% of posts with behavioral indicators also contained a core indicator. Contrary to progressive models, behavioral indicators emerged earlier than core ones (desire to quit at median position 0.08 vs. shame at 0.38). Nine of ten indicators were stable across posting trajectories; only pessimism increased (OR = 1.62, 95% CI 1.25-2.10). Conclusion: Among people who use drugs online, self-stigma is an integrated phenomenon in which behavioral indicators rarely appear without internalized ones and often precede them. Most expressions remain stable over time, but pessimism about change deepens, marking a target for early digital intervention and showing that progressive stage models do not map directly onto textual disclosure.

[NLP-54] Mind the Heads: Topological Representation Alignment for Multimodal LLM s

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态对齐中存在的粗粒度表示对齐问题。现有方法通常仅对语言模型的固定层进行对齐,忽略了Transformer架构中各注意力头(attention head)的细粒度结构差异,导致对跨模态语义关系的建模不够精确。为此,本文提出一种基于头部级表示对齐(Head-Wise Representation Alignment, HeRA)的新方法,其核心在于在单个注意力头层面实现跨模态对齐。该方法基于柏拉图式表示假说(Platonic Representation Hypothesis),强调保持不同模态间表示的拓扑结构一致性(即局部邻域关系)。通过引入基于互近邻(Mutual K-Nearest Neighbor, MKNN)的对比损失函数,构建可微分的代理目标以匹配局部结构,并依据MKNN得分选择特定注意力头进行对齐。出人意料的是,实验发现对对齐程度最低的注意力头进行对齐反而能带来最大性能提升。大量实验证明,HeRA在多个主流MLLM和18个基准任务上均显著提升了视觉主导型任务的表现,同时有效抑制了视觉幻觉现象,缓解了模型对语言先验的过度依赖,展现出良好的正则化能力。

链接: https://arxiv.org/abs/2606.23885
作者: Davide Caffagni,Alberto Compagnoni,Federico Melis,Sara Sarto,Pier Luigi Dovesi,Mark Granroth-Wilding,Marcella Cornia,Lorenzo Baraldi
机构: University of Modena and Reggio Emilia(摩德纳与雷焦艾米利亚大学); University of Pisa(比萨大学); AMD Silo AI(AMD硅谷人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.

[NLP-55] One Year Later…The Harms Persist But So Do We!

【速读】: 该论文旨在解决当前通用大语言模型(Large Language Models, LLMs)在心理健康相关对话中安全防护机制不健全、且在不同临床情境下表现不一致的核心问题。其关键解决方案在于提出一个包含八个维度的伤害分类体系(harm taxonomy)与多维度评估框架,系统性地评估六种专有大语言模型在16种DSM-5诊断类别下的安全表现。研究发现,仅有自杀与自伤相关情境中的防护措施表现可靠,而进食障碍、物质使用障碍及重度抑郁障碍等常见心理疾病情境下,模型的安全防护失败率可达100%。因此,论文强调,实现生成式AI在心理健康领域的伦理化设计与部署,必须基于明确的临床情境化伤害分类,并据此实施针对性的安全保障机制;否则,这些模型对脆弱群体构成显著风险,尤其在教育场景中日益广泛应用的情况下,更需引起高度重视。

链接: https://arxiv.org/abs/2606.23884
作者: Annika Marie Schoene,Cansu Canca,Gautham Vijay Kumar,Anson Antony
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 8 tables

点击查看摘要

Abstract:General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical conditions. This study evaluates six proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants, introducing an eight-dimension harm taxonomy and a multi-dimensional evaluation framework. Results show that safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. We argue that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly. Until such safeguards are in place, these models pose significant risks to vulnerable populations, making their growing integration into educational settings a particularly concerning.

[NLP-56] ESBMC-PLC: A Unified IEC 61131-3 Formal Verification Framework as a PLCverif Successor

【速读】: 该论文旨在解决现有开源可编程逻辑控制器(PLC)形式化验证平台PLCverif的两大核心局限:一是不支持主流的梯形图(Ladder Diagram, LD)程序,二是依赖CBMC作为后端导致验证仅限于有界证明。其解决方案的关键在于提出ESBMC-PLC+,一个统一框架,首次通过单一ESBMC后端实现对IEC 61131-3标准中三种主要输入格式(梯形图LD、结构化文本ST/SCL及图形化LD)的全面支持,并实现无界安全性证明。具体而言,其创新点包括:(1)基于MATIEC IEC 61131-3编译器构建ST/SCL前端,将编译后的C代码注入ESBMC,结合非确定性输入建模与YAML属性注入以支持k-归纳;(2)为图形化梯形图引入函数块状态语义,扩展深度优先搜索(DFS)解析器,将TON/TOF/TP定时器、CTU/CTD计数器及R_TRIG/F_TRIG边沿触发器建模为在GOTO中间表示中持久化的扫描周期状态变量。实验表明,ESBMC-PLC+在8个基准程序上实现了与PLCverif相当的输入覆盖,且在包含多达8个整型定时器的复杂程序中表现出更强的验证能力;相较于nuXmv的BDD后端,其在定时器相关程序上提速达400–2000倍,并能在nuXmv超时前完成证明。

链接: https://arxiv.org/abs/2606.23870
作者: Pierre Dantas,Lucas Cordeiro,Waldir Junior
机构: The University of Manchester (曼彻斯特大学); UFAM (亚马逊联邦大学)
类目: Programming Languages (cs.PL); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 21pages

点击查看摘要

Abstract:PLCverif is the most mature open-source platform for PLC formal verification, developed at CERN and in production use since 2019. Yet it has two fundamental limitations: no support for Ladder Diagram (LD) programs, the dominant PLC notation, and reliance on CBMC as its primary backend, which restricts verification to bounded proofs. The PLCverif authors themselves identified ESBMC as the appropriate backend improvement. Prior work established ESBMC-PLC (a textual LD frontend with k-induction) and ESBMC-GraphPLC (graphical PLCopen XML support); together, they cover LD with unbounded proofs but not Structured Text (ST), and graphical LD with timer/counter function blocks remains unverifiable. This paper presents ESBMC-PLC+, a unified framework that closes both gaps: (1) an ST/SCL frontend via the MATIEC IEC 61131-3 compiler, routing C-compiled ST to ESBMC with nondeterministic input modeling and YAML property injection; (2) function block state semantics for graphical LD, extending the DFS resolver to model TON/TOF/TP timers, CTU/CTD counters, and R_TRIG/F_TRIG edge triggers as persistent scan-cycle state variables in the GOTO IR. ESBMC-PLC+ is the first open-source PLC verification framework to support all three major IEC 61131-3 input formats via a single ESBMC backend, enabling k-induction-unbounded safety proofs. A feature comparison with PLCverif and experimental evaluation on 8 benchmark programs, including programs with up to 8 integer timers, shows that ESBMC-PLC+ matches PLCverif’s input coverage while providing stronger guarantees. Against nuXmv’s BDD backend, ESBMC-PLC+ is 400-2,000x faster on timer programs and completes proofs where nuXmv BDD times out at 120s.

[NLP-57] Learning Diachronic Representations of Ancient Greek Letterforms ICDAR

【速读】: 该论文旨在解决跨世纪手写体变化背景下表征学习的鲁棒性问题,特别是在古希腊语这一使用时间跨度长达数个世纪的书写系统中,如何有效建模字符在长期演变中的稳定性与差异性。其核心挑战包括符号变异、数据稀缺以及系统性退化(如残损、模糊)。解决方案的关键在于提出两种创新机制:一是基于相似度加权的监督对比损失(similarity-weighted supervised contrastive loss),通过动态估计类别间相似性来引导嵌入空间的分布,从而更好地保留字符间的内在关系;二是基于空白缺失(lacuna-driven)的增强策略,模拟真实手稿中常见的缺损情况,提升模型对噪声和不完整输入的适应能力。实验表明,结合上述方法训练的轻量级CNN与预训练ResNet均实现了优异的识别性能,并生成了更具可解释性的嵌入表示,能够有效分离字符类别、聚类出风格子群并构建可视化的时间演化原型图像。研究结果证明,充分考虑字符内部固有关系并引入领域知识驱动的退化增强,可在数据稀缺、时序演进且噪声干扰严重的条件下实现稳健且可解释的表征学习,为类似场景提供了可迁移的技术范式。

链接: https://arxiv.org/abs/2606.24984
作者: John Pavlopoulos,Spyros Barbakos,Lavinia Ferretti,Dionysis Voulgarakis,Asimina Paparrigopoulou,Maria Konstantinidou,Giuseppe De Gregorio,Isabelle Marthot-Santaniello,Paraskevi Platanou,Holger Essler
机构: University of Athens (雅典大学); National and Kapodistrian University of Athens (国家和卡波迪斯特里亚大学雅典); Institute for Language and Speech Processing (语言与语音处理研究所); University of Bologna (博洛尼亚大学); University of Paris (巴黎大学); Aristotle University of Thessaloniki (塞萨洛尼基亚里士多德大学); University of Pisa (比萨大学); University of Hamburg (汉堡大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the International Conference on Document Analysis and Recognition (ICDAR) 2026

点击查看摘要

Abstract:Learning representations that remain robust across centuries of variation in handwriting is a key challenge in diachronic representation learning. Taking one of the longest continuously used writing systems, ancient Greek, as a case study, we introduce three datasets for diachronic representation learning: Hell-Char, a curated training set spanning the 3rd-1st centuries BCE, and two evaluation sets, PaLit-Char (2nd-5th c. CE) and Med-Char (9th-14th c. CE). To address the challenges of symbolic variation, scarce data, and systematic degradation, we propose: a similarity-weighted supervised contrastive loss that biases embeddings using dynamically estimated inter-class similarities, and a lacuna-driven augmentation scheme that simulates realistic manuscript corruptions. Trained with these strategies, both a lightweight CNN and a pretrained ResNet achieve strong recognition performance and produce embeddings that more coherently separate character classes than PCA or generic pretrained models. These embeddings enable clustering, identification of stylistic subgroups, and construction of prototype images that visualize diachronic evolution and transitional letterforms. Our results demonstrate that respecting intrinsic inter-letter relationships and augmenting with domain-informed corruptions yield robust, interpretable representations, offering a transferable paradigm for representation learning under scarce, temporally evolving, and noisy conditions. Code and data available at: this https URL.

[NLP-58] Diagnosing and Mitigating Compounding Failures in Agent ic Persuasion via Taxonomic Strategy Retrieval

【速读】: 该论文旨在解决生成式 AI 代理在多步、开放式环境中因早期错误累积导致的轨迹退化问题,尤其在主观性任务(如说服)中,传统方法易出现策略漂移与顺从性趋同。其核心解决方案是提出一种名为**分类策略检索增强生成(Taxonomic Strategy RAG, TS-RAG)**的系统干预机制,通过引入离散的类别瓶颈(categorical bottleneck),将论证结构与主题内容解耦,从而消除标准检索增强生成(RAG)中由词汇重叠驱动的语义泄漏问题。实验表明,TS-RAG在零样本跨领域场景下显著提升了抽象逻辑的迁移能力,且在不对称部署中充当“能力桥梁”,使轻量级说服代理能够稳定击败参数量更大的对手(胜率从70.5%提升至78.5%),并大幅提高论证效率。此外,研究还提出了逐轮辩论状态表示(Debate State Representation, DSR)作为细粒度诊断工具,揭示了严格约束对于防止评估崩溃及代理默认顺从性至关重要。

链接: https://arxiv.org/abs/2606.24976
作者: Pradyumna Narayana,Sana Ayromlou,Purvi Sehgal
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Foundation-model agents in multi-step, open-ended environments frequently suffer from compounding errors, where early mistakes contaminate long-horizon trajectories. While Multi-Agent Debate (MAD) succeeds in deterministic domains, agents in subjective tasks like persuasion experience severe problem drift and sycophantic conformity. We identify semantic leakage in standard Retrieval-Augmented Generation (RAG) as a reproducible trigger for these failures, as standard RAG prioritizes vocabulary overlap over logical necessity. To eliminate this leakage, we introduce Taxonomic Strategy RAG (TS-RAG), a systems intervention that routes strategies through a discrete categorical bottleneck to decouple argumentative structure from topical content. Zero-shot, cross-domain evaluations demonstrate that TS-RAG significantly improves the transfer of abstract logic where standard semantic retrieval collapses. Crucially, TS-RAG acts as a “capability bridge” in asymmetric deployments, empowering lightweight persuaders to consistently defeat parametrically superior opponents (improving win rates from 70.5 to 78.5) and accelerating argumentative efficiency. Finally, we introduce trace-level diagnostics via a turn-by-turn Debate State Representation (DSR), demonstrating the necessity of strict constraints to prevent evaluation collapse via default agentic sycophancy. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2606.24976 [cs.AI] (or arXiv:2606.24976v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.24976 Focus to learn more arXiv-issued DOI via DataCite

[NLP-59] Why Do Accumulated Transformations Extrapolate?

【速读】: 该论文旨在解决大模型在长上下文场景下注意力机制(Attention)性能退化的问题,特别是针对旋转位置编码(RoPE)在超出训练长度时出现的外推能力下降现象。其核心问题是:现有方法中通过位置索引进行旋转的机制是否在长序列下存在固有局限?解决方案的关键在于引入累积式数据依赖变换(accumulated data-dependent transformations),并揭示其内在机制。研究发现,无论采用Householder反射还是简化版的块对角SO(2)旋转,只要将原本基于位置索引的角度替换为随token累积的数据依赖角度,均能显著提升长度外推能力,但会在极端长上下文时出现性能下降。理论分析表明,满足一定正则性条件的累积正交变换在有限步后会生成非相干(incoherent)的注意力分布,从而形成一个与上下文长度无关的有限混合窗口;这种机制使训练中学习到的近距信号抑制模式可直接泛化至任意评估长度,而高维集中效应则产生显著的分数差距,有效抑制远距离令牌的影响,同时保留近端路径的信号传输。然而,反向下界证明:随着远端集合增大,若无显式的远端质量控制,任何累积旋转都无法避免对近端信号的破坏。实验验证进一步支持这一理论——随机累积旋转显著优于RoPE,学习得到的token依赖旋转可在远超训练长度的情况下保持低困惑度,且仅旋转查询和键(query/key)的效果优于仅旋转值(value),但纯旋转模型仍会在极长序列下退化,而ALiBi因具备显式远端质量控制,表现出更强的长度稳定性。因此,该研究揭示了累积变换带来的“隐式短视”特性及其根本局限,并强调了远端质量控制在实现真正长度不变性中的必要性。

链接: https://arxiv.org/abs/2606.24975
作者: Mahesh Godavarti
机构: A Carrot, Inc.
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 33 pages, submitted to TMLR

点击查看摘要

Abstract:PaTH Attention showed that replacing RoPE’s position-indexed rotations with accumulated data-dependent Householder reflections yields strong length extrapolation, though performance degrades at extreme context lengths. We ask whether this depends on Householder-specific structure or reflects a general property of accumulated transformations along source-to-query paths. We study a simpler variant keeping RoPE’s block-diagonal SO(2) rotations but replacing position-indexed angles with accumulated token-dependent ones. It shows the same pattern: improved extrapolation then degradation at long contexts. We prove the result extends to accumulated orthogonal transformations satisfying certain regularity conditions: their products become incoherent after finitely many steps, suppressing attention to distant tokens. Accumulated rotations of queries and keys create a finite mixing window independent of context length; per-token suppression learned in training transfers unchanged to any evaluation length, and high-dimensional concentration produces a score gap suppressing far tokens while near-route transport preserves the target signal. Conversely, a lower bound shows accumulated rotations must eventually degrade: as the far set grows, no rotations preserve the near signal without explicit far-mass control. For SO(2) rotations, rotating values too makes residual far contributions combine incoherently, extending the range. Controlled experiments support these predictions: random accumulated rotations substantially improve extrapolation over RoPE, learned token-dependent rotations maintain near-training-length perplexity far beyond the training context, and rotating values helps over queries and keys alone. Rotation-only models still degrade at extreme lengths, while ALiBi stays length-stable, consistent with the need for far-mass control.

[NLP-60] LLM Performance on a Real Double-Marked GCSE Benchmark

【速读】: 该论文旨在解决教育评估中自动化评分的可靠性与一致性问题,特别是针对英国全国性考试(GCSE)模拟试题中学生真实作答(含手写内容)的客观、高效评分需求。其核心挑战在于验证生成式AI模型在主观题(如英语作文)及复杂手写数学题等任务中的评分表现是否可媲美人类评卷员的一致性。解决方案的关键在于构建一个大规模、多学科、双标注的真实学生作答数据集(共32,534条样本,涵盖5个学科、328道题目),并基于此评估现成大语言模型(off-the-shelf large language models)与人类评卷员之间的评分一致性。研究发现,顶尖模型在多数科目上的评分结果与评卷员共识高度一致,甚至在部分领域优于评卷员间的互评一致性;同时,模型在处理主观性强的任务和杂乱手写文本时表现稳健,且评分一致性不显著依赖于模型规模,表明其具备成本效益高的自动化评分潜力。

链接: https://arxiv.org/abs/2606.24973
作者: Malachy Fox,Kavi Samra,Paul Jung
机构: Medly AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a dataset of 32,534 double-marked real student responses to GCSE mock exams (GCSEs are the UK’s national exams, taken at age ~16), spanning 328 questions across five subjects and including handwritten work. We test whether off-the-shelf large language models agree with examiners as closely as the two examiners agree with each other. We find that models overwhelmingly agree well with the examiner consensus across subjects, with the top performing models agreeing more closely with examiners than examiners agree with each other. Models achieve high scores for subjective tasks like English essay marking, as well as handling complex and messy handwritten Maths paper scripts. Agreement is uniform near the examiner line, and not massively discriminated by model size, providing cost-effective automated marking solutions.

[NLP-61] Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding ICML2026

【速读】: 该论文旨在解决长上下文大语言模型(Long-context Large Language Models, LLMs)在采用推测解码(speculative decoding)时,因验证阶段的键值(Key-Value, KV)缓存加载导致的效率瓶颈问题。现有压缩方法在此场景下表现不佳:静态淘汰策略因显著性偏移(saliency shift)引发精度损失,而动态选择策略则在验证路径中引入过高的计算开销。论文提出的解决方案——Dustin,是一种面向长上下文推测解码的稀疏验证框架,其核心在于融合草稿模型的前瞻信号(lookahead signals)与目标模型的历史注意力信息(historical attention),以高保真度识别多步验证窗口中的关键标记(critical tokens)。为降低重计算延迟,Dustin进一步采用稀疏估计机制,仅对极小部分注意力头进行重要性评分。在PG-19和LongBench数据集上使用Qwen2.5-72B模型的评估结果表明,Dustin在32k序列长度下实现了自注意力计算27.85倍的加速以及端到端解码9.17倍的加速,且精度损失可忽略不计。

链接: https://arxiv.org/abs/2606.24957
作者: WenHung Lee,Jian-Jia Chen,Xiaolin Lin,Pei-Shuo Wang,Chi-Chih Chang,Chun-Che Yang,Ning-Chi Huang,Grace Li Zhang,Kai-Chiang Wu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ICML 2026. 9 pages main text, includes references and appendix

点击查看摘要

Abstract:While speculative decoding improves inference throughput for multi-batch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck where Key-Value (KV) cache loading dominates latency. Existing compression methods fail in this regime: static eviction incurs accuracy loss due to saliency shift, while dynamic selection introduces prohibitive computational overhead during the verification path. We propose Dustin, a sparse verification framework designed for long-context speculative decoding. Dustin integrates lookahead signals from the draft model with historical attention from the target model to identify critical tokens with high fidelity across multi-step verification windows. To reduce recomputation latency, this approach further employs a sparse estimation scheme that restricts importance scoring to a minimal subset of attention heads. Evaluations on PG-19 and LongBench with Qwen2.5-72B demonstrate that Dustin achieves a 27.85x speedup in self-attention and a 9.17x end-to-end decoding speedup at a 32k sequence length, all with negligible accuracy degradation.

[NLP-62] Digital Twin-Driven Adaptive Sim-to-Real Alignment via Reinforcement Learning for Vibration-Based Bearing Health Monitoring Under Data Scarcity

【速读】: 该论文旨在解决旋转机械振动健康监测中在运行数据受限条件下,由于故障事件结构稀缺以及数字孪生生成信号存在的仿真到真实(sim-to-real)异质性差距,导致故障诊断可靠性下降的问题。其核心挑战在于不同故障类型在冲击周期性、幅值调制及频谱特性上存在根本性的特征空间差异,使得跨类特征分布不一致。现有领域自适应方法采用全局无类别区分的变换策略,无法有效弥合特定故障类别的差异而不破坏类间可分性,且均匀混合源域与目标域数据会向数据丰富的正常类引入分布噪声。这些问题的根本原因在于将具有状态依赖性的序列对齐过程误当作一次性优化问题处理,导致每次校正同时改变所有类别分布,形成难以通过静态梯度下降求解的状态依赖关系。为此,本文提出将特征对齐建模为连续动作的马尔可夫决策过程,并采用近端策略优化(Proximal Policy Optimization, PPO)求解,使学习到的策略能够根据当前特征空间配置,动态生成针对不同故障类型的仿射校正,同时通过双目标奖励函数平衡域间差距最小化与类间可分性保持。进一步设计了不对称感知策略,保留真实数据用于正常类,而用经策略对齐的仿真样本增强故障类。在XJTU-SY、CWRU及自建回转轴承试验平台上的验证表明,基于强化学习驱动的特征对齐显著提升了诊断性能,跨设备线性探测无需编码器重训练即可达到92.8%的准确率,证明了该方法具备良好的可迁移监测能力。

链接: https://arxiv.org/abs/2606.24954
作者: Jinghan Wang,Yanjun Chen,Wei Zhang,Wentao Wu,Tianchen Liu,Gaoliang Peng
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vibration-based health monitoring of rotating machinery requires reliable fault diagnosis under operational data constraints, yet condition assessment remains challenged by structural scarcity of fault events and heterogeneous sim-to-real gaps in digital twin-generated signals. Each fault type generates impulses with distinct periodicity, amplitude modulation, and spectral character, making feature-space discrepancies fundamentally heterogeneous across fault classes. Existing domain adaptation methods apply a class-agnostic global transformation that cannot close all fault-specific gaps without distorting inter-class separability, while uniform source-target mixing introduces distributional noise into the data-abundant Normal class. These limitations stem from treating a sequential, state-dependent alignment problem as a one-shot optimization. Each corrective transformation simultaneously reshapes all class distributions, creating state dependencies that static gradient descent cannot resolve. We formulate feature alignment as a continuous-action Markov decision process solved via Proximal Policy Optimization, where the learned policy issues fault-type-specific affine corrections responsive to the current feature-space configuration, with a dual-objective reward balancing gap minimization against separability preservation. An asymmetry-aware strategy reserves real data for the Normal class while augmenting fault classes with policy-aligned simulated samples. Validation across XJTU-SY, CWRU, and a self-built slewing bearing testbed confirms the dominant gain from reinforcement learning-driven alignment, and cross-equipment linear probing achieves 92.8% without encoder retraining, demonstrating transferable monitoring capability.

[NLP-63] Perfect Detection Failed Control: The Geometry of Knowing vs. Steering in Language Models

【速读】: 该论文旨在解决生成式模型中“可解释性”与“可控性”之间的核心矛盾问题,即:当能够通过模型激活值检测到某一行为(如幻觉)时,是否也能通过相同或相近的方向实现对该行为的控制。其解决方案的关键在于从几何角度量化“检测方向”与“干预方向”之间的夹角——具体表现为两者方向余弦(cosine)的大小。研究发现,在Gemma 2-2B-it模型中,尽管幻觉实体可被激活空间中的线性分离完美检测(AUC=1.000),但该检测方向与导致拒绝回应的干预方向之间存在显著夹角(cos=0.12,约83°),表明检测与控制在几何上严重脱节。这一现象在多个模型家族和规模下保持一致(cos∈[0.12,0.20]),且不随指令微调改变,说明其根源在于预训练阶段。尽管通过微小旋转(15°)可部分弥合该差距并提升拒绝率,但方向余弦本身并不能预测可操控性,因为可操控性取决于功能层面的动态机制而非静态几何角度。因此,该余弦值作为权重可计算的指标,仅表征“认知”与“引导”之间的解耦程度,而非可操控性的先验预测因子。

链接: https://arxiv.org/abs/2606.24952
作者: Cosimo Galeone,Anna Ettorre,Minsu Park,Giuseppe Ettorre,Daniele Ligorio
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model’s activations, we should be able to modify it. This rests on a hidden premise – that the direction which detects a behavior and the direction which controls it are the same, or close. We test this geometrically: what is the angle between the direction that best detects a behavior and the one that best causes it? If detection implies control the cosine is near 1; otherwise it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs markdown fencing) collapses both roles onto one axis. Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal – a small, reproducible alignment, far from the cos = 1 that “detection is control” would require. A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06). The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20], identical before and after instruction tuning (0.1197 vs 0.1200), placing its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it – 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives. We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.

[NLP-64] Agent Odyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents

【速读】: 该论文旨在解决测试时持续学习(test-time continual learning)中智能体在真实复杂环境中面临的多重挑战,包括有效探索、动态获取世界知识与技能、保留相关情景记忆以及进行长时程规划的能力。其核心问题是现有评估框架无法充分衡量智能体在部署过程中持续学习与推理的综合能力,尤其忽视了学习与推理在时间上交错发生的现实场景。为解决这一问题,作者提出AgentOdyssey——一个基于程序化生成的开放式文本游戏评估框架,能够动态构建包含丰富实体、复杂世界动态和长时程任务的环境,突破传统机器学习中“测试阶段不学习”的假设。该框架的关键创新在于构建了一个连续、长期的任务设置,使智能体在实际部署中持续经历学习与推理的交互过程。此外,论文设计了一套多维度评估方法,不仅衡量游戏进展,还通过诊断性测试评估世界知识获取、情景记忆、对象与动作探索、动作多样性及模型开销等关键指标。实验结果揭示当前主流智能体在多个核心能力上存在显著局限,尽管性能随基础模型增强而提升,但顶尖智能体仍远低于人类表现,表明仍有巨大改进空间。研究进一步发现,短期记忆机制对多种智能体范式均具显著增益,是实现有效测试时训练的重要组成部分。

链接: https://arxiv.org/abs/2606.24893
作者: Zheyuan Zhang,Zehao Wen,Alvin Zhang,Andrew Wang,Jianwen Xie,Daniel Khashabi,Tianmin Shu
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:For agents to learn continuously from interaction with the world at test time, they must be able to explore effectively, acquire new world knowledge and skills, retain relevant episodic experiences, and plan over long horizons. To evaluate these key abilities of test-time continual learning agents, we introduce AgentOdyssey, a novel evaluation framework that procedurally generates open-ended text games with rich entities, world dynamics, and long-horizon tasks. Critically, AgentOdyssey goes beyond the conventional machine learning assumption that learning does not occur at test time by placing agents in a continuous, long-horizon setting that interleaves learning and inference throughout deployment. We further propose a multifaceted evaluation methodology that measures not only game progress but also offers diagnostic tests on world knowledge acquisition, episodic memory, object and action exploration, action diversity, and model cost. We evaluate diverse agent paradigms in the generated games. Our experimental results reveal critical limits in agents’ key abilities, as well as factors that influence their meaningful horizon. Although performance scales with stronger base models, even the top agent remains far below human performance, leaving substantial headroom for improvement. Among agent mechanisms, we find that short-term memory benefits multiple agent paradigms and is an important component of agent test-time training.

[NLP-65] Small edits large models: How Wikipedia advocacy shapes LLM values

【速读】: 该论文旨在解决的问题是:小规模志愿者通过编辑维基百科内容,能否显著影响大语言模型(LLM)在动物福利等议题上的表现与决策倾向。研究发现,尽管仅由少数倡导者组成的“支持动物福利维基人”(Pro-Animal Wikipedians, PAW)团队进行了有限的编辑(共125次编辑覆盖115个页面),但这些编辑对主流语言模型的行为产生了可测量且显著的影响。其解决方案的关键在于:利用梯度数据归因方法(如Bergson、MAGIC)和留子集验证(leave-subset-out validation)技术,量化并证实了维基百科中特定主题内容对语言模型输出的因果性影响。结果显示,在动物福利相关查询中,PAW编辑内容在模型响应中的贡献度高达68%(p < 0.0001),远超其在无关话题中的占比(52%,p = 0.53),且在多组训练顺序种子下,所有最高影响力文档均为PAW编辑内容。此外,基于PAW内容微调的模型在动物福利文本上困惑度降低40%,而控制组模型则在对应控制文本上表现更优。这表明,维基百科作为训练数据的重要组成部分,其内容质量与立场可通过小范围、有组织的编辑活动,被有效放大并嵌入到大语言模型的知识结构中,从而实现对模型行为的定向塑造

链接: https://arxiv.org/abs/2606.24890
作者: Jasmine Brazilek,Maria Navas,Alexa Gnauck
机构: Compassion Aligned Machine Learning (CaML); Pro-Animal Wikipedians (PAW)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Can a small group of volunteers shape how AI systems discuss animal welfare, just by editing Wikipedia? We show that they can. Wikipedia appears in nearly every major language model training dataset and is weighted more heavily than web-crawled text. The Pro-Animal Wikipedians (PAW), a group of advocates who add sourced animal welfare content to relevant articles, have made 125 edits across 115 pages. Using gradient-based data attribution (Bergson; MAGIC), we traced how these edits influence language model behavior. TrackStar retrieval attribution on Llama 3.1 8B found that PAW-edited sections made up 68 percent of the highest-attributed documents for animal welfare queries (p 0.0001) but only 52 percent for unrelated queries about the same companies (p = 0.53): the model links PAW content specifically to animal welfare topics, not to the entities in general. MAGIC counterfactual influence estimation on Llama-3.2-1B, run across five random training-order seeds, gave the same picture even more sharply: in every seed, the top-10 most influential documents on animal welfare queries were all PAW edits (10 of 10, 5 of 5 seeds), while on general queries the same top-10 sat at chance (4 to 6 of 10). Mean PAW influence exceeded mean control influence on animal welfare queries with p 0.0001 in every seed, an effect 6 to 30 times larger than on general queries. Leave-subset-out validation gave Spearman rho = 1.00 for all 10 runs. When we fine-tuned separate models on PAW content versus control content, each model performed better specifically on the type of text it was trained on: the PAW-trained model cut perplexity on animal welfare text from 12.4 to 8.4, while the control-trained model cut perplexity on control text from 16.1 to 11.4. A small, coordinated Wikipedia editing campaign therefore measurably shapes how language models handle the topics those edits address.

[NLP-66] Graph-Based Phonetic Error Correction of Noisy ASR ACL

【速读】: 该论文旨在解决自动语音识别(ASR)系统中残余的词汇错误问题,尤其是对语义关键词(如命名实体、否定词和情感承载词)的错误修正难题。这类错误具有结构性特征,主要源于发音相似性而非随机噪声,因此传统的逐标记纠正方法效果有限。其解决方案的关键在于提出一种名为G-SPIN的结构化ASR纠错框架,通过将音素图建模与上下文语言理解相结合:首先利用图神经网络(GNN)构建声学上合理的候选词邻域,以显式限制纠错搜索空间至音素相近的选项;随后,掩码语言模型(MLM)进行局部上下文评分,最终由指令微调的大语言模型(LLM)在精炼后的候选集上执行上下文感知的重排序。该方法通过解耦结构化的音素推理与语义上下文选择,避免了无约束生成带来的不稳定性,同时显著提升了纠错准确性。整个框架轻量、模块化,且仅在推理阶段运行,具备良好的实用性与可扩展性。

链接: https://arxiv.org/abs/2606.24889
作者: Pratik Rakesh Singh,Mohammadi Zaki,Aneesh Mukkamala,Pankaj Wasnik
机构: Sony Research India
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted at ACL Industry Track 2026

点击查看摘要

Abstract:Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing words. These errors are often structured, arising from phonetic similarity rather than random noise, making naive token-level correction insufficient. We propose a structured ASR correction framework, that we call G-SPIN, that combines phonetic graph modeling with contextual language understanding. A graph neural network (GNN) first constructs acoustically plausible candidate neighborhoods for flagged tokens, explicitly restricting the correction search space to phonetic alternatives. A masked language model (MLM) then provides local contextual scoring, and an instruction-tuned large language model (LLM) performs final context-aware re-ranking over this compact candidate set. By decoupling structured phonetic reasoning from contextual semantic selection, our method avoids unconstrained generation while improving correction accuracy. The framework is lightweight, modular, and operates entirely at inference time.

信息检索

[IR-0] Are We Ready For An Agent -Native Memory System?

链接: https://arxiv.org/abs/2606.24775
作者: Wei Zhou,Xuanhe Zhou,Shaokun Han,Hongming Xu,Guoliang Li,Zhiyu Li,Feiyu Xiong,Fan Wu
类目: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
备注: Paper list available at: this https URL . Source code available at: this https URL

点击查看摘要

Abstract:Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at this https URL.

[IR-1] AutoRelAnnotator: Calibrated Model Cascades for Cost-Efficient Relevance Evaluation in Sponsored Search SIGIR2026

链接: https://arxiv.org/abs/2606.25871
作者: Md Omar Faruk Rokon,Shasvat Desai,Hong Yao,Kuang-chih Lee
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted at E-commerce workshop, SIGIR 2026

点击查看摘要

Abstract:How can we generate high-quality relevance annotations at scale without the cost and delays of human labeling? Relevance annotations are the backbone of search ranking systems which is needed for training data preparation, NDCG evaluation, and root cause analysis. However, human annotation is slow and off-the-shelf LLMs suffer from accuracy on domain-specific tasks. We propose a calibrated model cascade, a systematic approach for cost-efficient offline relevance annotation by routing queries through progressively larger fine-tuned classifiers. Our central insight is that accuracy and cost are orthogonal optimizations: domain-specific fine-tuning drives accuracy, cascading drives cost, and per-class isotonic calibration adds a small but reliable gain on top. Our contribution is threefold: (a) we decompose the gains and show that fine-tuning contributes 20 accuracy points while cascading is approximately accuracy-neutral but halves compute cost, (b) we introduce per-class isotonic calibration as one component of the cascade, contributing a small but statistically significant gain (+0.6 points over the strongest calibration baseline), and © we validate the system in production across six offline use cases, processing 150M+ annotations and enabling faster experimentation cycles. Our work is a building block for scalable, high-quality offline annotation pipelines in search and advertising systems.

[IR-2] A Stochastic Epidemiological Model of Latent Tuberculosis in a Radiation Exposed Mars Colony

链接: https://arxiv.org/abs/2606.25728
作者: Teddy Lazebnik
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Plans to establish a sustained human presence on Mars have moved from speculative ambition toward concrete engineering programmes, making the biological consequences of settlement an increasingly practical question. A Mars colony would place a small, closed population in an environment combining chronic radiation, altered immunity, constrained medical autonomy, and engineered indoor air. Latent infections are especially important because clinically silent carriers may become sources of transmissible disease when host control deteriorates. In this study, we develop a stochastic host-radiation-pathogen-habitat model of latent tuberculosis reactivation in a Mars colony. The model links galactic cosmic radiation to immune competence, immune competence to latent-tuberculosis reactivation, and reactivation to airborne transmission in a closed habitat. We also formulate countermeasure allocation as a partially observable sequential decision problem in which isolation and medication are selected by fixed baselines or by a proximal policy optimization policy trained on an agent-based simulator. Our simulations show that active tuberculosis can emerge endogenously despite no initial infectious cases, and that risk is most sensitive to latent reservoir size, radiation-immune coupling and reactivation sensitivity. Adaptive control reduced infectious burden and mortality while limiting unnecessary intervention. This framework supports mission-specific stress testing of screening, monitoring, shielding and treatment strategies before launch.

[IR-3] racing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

链接: https://arxiv.org/abs/2606.25721
作者: Yan-Lun Chen,Pin-Yu Chen,Chia-Mu Yu,Ying-Dar Lin,Yu-Sung Wu,Wei-Bin Lee
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate model outputs through malicious retrieved documents. Existing detection methods typically rely on auxiliary classifiers or additional LLM-based verification, introducing substantial computational overhead. We present TRACE, a lightweight detection framework that identifies poisoning attacks by tracing answer-related tokens through token influence attribution. TRACE first discovers recurrent high-influence keywords across retrieved documents and then performs a secondary verification to confirm their influence on model predictions. Experiments on three QA benchmarks and six LLMs demonstrate strong detection performance while simultaneously uncovering attacker-specified target answers.

[IR-4] BitNet Text Embeddings

链接: https://arxiv.org/abs/2606.25674
作者: Zhen Li,Xin Huang,Liang Wang,Nan Yang,Ting Song,Yan Xia,Xun Wu,Shaohan Huang,Huishuai Zhang,Furu Wei,Dongyan Zhao
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Under review

点击查看摘要

Abstract:LLM-based text embedders have substantially improved retrieval and semantic representation quality, but their deployment remains costly: large backbone models slow down embedding inference, while high-dimensional full-precision embeddings impose substantial storage and bandwidth overhead on large-scale indexes. In this paper, we present BITEMBED, an extreme low-bit framework for LLM-based text embedding that jointly targets encoding efficiency and vector storage. BITEMBED converts pretrained LLM backbones into BitNet-style embedding encoders with ternary weights, quantized activations, and lightweight normalization refinement. The converted model is adapted to representation learning through continual contrastive pre-training, followed by supervised contrastive fine-tuning with both similarity-distribution distillation and attention-relation distillation from a full-precision teacher. Beyond quantizing the backbone, BITEMBED further trains output embeddings to support multiple storage precisions meeting different storage needs in various scenarios. Experiments on MMTEB (eng, v2) with Qwen3-0.6B and Gemma3-270M show that BITEMBED is largely comparable to full precision teacher embedders. Moreover, BITEMBED flexibly obtains text embeddings of various precisions, achieving a trade-off between performance and storage cost.

[IR-5] Is GraphRAG Needed? From Basic RAG to Graph-/Agent ic Solutions with Context Optimization ACL2026

链接: https://arxiv.org/abs/2606.25656
作者: Long Chen,Ryan Razkenari,Yuxuan Zhou,Yuan Tian,Rahul Ghosh,Venkatesh Pappakrishnan,Disha Ahuja,Vidya Sagar Ravipati
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to ACL 2026 GEM Workshop

点击查看摘要

Abstract:As advanced RAG variants like GraphRAG and Agentic RAG emerge, one leading question is when and how to use them. Here, we introduce a framework for different RAG scenarios evaluation and comparison on semi-structured knowledge bases, including regular RAG, GraphRAG, Modular RAG and Agentic RAG. We provide implementation for 9 standardized RAG scenarios, and conduct experiments for a comprehensive comparison. These scenarios are designed for real use cases regarding data and domain restrictions, spanning from simple document-based retrieval to advanced features such as hybrid text-graph retrieval, integration with computed or pre-defined domain knowledge graphs, agentic multi-step planning, and agent-graph integration. Besides, we present a novel context engineering method for GraphRAG and Agentic RAG, addressing the context/memory overflow issues, efficiently managing text and graph retrievals with new representations and agentic loop design, leading to 19%-53% reduction on token usage. Moreover, further analysis identifies a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality, suggesting retrieval-oriented metrics overstate advanced retrieval benefits. This work provides data-driven insights on when and how to use them for building production-ready intelligent RAG systems.

[IR-6] Recommendation as Generation: Unifying Personalized Video Generation and Recommendation at Industrial Scale

链接: https://arxiv.org/abs/2606.25496
作者: Yanhua Cheng,Bo Wang,Haotian Zhang,Xinyuan Gao,Zhihui Yin,Ben Xue,Yongzhi Li,Jieting Xue,Ye Ma,Minquan Wang,Jiahui Li,Tianyu Xu,Zhiqiang Liu,Xiao Lin,Shiyang Wen,Changcheng Li,Liu Liu,Quan Chen,Peng Jiang,Kun Gai
类目: Information Retrieval (cs.IR)
备注: Project page: this https URL

点击查看摘要

Abstract:Traditional short-video recommendation systems match user interest to a fixed pool of pre-produced videos, which limits their ability to capture fine-grained and dynamic preferences. We propose Recommendation-as-Generation (RaG), a new paradigm that generates personalized videos on demand from inferred user interest. Our framework unifies generative recommendation and video generation through shared semantic IDs (SIDs), which disentangle video representation into content semantics and creative style semantics, enabling both fine-grained modeling of user interest and controllable generation of interest-aligned videos. We further develop Video Generation Agents (VGAs) that are conditioned on inferred SIDs to drive hierarchical planning and refinement for video creation, including visual composition, audio alignment, and artistic effect enhancement. To optimize the framework, we effectively introduce a synergistic cross-domain reward learning mechanism that jointly enforces interest alignment, user feedback, and video quality assessment. We deploy RaG on an industrial-scale platform with over 400 million daily active users and evaluate it in a revenue-critical advertising scenario. Online A/B tests show up to 1.87% ad revenue improvement compared to a strong production GRM baseline, demonstrating its effectiveness in driving further revenue gains beyond generative recommendation. Our results highlight a closed-loop generative system as a promising paradigm for integrating personalized video generation into recommendation. Comments: Project page: this https URL Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2606.25496 [cs.IR] (or arXiv:2606.25496v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2606.25496 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-7] S2-CAR: Segmentation-Supervised Complexity-Adaptive Recommendation

链接: https://arxiv.org/abs/2606.25415
作者: Linjiang Guo,Nitin Bisht,Shiqing Wu,Xianzhi Wang,Guandong Xu
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Sequential recommendation aims to predict user preferences from interaction histories, yet existing models often struggle when behavior patterns become complex and heterogeneous. A key reason is that interaction histories are rarely uniform: users’ interests shift in a latent way over time, yet existing models either treat the full sequence as a homogeneous context or rely on rigid time-window segmentation that misaligns with true intent boundaries. This mis-segmentation not only introduces cross-intent interference at intermediate sequence positions but also leads to over-reliance on short-term interest signals. To address this, we propose S2-CAR, a segmentation-supervised and complexity-adaptive framework for sequential recommendation that models user intent as a continuous latent energy state. Specifically, it uses the Context-Aware Soft Temporal Point Process (Soft-TPP) to segment boundaries triggered by the natural decay of latent-state energy rather than fixed intervals, enabling intent segmentation without fixed time-gap rules. Next, upon this segmentation, a Segment-Count-Adaptive Multi-Intent Extraction module hierarchically aggregates intent-coherent segments into a compact set of multi-interest representations. Extensive experiments on 3 representative public benchmark datasets spanning movie, e-commerce, and gaming domains across 13 baselines demonstrate that S2-CAR consistently outperforms state-of-the-art methods across all datasets and metrics. Further analysis shows that the proposed energy-based segmentation serves as a plug-and-play module, yielding consistent improvements when integrated into existing sequential recommendation backbones.

[IR-8] hree Buddhist Vocabularies: Computational Stylometry of the English Pali Canon across Sutta Vinaya and Abhidhamma

链接: https://arxiv.org/abs/2606.25372
作者: Joy Bose
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 16 pages, 7 figures, 3 tables. code available at this https URL

点击查看摘要

Abstract:We present a computational stylometric analysis of the Tipitaka across all three Pitakas in English translation, extending earlier work on the Sutta Pitaka alone. The corpus spans 134,831 segments from Bhikkhu Sujato’s Sutta Pitaka (114,591 segments, CC0), Bhikkhu Brahmali’s Vinaya Pitaka (7,923 segments, CC0 2026), I.B. Horner’s 1938 Vinaya translation (2,826 segments), three English translations of the Abhidhammattha Sangaha compendium (2,077 segments), and cross-tradition Vinaya texts from the Dharmaguptaka and Mulasarvastivada schools. We compute Zipf rank-frequency distributions with OLS-fitted exponents, Moving Average TTR (MATTR-500), numeral-word density, and vocabulary overlap (Jaccard and Szymkiewicz-Simpson coefficients). Main findings: (1) all corpora show Zipf-consistent distributions (R2 0.989); the Vinaya is closest to ideal Zipf slope -1 and the Sangaha corpus deviates most, with ‘consciousness’ displacing grammatical particles at rank 8; (2) MATTR-500 shows the Sutta and Vinaya Theravada are nearly identical in lexical diversity (0.399 and 0.400), while the Sangaha corpus is genuinely more diverse (0.560), confirmed by size-controlled subsampling; (3) the Sangaha corpus has the highest numeral-word density (3.26%), consistent with its systematic enumeration of mental and material categories; (4) the Mulasarvastivada Vinaya shares 20.0% vocabulary (Jaccard) and 49.1% (overlap coefficient) with the Theravada Vinaya, reflecting shared legal heritage across two millennia; (5) two English translations of the same Vinaya source text share only 24.2% of their vocabulary across 88 years, with ‘musing’ versus ‘absorption’ for jhana and ‘defeat’ versus ‘expulsion’ for parajika as the most diagnostic shifts. All results are point estimates; no significance testing is conducted. Code and data are released as open-source extensions to the Darshana Graph corpus (arXiv:2606.18222).

[IR-9] heoremGraph: Bridging Formal and Informal Mathematics

链接: https://arxiv.org/abs/2606.25363
作者: Simon Kurgan,Evan Wang,Eric Leonen,Sophie Szeto,Luke Alexander,Artemii Remizov,Jarod Alper,Giovanni Inchiostro,Vasily Ilin
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); History and Overview (math.HO)
备注: 31 pages, 9 figures, 21 tables

点击查看摘要

Abstract:Mathematical knowledge is organized around statements and their dependencies, but this structure is exposed unevenly: informal papers cite mostly at the document level, while formal libraries record fine-grained dependencies over a much smaller body of mathematics. We introduce TheoremGraph, a unified statement-level dependency graph spanning both informal and formal mathematics. On the informal side, we parse 11.7M theorem-like environments from mathematics arXiv and recover 18.3M candidate directed dependencies, each labeled by the extractor that proposed it so downstream users can trade coverage for precision. On the formal side, we release LeanGraph, a Lean 4 elaborator-level extractor producing 388,105 declaration nodes and 11.3M typed edges across 25 Lean projects. We bridge the two graphs by embedding generated natural-language slogans into a shared semantic space, linking related statements across papers and across the informal/formal divide; an LLM judge affirms 47,952 such matches above a 0.8 cosine floor, with the judge-acceptance rate rising from 48% across the floor to 87% in the =0.9 tier. On formal concept retrieval, our name-and-signature representation with graph expansion comes within 0.5pp of LeanSearch v2’s reranked Recall@10 (0.775 vs. 0.780) without an LM reranker. We release the dataset, extractors, HTTP API, and MCP interface as infrastructure for mathematical search, attribution, and retrieval-augmented reasoning, available at this http URL and this http URL.

[IR-10] Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents

链接: https://arxiv.org/abs/2606.25361
作者: Yuxin Wang,Paul Thomas,Zhiwei Yu,Yuan Gao,Saeed Hassanpour,Soroush Vosoughi,Robert Sim,Nick Craswell
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Prior research on memory mechanism in RAG-based conversational system has emphasized how memory is stored and retrieved. However, far less is known about how memories with different functional roles influence response quality. Specifically, how they shape an agent’s responses under varying conversational contexts and whether they lead to substantively different response behaviors. Existing evaluations in conversational system are also largely reference-based, insufficiently capturing the nuances in responses that may address users’ preferences differently. In this work, we probe the impact of different memory types in shaping agents’ responses. We present a fine-grained taxonomy of conversational memory, classify retrieved memories into different role types, and design a user-centric evaluation framework that simulates user perspectives. Through comparative experiments on long-term datasets and frontier LLMs, our analysis reveal many differentiated effects of memories: e.g., clarifying memory improves responses’ factual accuracy and constraint awareness, making them more correct and personalized; irrelevant memory reduces topic relevance and degrades constraint awareness. Despite the power of frontier LLMs, these findings shed light on how different memory types can be leveraged to produce more personalized responses and inspire further research in this direction.

[IR-11] Data-Driven Evolution of Library and Information Science Research Methods (1990-2022): A Perspective Based on Fine-grained Method Entities

链接: https://arxiv.org/abs/2606.25320
作者: Chengzhi Zhang,Yi Mao,Shuyu Peng
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Since the 1990s, advancements in big data and information technology have increasingly driven data-centric research in the field of Library and Information Science (LIS). To assess the influence of this data-driven research paradigm on the LIS discipline, this study conducts a fine-grained analysis to uncover the evolutionary trends of research methods within the domain. Using academic papers from LIS published between 1990 and 2022, four key categories of data-driven method entities are automatically extracted: algorithms and models, data resources, software and tools, and metrics. Based on these entities, the study examines the evolution of LIS research methods from three dimensions: the characteristics of research method entities over time, their evolution within different research topics, and the evolutionary features of research method entities across various research methods. The findings highlight data resources as a pivotal driver of methodological evolution in LIS, revealing a cyclical pattern of “emergence-stability/practical application” in the development of research methods within the field.

[IR-12] Measuring Research Difficulty of Academic Papers: A Case Study in Natural Language Processing

链接: https://arxiv.org/abs/2606.25307
作者: Haochuan Li,Jingyuan Li,Yi Zhao,Heng Zhang,Yukai Yang,Zile Hu,Chengzhi Zhang
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:With the rapid growth of the number of academic papers, systematically evaluating the difficulty of research and its relationship to academic impact offers important significance for research topic selection and resource allocation. However, current studies lack quantitative assessments of research difficulty and its correlation with academic impact. This paper proposes a comprehensive evaluation system for research difficulty, incorporating factors such as academic collaboration, content, and references. Taking the field of Natural Language Processing (NLP) as a case study, we extract both internal and external features from academic papers, compute multiple research difficulty indicators. We assign their weights using the entropy weight method and perform a weighted sum to obtain the research difficulty score of academic papers. This paper uses the citation frequency of academic papers to measure academic impact. To validate our approach, NLP experts assessed the difficulty of a sample of papers, and correlation analyses confirmed the reliability of our measurement. Empirical results reveal that in NLP, factors such as the number of pages, reference count, and participation of high-level institutions are significantly associated with academic impact. Moreover, we identify an inverted U-shaped relationship between research difficulty and academic impact. It suggests that moderately difficult research tends to achieve greater academic impact.

[IR-13] Automatic Generation of Highlights for Academic Paper Via Prompt-based Learning

链接: https://arxiv.org/abs/2606.25253
作者: Yi Xiang,Chengzhi Zhang,Heng Zhang
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Highlights provide a concise summary of the main contributions of an academic paper and help readers quickly understand its focus. However, many journals do not provide highlights, which limits their use in literature retrieval, text mining, and bibliometric analysis. Existing studies have explored supervised learning methods for automatic highlight extraction, but these methods usually require large amounts of labeled training data. This study investigates prompt-based learning for automatic highlight generation. We design task-specific prompt templates and combine them with paper abstracts as model inputs. Several language models are evaluated, including locally deployed pre-trained models such as GPT-2 and T5, as well as ChatGPT accessed through an API. Experiments on three datasets show that ChatGPT with prompt templates achieves performance comparable to previous supervised methods without using task-specific training samples. When a small number of examples are added to the prompts, the model significantly outperforms state-of-the-art methods on two datasets. We further analyze how prompt design affects generation quality and find that, although ChatGPT has strong language modeling ability, its performance on this task is highly sensitive to the information provided in the prompt. Case studies also show that the generated highlights are generally coherent, informative, and close to author-written highlights. This study is among the first to apply prompt-based learning to academic highlight generation. The proposed method does not rely on domain-specific training corpora and can generate highlights for papers that lack such information, thereby supporting downstream text mining and bibliometric research.

[IR-14] Adaptive Re-Ranking

链接: https://arxiv.org/abs/2606.25249
作者: Ata Cinar Genc,Emir Kaan Korukluoglu,James Allan
类目: Information Retrieval (cs.IR)
备注: 7 pages

点击查看摘要

Abstract:Modern Information Retrieval (IR) systems typically use a “retrieve-then-rerank” pipeline, where a computationally expensive, pre-determined cross-encoder re-ranks the top results from a fast initial retriever. While effective, this approach often applies heavy re-ranking models regardless of query complexity, resulting in high latency and wasted computational resources on simple queries. We propose Adaptive Re-Ranking, an utility-based labeling framework for cost-aware routing and present empirical evidence (via oracle analysis and a trained baseline router) that per-query routing offers large potential gains but is non-trivial to learn from limited supervision. We train a routing classifier with 3 strategies: sparse retrieval (BM25), dense re-ranking (MiniLM-L6-v2), and heavy neural re-ranking (BGE-v2-m3). Compared to BGE our method achieves 1.15-53x lower median latency and 1.11-5.22x lower mean latency across all datasets we have tested, while delivering -17.5% to +4.0% nDCG@10, which is competitive in some datasets. Our findings show that routing queries based on our novel utility function offers a scalable solution for reducing computational costs and latency in a variety of IR systems.

[IR-15] Extreme Meta-Classification for Large-Scale Zero-Shot Retrieval KDD2024

链接: https://arxiv.org/abs/2606.25237
作者: Sachin Yadav,Deepak Saini,Anirudh Buvanesh,Bhawna Paliwal,Kunal Dahiya,Siddarth Asokan,Yashoteja Prabhu,Jian Jiao,Manik Varma
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at KDD 2024, 20 pages

点击查看摘要

Abstract:We develop accurate and efficient solutions for large-scale retrieval tasks where novel (zero-shot) items can arrive continuously at a rapid pace. Conventional Siamese-style approaches embed both queries and items through a small encoder and retrieve the items lying closest to the query. While this approach allows efficient addition and retrieval of novel items, the small encoder lacks sufficient capacity for the necessary world knowledge in complex retrieval tasks. The extreme classification approaches have addressed this by learning a separate classifier for each item observed in the training set which significantly increases the representation capacity of the model. Such classifiers outperform Siamese approaches on observed items, but cannot be trained for novel items due to data and latency constraints. To bridge these gaps, this paper develops: (1) A new algorithmic framework, EMMETT, which efficiently synthesizes classifiers on-the-fly for novel items, by relying on the readily available classifiers for observed items; (2) A new algorithm, IRENE, which is a simple and effective instance of EMMETT that is specifically suited for large-scale deployments, and (3) A new theoretical framework for analyzing the generalization performance in large-scale zero-shot retrieval which guides our algorithm and training related design decisions. Comprehensive experiments are conducted on a wide range of retrieval tasks which demonstrate that IRENE improves the zero-shot retrieval accuracy by up to 15% points in Recall@10 when added on top of leading encoders. Additionally, on an online A/B test in a large-scale ad retrieval task in a major search engine, IRENE improved the ad click-through rate by 4.2%. Lastly, we validate our design choices through extensive ablative experiments. The source code for IRENE is available at this https URL. Comments: Accepted at KDD 2024, 20 pages Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2606.25237 [cs.IR] (or arXiv:2606.25237v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2606.25237 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM, 2024 Related DOI: https://doi.org/10.1145/3637528.3672046 Focus to learn more DOI(s) linking to related resources

[IR-16] okenMinds: Pretrained User Tokens and Embeddings for User Understanding in Large Recommender Systems

链接: https://arxiv.org/abs/2606.25147
作者: Qingyun Liu,Bo Yan,Yang Liu,Yuji Roh,Ekansh Sharma,Likang Yin,Emma Olowo,Min-hsuan Tsai,Yuxuan Li,Diego Uribe,Saksham Aggarwal,Siqi Wu,Yuan Hao,Vikas Kedigehalli,Lukasz Heldt,Lichan Hong,Li Wei,Xinyang Yi
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:User modeling in industrial recommender systems typically produces dense embeddings, which suffer from representational constraints inherent to fixed-dimensional vectors. An emerging alternative for discrete user representation – using LLMs to generate text-based user tokens – captures topical co-occurrences rather than deep sequential behavior dynamics and produces outputs that are difficult to ground to item attributes. Meanwhile, Semantic ID (SID) based item tokenization has proven effective for improving generalization in generative recommendation, yet discrete SID-based representations for users remain largely unexplored. We propose TokenMinds, an industrial-scale system that extends the PLUM framework from item retrieval to user modeling, generating both discrete SID-based user tokens and dense user embeddings via an encoder-decoder architecture adapted from pre-trained LLMs. This dual-output design provides the complementary benefits of discrete, semantically grounded user representations while maintaining compatibility with existing downstream models that rely on dense embeddings. Additionally, the shared SID vocabulary naturally extends to cross-scenario modeling: by unifying long-form and short-form video behaviors into a single model, we substantially reduce training and serving costs. We validate TokenMinds through extensive offline experiments and live launches on multiple YouTube surfaces, served on full user traffic (billions of users) via an asynchronous infrastructure that decouples representation generation from downstream scoring. Focusing on ranking as the primary downstream use case, our results confirm the practical viability of SID-based user tokens at industrial scale and demonstrate that tokens and dense embeddings provide complementary value across different production ranking systems.

[IR-17] Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification ACL26 ACL2026

链接: https://arxiv.org/abs/2606.23881
作者: Qian Ma,Qiong Wu,Zhengyi Zhou,Yao Ma
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted by ACL 2026 Findings. Project page this https URL

点击查看摘要

Abstract:Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multi modal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks requiring groundings from both fine-grained entity and evidence levels. Most existing multi-modal retrieval augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue both entity-level and fact-level groundings are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple and training-free identify-before-answer IBA framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once correct entity is fixed. Our implementation is made public to ease reproducibility.

[IR-18] HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

链接: https://arxiv.org/abs/2606.23843
作者: Hoang-Bao Le,Aiden Durrant,Thai Son Mai,Binh T. Nguyen,Liting Zhou,Cathal Gurrin
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode “what an image is not” alongside “what it is.” HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.

[IR-19] EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering

链接: https://arxiv.org/abs/2606.23724
作者: Fengchen Gu,Xiaotian Ren,Zhengyong Jiang,Zhilu Zhang,Ángel F. García-Fernández,Angelos Stefanidis,Mian Zhou,Huakang Li,Jionglong Su
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models are increasingly used to answer questions over annual reports, earnings decks, and analyst notes, yet their outputs remain difficult to verify in high-stakes financial workflows. A fluent answer can blend directly grounded statements, weak synthesis, and unsupported claims across narrative text, tables, and charts. We present EvidenceLens, a visual analytics prototype that treats financial question answering as a claim-evidence alignment problem. The system decomposes an answer into atomic claims, summarizes support composition and confidence, support gaps, and coordinates claim-level inspection with source passages, table cells, and chart regions. Its core visual representation is a multimodal claim-evidence matrix that makes coverage, contradiction, and modality imbalance immediately visible. To support reproducibility, we also specify a JSON-based artifact schema, a lightweight multimodal alignment pipeline, and a deterministic review-priority ranking that maps backend signals into an auditable visual structure. Through representative report-auditing scenarios, we show how EvidenceLens helps analysts distinguish grounded claims from overconfident synthesis that conventional chat interfaces flatten.

[IR-20] he Hitchhikers Guide to Agent ic AI: From Foundations to Systems

链接: https://arxiv.org/abs/2606.24937
作者: Haggai Roitman
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Hitchhiker’s Guide to Agentic AI is a comprehensive practitioner’s reference for building autonomous AI systems. The book covers the full stack from first principles to production deployment, organized around a central thesis: building great agentic systems requires understanding every layer of the pipeline, not just one. The book opens with the LLM substrate – transformer architecture, GPU systems, training and fine-tuning (SFT,LoRA, MoE), model compression, and inference optimization – treated as essential foundations rather than the primary focus. It then develops the alignment and reasoning layer: reinforcement learning from human feedback (RLHF), PPO, DPO and its variants, GRPO, reward modeling, and RL for large reasoning models including chain-of-thought and test-time scaling. The second half is devoted to agentic AI proper. Topics include agentic training and trajectory-based RL, retrieval-augmented generation (RAG and Agentic RAG), memory systems (in-context, external, episodic, and semantic), agent harness design and context management, and a taxonomy of agent design patterns. Inter-agent coordination is covered in depth: the Model Context Protocol (MCP), agent skills and tool use, the Agent-to-Agent (A2A) communication protocol, and multi-agent architectures spanning centralized, decentralized, and hierarchical topologies. The book concludes with agent development frameworks, agentic UI design, evaluation methodology for agentic tasks, and production deployment. Each chapter pairs rigorous theoretical foundations with implementation guidance, code examples, and references to the primary literature.

[IR-21] Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

链接: https://arxiv.org/abs/2606.24915
作者: Mohammad Aref Jafari-Raddani
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 4 pages, 1 figure, 2 tables

点击查看摘要

Abstract:End-to-end automatic speech recognition systems frequently hallucinate rare entities and domain-specific terms, especially in low-resource languages. While retrieval-augmented generation frameworks can mitigate these errors using large language models, current architectures face significant challenges. They either rely on standard sparse retrieval that ignores phonetic misrecognitions or utilize heavyweight cross-modal embeddings that introduce high latency. This letter proposes a highly efficient, purely lexical error-aware framework designed to explicitly resolve phonetic and loop hallucinations. Our approach integrates a symmetric text normalization module with a novel error-aware term frequency-inverse document frequency algorithm. By constructing a sparse diagonal penalty matrix based on historical errors, the retriever mathematically prioritizes corrective documents containing specific high-risk misrecognitions. Evaluated on the Persian subset of the FLEURS dataset, our method increased the error-aware hit rate from 53.7% to 90.9%. In end-to-end evaluations, the integrated framework reduced the final word error rate from 23.06% to 18.83%, achieving significant accuracy gains with near-zero inference latency.

[IR-22] Invisible to humans visible to machines: a preregistered audit of Unicode fidelity across four biomedical bibliographic APIs

链接: https://arxiv.org/abs/2606.24897
作者: Przemysław Czuma
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 14 pages, 1 figure. Pre-registered on OSF. Data and code available on Zenodo and GitHub

点击查看摘要

Abstract:Biomedical text mining, scientometrics, and the construction of training corpora for biomedical large language models (LLMs) all assume that the abstract text returned by a bibliographic API faithfully reproduces the published abstract. This pre-registered audit (OSF this http URL) tests that assumption for four widely used public APIs (PubMed E-utilities, Crossref, OpenAlex, Semantic Scholar) against PubMed Central (PMC) JATS XML as a common ground truth. From a complete enumeration of the PMC Open Access subset for 2024 (about 700,000 records), a simple random sample of 4,000 English-language research articles was drawn; for each, we recorded whether Unicode characters from four pre-specified classes present in the JATS abstract (typographic punctuation, mathematical/scientific symbols, Greek letters, special whitespace) were preserved by each API. Two systematic, deterministic losses met the pre-registered criterion (upper 95% CI bound below 5%): the PubMed AbstractText field preserved typographic punctuation in only 0.6% of eligible abstracts (95% CI 0.3-1.0%), and OpenAlex preserved special whitespace in 0% (0.0-0.4%). A blinded mechanism audit attributed the first loss to character substitution and the second to inverted-index serialization. Mathematical symbols and Greek letters were preserved faithfully (over 95%) by all four APIs. Separately, Crossref returned no abstract for 24.6% of papers (coverage 75.4%, 95% CI 74.1-76.7%), concentrated in specific publishers (Elsevier and ACS: 0%). Character-level fidelity is therefore API-dependent and undocumented: the same publisher-deposited JATS text carries different surface signatures depending on the serving API, with direct consequences for tokenization-sensitive bibliometrics, corpus construction, and character-level indicators of LLM-assisted writing.

人机交互

[HC-0] “Zooming In” on Agent ic Web Browsers as Assistive Technologies: A Case Study with a Low-Vision Technology Expert

链接: https://arxiv.org/abs/2606.24870
作者: Laura Colazzo,Giuseppe Anzillotti
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Agentic Web Browsers (AWBs), powered by Large Language Models (LLMs), are emerging as autonomous systems capable of navigating the Web on behalf of users. Beyond enhancing productivity, they could also offer significant promise as Assistive Technologies (ATs) for visually-impaired individuals, transforming web interaction into a fluid conversational exchange. In this paper, we present a case study with a low-vision technology expert, examining how AWBs can support visually-impaired users in web navigation. The findings show that, despite the current limitations, the navigation experience is notably fluid and flexible, underscoring the strong potential of AWBs to enhance accessibility and reduce barriers in web interaction, with implications that may extend beyond accessibility to agentic UX more broadly.

[HC-1] Its Complicated: On the Design and Evaluation of AI-Powered AAC Interfaces

链接: https://arxiv.org/abs/2606.24854
作者: Blade Frisch,Will Wade,Dylan Gaines,Michelle Kinsella,Betts Peters,Tamara Broderick,Keith Vertanen
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Presented at Speech AI for All: The What, How, and Who of Measurement Workshop at the CHI Conference on Human Factors in Computing Systems, Barcelona, Spain, 2026

点击查看摘要

Abstract:Artificial intelligence (AI) can enhance what people who use augmentative and alternative communication (AAC) are able to do with their systems. However, evaluating AI-powered AAC interfaces can be difficult. People are intersectional beings and current evaluation metrics can struggle to capture the multifaceted and nuanced desires people may have for their AAC. We explore the complicated nature of six AAC problem spaces, explore how AI might be used in these spaces, and suggest more robust methods of evaluation that take the intersectional nuances of people into account. We also discuss broader issues that arise across these problem spaces and how they could be addressed using our proposed evaluation methods.

[HC-2] Virtual Simulation for Mental Health

链接: https://arxiv.org/abs/2606.24826
作者: Anna Fang
类目: Human-Computer Interaction (cs.HC)
备注: Doctoral dissertation

点击查看摘要

Abstract:Poorly designed interventions or those deployed without adequate safeguards can harm the communities they aim to serve, thus exacerbating existing vulnerabilities and leaving individuals unsupported. This is especially the case for the mental health context, where there is a growing trend of relying on technological interventions due to their accessibility and ability to deliver large-scale support. However, the mental health context is also particularly sensitive to change and risks of failure are dire; at their worst, failures in mental health interventions can result in lasting negative outcomes for individuals and tragic losses as people fall through the cracks. Thus, enabling safe ways to experiment in the mental health context is vital to allow both individuals and communities to engage with new interventions without risk of their real-world consequences. Virtual simulation, which uses virtual environments to replicate real-world interactions, processes, and behaviors, offers a promising opportunity for enabling safe, controlled experimentation with its ability to accurately replicate social situations, fears, stressors, and the potential outcomes of specific interactions. This work explores how simulation approaches can support emerging mental health processes through (1) evaluating community-level outcomes using agent-based modeling and (2) individual training in the mental health context through embodied, controlled spaces. I demonstrate this use of virtual simulation systems through a grounded human-centered approach, where system design is guided by empirical understanding of current real-world needs and challenges. By leveraging simulation to create environments where mental health strategies can be safely tested and practiced, this work aims to open new possibilities for designing scalable, user-centered systems that are effective and safe.

[HC-3] Assessing Distribution Shift in Human Activity Recognition for Domain Generalization

链接: https://arxiv.org/abs/2606.24781
作者: Rebecca Adaimi,Edison Thomaz
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 22 pages with references

点击查看摘要

Abstract:While the field of Human Activity Recognition (HAR) continues to draw interest from researchers and advance in important ways, some key challenges remain. One of the most difficult aspects of building HAR models that show good performance in real-world settings is dealing with data diversity from device and sensor heterogeneity, and contextual changes that are intrinsic to real-world applications. While data diversity in HAR has been well-acknowledged in the literature, there remains a gap in understanding the effect of various types of distribution shifts on HAR models and the domain generalization problem that arises. Towards that end, this paper systematically evaluates 4 different types of distribution shifts, including variations in device type, sensor placement, sampling rate, and user behavior. Quantifying their effects, we illustrate that diversity shifts predominantly define all types of shifts, indicating the existence of unique features that are not shared across different domains. We then introduce a uniform HAR-based distribution shift benchmarks and conduct a comprehensive evaluation of up to 28 domain generalization methods. Our analysis exposes the limitations of current domain generalization algorithms in achieving model generalizability, marginally outperforming the empirical risk minimization baseline. This work represents the first systematic exploration of domain generalization and adaptation concerning specific distribution shifts in sensor-based HAR, offering an open-source benchmark platform and datasets to spur further research.

[HC-4] Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent -Supported Interface

链接: https://arxiv.org/abs/2606.25941
作者: Faliang Yin,Hak-Keung Lam,David Watson
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Increasing demand for precise and reliable control in complex scenarios has led to the development of increasingly sophisticated controllers, including data-driven approaches employing closed box models and mathematically rigorous yet complex designs. This complexity highlights the needs for explainable control that can provide human-understandable insights into controller behavior. In this paper, an explainable control framework (XCF) along with supporting algorithms and user interface are proposed to explain how controllers determine their control actions and their underlying working mechanism. The novel contributions of this work are threefold: First, the XCF is designed to provide model-agnostic explanations for controllers in closed-loop systems and can optionally refine local explanations by system response dynamics. Second, a novel explanation method, hierarchical fuzzy model-agnostic explanation for control systems (HFMAE-C), is proposed based on the designed framework. The HFMAE-C employs a fuzzy logic system to approximate the controller’s behavior and system dynamics, providing sample, local, domain and universe level explanations via IF-THEN rules revealing the controller’s decision logic and salience values quantifying the contribution of system states to control actions. Third, a large language model agent-supported user interface is developed to automatically analyze user requirements, select appropriate algorithms, interpret the generated explanations to a natural language report, and provide interactive consultation. Case studies on inverted pendulum system and Turtlebot obstacle avoidance demonstrate the effectiveness of the proposed method through simulated user experiments and quantitative comparisons with mainstream explainable control approaches.

[HC-5] hemis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback

链接: https://arxiv.org/abs/2606.24622
作者: Andreas Chouliaras,Luke Connolly,Dimitris Chatzpoulos
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: The extended version of a paper published at the 2026 IEEE Conference on Artificial Intelligence (CAI). Includes an additional appendix with extended derivations and supplementary results. The main paper has 8 pages, 6 figures, 1 table

点击查看摘要

Abstract:Training safe Reinforcement Learning (RL) systems is inherently challenging, with no guarantee of avoiding unwanted behaviors. The most effective defenses against this are (i) transparency through explainability and (ii) alignment via human feedback. While both show promising results, no publicly available framework currently combines them. To address this, we introduce Themis, an XAI-enabled testing and evaluation framework for Reinforcement Learning from Human Feedback. Themis supports over 200 widely used environments and is easily configurable for experiments in RL, transparency, and alignment. Our results show that Themis can train reward models that match or outperform the environment’s true reward signal using human preferences. We also provide a cloud-based platform for collecting human feedback and managing experiments. It is user-friendly, auto-scalable, and supports large participant groups across multiple experiments without extra development overhead. Tests show Themis can support one thousand users in back-to-back experiments on a modest commercial machine.

[HC-6] Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation IJCAI2026

链接: https://arxiv.org/abs/2606.24515
作者: Marta Sumyk,Oleksandr Kosovan
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to the 4th International Workshop on Generalizing from Limited Resources in the Open World (GLOW @ IJCAI 2026)

点击查看摘要

Abstract:Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely provide scalable, machine-readable reward signals: task success is often visually grounded and hard to specify with handcrafted reward functions or dense manual labels. We propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for GUI agents. Given a final screenshot and the original instruction, a Vision-Language Model judges task completion and provides terminal feedback without task-specific heuristics or manual labels during policy optimization. Because autonomous evaluators are imperfect, we model their feedback as a noisy binary reward channel and derive a noise-corrected reward estimator for Proximal Policy Optimization. Experiments across macOSWorld, Windows Agent Arena, and OSWorld show that corrected evaluator rewards outperform both zero-shot baselines and raw evaluator rewards, improving success rates by an average of 12.6 percentage points over zero-shot performance and 5.1 points over raw evaluator fine-tuning. These results suggest that autonomous evaluation can serve as a practical reward signal for RL in GUI environments when evaluator noise is explicitly modeled and corrected. Comments: Accepted to the 4th International Workshop on Generalizing from Limited Resources in the Open World (GLOW @ IJCAI 2026) Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC) Cite as: arXiv:2606.24515 [cs.AI] (or arXiv:2606.24515v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.24515 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-7] Dissociable Spatial and Temporal Effects of Interaction Latency in Virtual Reality

链接: https://arxiv.org/abs/2606.25681
作者: Xiaoye Michael Wang,Catherine M. Sabiston,Timothy N. Welsh
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Motion-to-photon latency is inherent in immersive virtual reality (VR) systems and can arise from multiple sensorimotor loops, including view-contingent latency between head movement and display update and interaction latency between hand movement and the virtual effector. Although prior work shows that interaction latency can impair VR performance, it remains unclear whether common spatial, temporal, and efficiency measures reveal the same latency-related disruption. This study addressed this question by experimentally imposing delays between the physical and virtual hands during manual pointing in VR. Participants pointed to targets on a horizontal surface in VR and in the physical environment as an unmediated baseline. In VR, pointing was performed with a virtual hand avatar controlled by a motion capture pipeline, and additional delays (0-500 ms) were imposed between the participant’s hand movement and the rendered movement of the virtual hand. Relative to the baseline, performance in VR showed greater endpoint error, longer movement time, greater endpoint variability, and lower throughput. Within VR, added interaction latency further increased endpoint error and variability, reduced throughput, and altered movement time, but these effects followed different profiles: endpoint error increased even at the shortest delays, whereas movement time remained stable at short delays and increased primarily at longer delays. These findings show that interaction latency produces dissociable spatial and temporal consequences in immersive VR, such that endpoint accuracy revealed disruption before movement time or throughput. Thus, latency-sensitive VR interactions cannot be fully evaluated using movement time or efficiency measures alone. Instead, HCI evaluations should assess both spatial and temporal performance, particularly when VR tasks involve visually guided manual actions.

[HC-8] Reason able Motion: A General ASP Foundation for Environment Constrained Movement Trajectory Computation

链接: https://arxiv.org/abs/2606.25626
作者: Julius Monsen,Jakob Suchan,Mehul Bhatt,Lars Karlsson
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Accepted at: LPNMR 2026 - 18th International Conference on Logic Programming and Non-monotonic Reasoning, 7 - 11 September 2026 - Klagenfurt, Austria

点击查看摘要

Abstract:We present a general answer set programming based hybrid quantitative-qualitative method for computing constrained branching trajectory modes for moving objects in real-world settings. The method performs constrained traversal of an environment graph, enumerating geometrically admissible motion behaviours as stable models, each constituting a distinct trajectory mode characterised by both domain-dependent and independent factors such as derived event sequence, map topology, and domain norms. The hybrid trajectory computation method is generally applicable across motion characteristics typically encountered in diverse dynamic domains with moving objects, e.g., autonomous driving. We demonstrate applicability and highlight how computed trajectories are traceable to their underlying stable model, thereby affording verifiable interpretability that purely learned approaches cannot provide. We also perform an empirical evaluation with Argoverse 2, a large-scale real-world autonomous driving benchmark representative of the class of dynamic domains within the scope of the proposed method.

[HC-9] When LLM Rationales Become User-Facing: Effects on Trust Perception Decision-Making and Gaze Behaviors

链接: https://arxiv.org/abs/2606.25489
作者: Xin Sun,Ting Pan,Yajing Wang,Shu Wei,Jos A. Bosch,Isao Echizen,Abdallah El Ali,Saku Sugawara
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly show step-by-step reasoning rationales alongside their answers, turning reasoning from an internal model capability into a user-facing interface feature. Yet it is unclear whether such rationales help users judge when trust is warranted or merely persuade through fluent reasoning. We address this gap through the lens of auditable trust calibration: user-facing rationales should help people inspect whether an answer is warranted by evidence. We test this framing in factual verification through two linked studies. Study 1, an online experiment (N=68), manipulated rationale presentation format (instant, delayed, on demand), rationale correctness (correct, incorrect), and certainty framing (none, certain, uncertain). Study 2, a controlled eye-tracking study (N=54), examined how no-, correct-, and incorrect-rationale conditions were associated with users’ trust, decision-making, and eye-movement patterns. Study 1 showed no reliable presentation-format effects; instead, rationale correctness and certainty framing influenced the trust in the information, trust in the LLM system, and decision confidence. In Study 2, incorrect rationales drew more attention to the supporting evidence and larger pupil diameter while the rationale was viewed, consistent with greater cognitive effort. Incorrect rationales also lowered trust in LLM system relative to showing no rationale, whereas the no-rationale difference was weaker for trust in information. A post-hoc predictive modeling analysis of gaze data from Study 2 further showed that gaze features carried predictive signal for trust- and decision-related user states. This work challenges the assumption that more reasoning is always better and supports rationale designs that are selective, linked to evidence, calibrated in how they express certainty, and easier to verify.

[HC-10] AI Coaching for Accelerating Human Skill Development with Reinforcement Learning

链接: https://arxiv.org/abs/2606.25337
作者: Wei Wang,Enlin Gu,Antonio Loquercio,Haimin Hu,Rahul Mangharam
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI copilots can substantially boost human performance through shared control, but excessive assistance can induce over-reliance and skill atrophy. This paper studies how an embodied AI agent can act as a coach that accelerates human motor-skill development. We argue that effective coaching requires strategic scaffolding and stepping back that are aligned with the learner’s capability, allowing productive failures that drive learning. We formalize the interactive AI coaching process as a non-cooperative dynamic game in which the learner optimizes task performance while the coach targets the learner’s independent competence. Building on this formalism, we develop a reinforcement learning framework combining adaptive shared control with probabilistic models of the coach’s causal influence on skill evolution, enabling tractable training of coaching policies. A comprehensive user study (N=33) on first-person-view drone racing shows significant gains in human learning outcomes over state-of-the-art AI coaching baselines.

[HC-11] he Digital Pirahã Condition: Ecological Mismatch and the Reconstruction of Recursive Cognition

链接: https://arxiv.org/abs/2606.25287
作者: Dhushy Thillaivasana,Samar Shailendrab,Kristina Nichollsc,Deepani Guruged
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 1 figure

点击查看摘要

Abstract:Contemporary digital and AI-mediated environments are reshaping the cognitive ecologies within which human reasoning develops. As everyday activity becomes embedded in datafied infrastructures, cognitive habits adapt to conditions of immediacy, fragmentation, externalisation, and algorithmic filtering. This paper introduces the Digital Pirahã Condition, a cultural ecological model explaining how these environments cultivate adaptive but shallow cognitive patterns, epistemic flattening, reduced recursive capacity, and heightened reliance on external scaffolds. While functional within digital systems, these adaptations create an ecological mismatch with the recursive, integrative reasoning required in academic and institutional activity systems. The paper argues that this mismatch is an ecological outcome rather than a psychological deficit, and that addressing it requires intentional cognitive niche construction within educational institutions. The lecturer is conceptualised as a cultural entrepreneur who reconstructs the cognitive ecology of learning through analog sanctuaries, AI-supported metacognitive scaffolds, and recursive curriculum architectures. The Digital Pirahã Condition thus provides a theoretical lens for understanding contemporary cognitive change and a framework for ecological redesign in AI-mediated societies.

[HC-12] Co-designing a Preliminary Repository of Augmented Reality Concepts for Real-Time Emotion Regulation

链接: https://arxiv.org/abs/2606.25271
作者: Graciela Camacho-Fidalgo,Edgar Rojas-Muñoz
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Augmented Reality (AR) can be a positive therapeutic approach to support mental health and emotion regulation. Although AR techniques for therapeutic support exist, there is no user-centered, expert-informed understanding of how real-time AR designs can support people in emotional distress without disengaging them from their ongoing activities. This lack of reusable design resources hinders the adoption of AR for mental health support. This paper addresses this gap by introducing a co-designed collection of AR interventions describing how this technique can support real-time emotion regulation. The repository was created following a two-phase participatory design process. Phase 1 recruited 40 anxiety-prone individuals and used the Nominal Group Technique to list ideas on how AR affordances could support emotion regulation. Phase 2 recruited 10 mental health professionals to organize these ideas into thematic clusters and assess their clinical feasibility. The resulting AR design repository, grounded in user perspective and clinical expertise, identifies eight thematic clusters and 106 design ideas. This work represents a first step towards the development of seamless real-time AR interventions for mental health.

[HC-13] FUTO Swipe: Layout-Agnostic Neural Swipe Decoding

链接: https://arxiv.org/abs/2606.25247
作者: David Lee Miller,Aleksandras Kostarevas
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural swipe decoders are typically tied to the keyboard they were trained on, requiring a new corpus and training run for each layout. In this report, we document our approach toward training models that can function on any contiguous mobile keyboard layout. At each point along the swipe, our encoder predicts whether the user is indicating a character and where on the keyboard that character lies. The keyboard layout is supplied at inference time and used to map the spatial and temporal prediction to a logit at each key, rather than being learned during training. Training neural models requires substantial data, but public swipe data is limited, particularly for non-QWERTY layouts. We release this http URL, the largest MIT-licensed swipe corpus we are aware of, containing over 1M donated swipes from more than 12k donor sessions. To generalize beyond the English QWERTY layout, we apply geometric augmentations to both the swipe trajectory and the keyboard layout at every training step, forcing the model to make predictions based on characteristics of the swipe gesture rather than the training layout. The model generalizes to layouts absent from training, in some cases more accurately than the layout it was trained on. This combines the layout-flexibility of an algorithmic decoder with the accuracy of a neural model. Trained models are publicly available. Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG) Cite as: arXiv:2606.25247 [cs.HC] (or arXiv:2606.25247v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2606.25247 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-14] ARTOO-DARTU: Studying AR-HRC With AR Obstruction Mitigation During a Warehouse Task

链接: https://arxiv.org/abs/2606.25202
作者: Christian Fronk,Hanting Ye,Zhehan Qu,Maria Gorlatova
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: To appear in Proceedings of the ACM on Human-Computer Interaction, Vol. 10, No. 5, Article MHCI7668, MobileHCI 2026

点击查看摘要

Abstract:Human-robot collaboration (HRC) often requires robot intentions and internal states to be conveyed to users for task efficiency and safety. Recently, augmented reality (AR) situated analytics provide such real-time robot feedback in HRC contexts. However, AR situated analytics can obstruct important real-world elements, posing safety and usability risks, especially when content is dynamically positioned relative to movements of mobile robots in a warehouse HRC scenario. In this paper, we introduce the Augmented Reality Technique Of Obstruction Deterrence while Aiding Robotic Teaming for Users (ARTOO-DARTU), an AR system tailored specifically for warehouse HRC that enables real-time robot situated analytics and control while preserving visibility of the real world through an obstruction detection and mitigation pipeline (ODM) that is uniquely suited for AR-HRC. To evaluate ARTOO-DARTU, we developed Pocket MonstARs, a controlled gamified abstraction of HRC warehouse inventory picking in which virtual monsters serve as proxies for pick targets, while labeled and object-marked boxes preserve the real-world identification demands of the picking task. In a 34-participant user study, we found that our designed AR situated analytics yielded a 46% increase in efficiency on the overall HRC task, but only when the ODM was active. Participants with the ODM active were also 61% faster on subtasks requiring visibility of the real world. Our findings demonstrate that, when paired with our developed ODM to prevent real-world obstructions, the situated analytics in ARTOO-DARTU can significantly enhance efficiency and user experience in AR-HRC warehouse scenarios.

[HC-15] EveLoad: Cognitive Workload Recognition from Event-Based Eye Movements

链接: https://arxiv.org/abs/2606.25177
作者: Guorui Lu,Shaohua Guan,Zhen Xu,Qinyu Chen
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注: 10 pages, 6 figures, intended to submit as a IEEE transaction paper

点击查看摘要

Abstract:Cognitive workload monitoring is important for adaptive rehabilitation and assistive interfaces, where task difficulty, pacing, and feedback should be adjusted according to the user’s cognitive state to avoid overload and under-challenge. Emerging extended reality and robot-assisted rehabilitation environments provide controllable training tasks, but they require unobtrusive sensing methods that can capture rapid ocular dynamics during interaction. Existing eye-movement-based cognitive workload recognition methods mainly rely on frame-based eye trackers, which often suffer from limited temporal resolution and degraded robustness under rapid eye movements. In contrast, event cameras provide microsecond-level temporal resolution, high dynamic range and low latency, making them suitable for capturing fine-grained ocular dynamics. Many previous studies rely on free-viewing or similar paradigms, where gaze locations can vary across tasks. As a result, models may learn associations between gaze-location distributions and cognitive workload, rather than workload-related eye movement characteristics themselves. In this work, we introduce EveLoad, which, to the best of our knowledge, is the first event-based eye-movement dataset with graded cognitive workload annotations, collected from 20 healthy participants under spatially constrained and task-driven conditions using a controlled N-back-guided fixation paradigm. Based on this dataset, we establish a benchmark for cognitive workload recognition with six workload levels and propose a learning framework that encodes spatiotemporal event representations. Experimental results show that our approach achieves an average subject-specific accuracy of 96.36% and 96.13% under mixed random split evaluation. These results suggest that event-based eye movements may provide a useful sensing pathway for future workload-aware rehabilitation.

[HC-16] fARfetch: Enabling Collocated AR-HRC in Large Visually Diverse Environments with VLM-Driven AR Content Adaptation

链接: https://arxiv.org/abs/2606.25162
作者: Christian Fronk,Hanting Ye,David Hunt,Miroslav Pajic,Maria Gorlatova
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted to the 2026 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). Author accepted manuscript

点击查看摘要

Abstract:Augmented Reality (AR) can improve collocated human-robot collaboration by making robot state and intent visible and enabling intuitive control, yet large, visually diverse environments like the outdoors challenge both interaction and content legibility, especially at long distances and beyond visual line of sight. We present fARfetch, an AR-HRC system that integrates (i) shared semantic environment mapping across an AR headset and robot that visualizes detected landmarks in AR to support landmark-grounded go-to commands, (ii) a context-aware world-in-miniature representation of the shared environment for fine-grained path authoring, and (iii) vision-language-model driven AR view management that jointly adapts virtual content color, size, and orientation to maintain legibility in large visually diverse environments. We implement fARfetch with a Meta Quest 3 headset and Unitree Go2 quadruped robot, and conduct a within-subjects user study (N=13) on a real-world large-scale (30.5m) outdoor inspection task. fARfetch yielded significantly faster completion times than a non-AR baseline (66%) and significantly lower workload in mental demand (-43%), temporal demand (-34%), and frustration (-66%). A custom legibility survey indicated fARfetch effectively maintained virtual content legibility in the large outdoor environment.

[HC-17] Proactive Systems in HCI and AI: Concepts Challenges and Opportunities

链接: https://arxiv.org/abs/2606.25149
作者: Nima Zargham,Sharon Ferguson,Jaisie Sin,Cosmin Munteanu,Anastasia Kuzminykh
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The last few years have seen a significant rise in interest in highly autonomous and proactive systems, fueled by advances in AI. Systems that anticipate user needs, take initiative, and act without explicit user input. Such systems span a wide range of applications, from smart lighting that adapts to user activity to assistive robots that plan actions in advance to intelligent thermostats that learn routines and adjust environments proactively. Despite this breadth, the concept of proactivity remains loosely defined and inconsistently applied across research and practice. Current usage of the term often conflates fundamentally different system behaviors. For instance, simple reminders or recommendation systems are frequently labeled as proactive, even though underlying mechanisms and intentions differ significantly. This conceptual ambiguity limits our ability to systematically design, compare, and evaluate proactive systems. Moreover, existing methodologies for design and evaluation are largely rooted in reactive interaction paradigms, failing to address the unique challenges posed by proactive behavior, including timing, appropriateness, user control, transparency, and trust. This multidisciplinary workshop aims to establish a clearer and more rigorous foundation for understanding proactive systems. We bring together researchers and practitioners from Human-Computer Interaction, AI, and related fields to (1) develop a shared conceptualization of proactivity, (2) identify gaps and limitations in current design and evaluation approaches, and (3) co-create human-centered guidelines and research directions for future systems. Through interactive discussions and collaborative activities, the workshop seeks to map key challenges and opportunities, ultimately advancing robust and consistent frameworks for designing and evaluating proactive technologies.

[HC-18] Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLM s)

链接: https://arxiv.org/abs/2606.23840
作者: Marvin Pafla,Jesse Hoey,Kate Larson,Mark Hancock
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages

点击查看摘要

Abstract:Explainability is often framed as a property of an AI model, with explanations extracted from its internals and shown to users. In this argument paper, we instead provide an embodied account of explainability based on Dourish and enactivist cognition: understanding is created in use as people act on affordances in shared practice. Using demonstrations and conceptual analysis, we reveal ontological obstacles when “looking inside” large language models: surrogates import external abstractions that can be mistaken for the model’s, and focusing on internal reasoning misses that explainers participate in their own understanding. We discuss these obstacles in XAI practice, arguing that many explanations are misnamed, which skews their purpose and can increase overreliance. Finally, we highlight how embodied explanations reorganize sense-making by making what matters publicly available for action, and argue that explainability claims should be reserved for designs that provide affordances to probe, coordinate, and repair behaviour in situated practice.

[HC-19] n Digits on a Train: AI-Assisted Verification of Two Eigenvalue Problems

链接: https://arxiv.org/abs/2606.23821
作者: Matthew J. Colbrook
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Spectral Theory (math.SP)
备注:

点击查看摘要

Abstract:Accurate numerical eigenvalues are often difficult to certify, especially in singular or non-normal settings. This article reports a human–AI collaboration on two such computations. For a singular self-adjoint Schrödinger operator, a verified zero count and Dirichlet–Neumann bracketing certify the complete negative spectrum to ten decimal places. For a delicate non-normal atom–molecule benchmark, a previously unresolved resonance pair is separated, with each member enclosed to ten digits. The second result is achieved not by increasing the precision of one-way shooting, but by reformulating the problem as a global matching system for projective solution lines. The infinite tail is encoded as uncertainty in the terminal projective data, and a componentwise, tail-robust Krawczyk–Brouwer inclusion supplies the certificate. This gives a reusable architecture for analytic boundary-value systems with ill-conditioned propagation and uncertain asymptotic data. The collaboration also exposes the strengths and limits of AI assistance. AI rapidly produced accurate candidates and plausible proof strategies, but several failed, including one apparently complete tail argument that omitted the componentwise check required by a nonuniform polydisc. Validated computation is a stringent test of AI-assisted mathematics: the output is not merely a number, but a number with a proof. These examples show why the proof object matters, and why human mathematical judgment remained decisive. More broadly, as AI makes code, exposition, and plausible numerical claims inexpensive, standards for verification, attribution, peer review, and training must adapt. The implications are unsettling; the opportunity is extraordinary.

计算机视觉

[CV-0] DiffusionBench: On Holistic Evaluation of Diffusion Transformers

链接: https://arxiv.org/abs/2606.24888
作者: Xingjian Leng,Jaskirat Singh,Zhanhao Liang,Ethan Smith,Martin Bell,Aninda Saha,Yuhui Yuan,Liang Zheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion transformer (DiT) research on image generation has converged to a single evaluation setup: class-conditional generation on ImageNet. While methods improve the FID and related metrics, it is increasingly unclear whether they reflect real progress in generative modeling. The natural alternative, i.e., text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both ImageNet and T2I setups. Under NanoGen, training T2I requires comparable compute to ImageNet. After training 21 latent diffusion models with NanoGen, we observe that method ranking shows no strong correlation between ImageNet and T2I generation: Pearson correlation is between -0.377 and -0.580 across three metrics. This suggests that a method which improves class-conditional ImageNet FID may show no corresponding improvement on T2I, clearly indicating the necessity of evaluating DiTs on both tasks. To this end, we summarize ImageNet and text-to-image results, which yields DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.

[CV-1] BenchX: Benchmarking AI Models for Cancer Detection and Localization with Demographic and Protocol Biases

链接: https://arxiv.org/abs/2606.24883
作者: Qi Chen,Wenxuan Li,Pedro R. A. S. Bassi,Xinze Zhou,Jakob Wasserthal,Ibrahim Ethem Hamamci,Sezgin Er,Ashwin Kumar,Yiwen Ye,Yuhan Wang,Yuyin Zhou,Akshay S. Chaudhari,Curtis Langlotz,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlighting the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code

[CV-2] FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

链接: https://arxiv.org/abs/2606.24876
作者: Orest Kupyn,Goutam Bhat,Philipp Henzler,Fabian Manhardt,Christian Rupprecht,Federico Tombari
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, oftentimes leading to poor gradient flow. FLAT solves with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at this https URL

[CV-3] FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

链接: https://arxiv.org/abs/2606.24874
作者: Haorui Ji,Weizhe Liu,Hongdong Li,Hengkai Guo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse voxel representation has emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve high-frequency visual details of input images due to two structural bottlenecks. First, they adopt discriminative 2D features optimized for semantic abstraction to construct sparse voxel latents, which suppress reconstructive cues and induce a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that boosts both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT) and couple it with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework, which integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.

[CV-4] IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

链接: https://arxiv.org/abs/2606.24849
作者: Zixuan Li,Haokun Lin,Yicheng Xiao,Zhiwei Li,Xinyang Song,Zelong Zheng,Yong He,Heng Yao,Ke Ding,Chao Yu,Chuan Yuan,Qi Li,Zhenan Sun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses demonstrate that the learned structural and semantic queries play complementary roles in structure-aware generation.

[CV-5] Spherical-to-ERP Epipolar Rectification for Single-Axis Disparity in 360 Stereo

链接: https://arxiv.org/abs/2606.24847
作者: Sahereh Obeidavi,Dieter Landes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 Pages, 4 Figures, Conference

点击查看摘要

Abstract:Omnidirectional stereo images provide full-surround perception but violate the geometric assumptions of classical disparity estimation: in spherical or fisheye views, epipolar correspondences follow curved great-circle paths, producing two-dimensional displacements that cannot be treated as single-axis disparity before geometric rectification. In this work, we adopt a standard spherical-to-equirectangular (ERP) projection as a preprocessing step, which straightens epipolar curves and restores a one-dimensional disparity structure - horizontal for left-right rigs and vertical for top-bottom rigs. Building on our previously introduced RAFT + Epipolar-Aligned Channel Selection (EACS) framework, originally developed for rectilinear and ERP stereo, we examine whether the same modular pipeline remains accurate when the input originates from spherical stereo imagery. After ERP projection, dense optical flow from RAFT is reduced to disparity by retaining only the baseline-aligned flow component. Experiments on synthetic fisheye stereo datasets show that this spherical-to-ERP-to-RAFT+EACS pipeline produces accurate, smooth, and structurally consistent disparity maps at real-time speed. These findings confirm that established ERP preprocessing can be effectively combined with our earlier RAFT+EACS method to enable practical, interpretable, and efficient disparity estimation from spherical stereo, providing a straightforward pathway for extending conventional stereo pipelines to 360 imaging.

[CV-6] Bridging the Manifold Gap: Riemannian Residual Line Search for One-Step Image Editing

链接: https://arxiv.org/abs/2606.24844
作者: Hongzhu Yi,Zhongtian Luo,Tong Li,Yiyan Fan,Jungang Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:One-step diffusion editors are fast because they avoid inversion and iterative optimization, but a single transport update must be aggressive enough to realize the target prompt and conservative enough to preserve the source image–and no fixed update strength satisfies both demands across edit types. We treat this tension as a post-hoc candidate-selection problem on top of energy-field transport rather than as a new editing model. Our proposed method, Riemannian Residual Line Search, first builds a stronger edit by estimating the local time curvature of the prompt-delta field and projecting the corrected direction back onto the update norm of the original first-order energy-field transport estimation. It then forms a small residual path from the source image to this strong edit, retains the original first-order output as one candidate, and picks the final image by maximizing target-prompt CLIP alignment. On a 700-sample PIE-Bench++ evaluation across 10 edit type IDs, our method achieves state-of-the-art (SOTA) performance among current one-step update algorithms.

[CV-7] GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction

链接: https://arxiv.org/abs/2606.24829
作者: Chenrui Fan,Paolo Favaro
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 17 figures, 18 tables

点击查看摘要

Abstract:Camera-prompted text-to-video (T2V) models are increasingly used to synthesize virtual camera captures, such as orbiting objects or moving through static scenes. For these outputs, visual plausibility is insufficient: the generated frames should also provide coherent multi-view evidence for a single static 3D scene. We introduce GeoT2V-Bench, a reconstruction-based diagnostic benchmark for evaluating whether camera-prompted T2V clips can support explicit rigid 3D reconstruction. Our pipeline estimates per-frame camera intrinsics and poses with VGGT-style geometry estimation, fits DeformableGS, derives a static MedianGS proxy by temporal-median aggregation, and renders this proxy along the estimated camera path. Instead of producing a pass/fail label or a single scalar score, GeoT2V-Bench reports a continuous reconstruction profile covering apparent image motion, estimated trajectory behavior, MedianGS static rendering error, static-render flow agreement, and the gap between flexible and static fits. On a fair-format four-seed evaluation with 3,840 completed reconstructions from 12 open-weight model configurations and 80 GeCo-Eval static-scene prompts, we find that visible motion, static rendering error, flow agreement, and flexible-vs-static behavior often disagree. GeoT2V-Bench therefore captures complementary failure modes that emerge when generated videos are tested as global static-scene acquisitions.

[CV-8] High-Fidelity Synthetic Transmission Electron Microscopy Image Generation Using Diffusion Probabilistic Models for Data-Limited Semiconductor Metrology

链接: https://arxiv.org/abs/2606.24817
作者: Johannes Boehm,Bappaditya Dey
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: To be presented at the 2026 International Symposium ELMAR, published by IEEE in the conference proceedings

点击查看摘要

Abstract:Advanced semiconductor nodes drastically increased demand for Transmission Electron Microscopy (TEM), yet destructive sample preparation, slow imaging and high costs severely limit the availability of diverse datasets needed for downstream machine learning (ML). Synthetic data generation is becoming essential, but current generative models often miss TEM-specific noise, structural detail, and stochastic variability crucial for evaluation. We present a Denoising Diffusion Probabilistic Model (DDPM) framework for synthetic TEM image generation under extreme data scarcity. A progressive patch-based training strategy scales from low-resolution patches to full images, enabling from-scratch training with only 15 samples. We integrate a custom TrivialAugment adaptation, cross-process domain transfer, classifier guidance, and RePaint-style inpainting, culminating in full-image generation that preserves global structural and spatial relationships in compliance with FAB metrology requirements. Beyond synthesis, we repurpose DDPM feature representations for segmentation, partitioning encoder feature maps to obtain coherent region masks. Our synthetic images achieve up to MS-SSIM 0.98 and qualitative expert assessment consistent with structural similarity results, facilitating downstream ML training for defect detection, segmentation, and metrology while preserving statistical and physical realism.

[CV-9] DDStereo: Efficient Dual Decoder Transformers for Stereo 3D Road Anomaly Detection

链接: https://arxiv.org/abs/2606.24805
作者: Shiyi Mu,Zichong Gu,Zhiqi Ai,Yilin Gao,Shugong Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stereo-based 3D object detection still faces two critical safety challenges: real-time performance and open-set generalization. Existing stereo 3D methods typically achieve twice the accuracy of monocular methods but suffer from significantly lower inference speeds, making them unsuitable for real-time applications. Meanwhile, recent advances in open-world detection have introduced open-set and open-vocabulary algorithms in monocular 2D and 3D settings, yet stereo-based open-set detection remains largely unexplored. To bridge this gap, we propose DDStereo, a novel Dual-Decoder Stereo Transformer for real-time open-set 3D object detection. DDStereo features two lightweight decoder branches: one for open-set foreground 2D detection and the other for 3D attribute regression. These decoders share object-level queries to achieve unified target-level alignment. To enhance inference efficiency, we designed a compact disparity feature extractor and a streamlined decoder architecture. Experiments on public stereo 3D benchmarks demonstrate that DDStereo achieves state-of-the-art accuracy under both closed-set and open-set protocols. Notably, our method surpasses existing stereo 3D detectors in inference speed and, for the first time, achieves real-time performance comparable to monocular approaches.

[CV-10] OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

链接: https://arxiv.org/abs/2606.24799
作者: Chenrui Fan,Paolo Favaro
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 40 pages, 33 figures, 19 tables

点击查看摘要

Abstract:Generic text-to-video models can be used as rich open-world scene priors. Despite the high quality of today’s generated videos, they do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, and frames often contain inconsistencies across time. We introduce OrbitForge, an adapter built from frozen video priors and per-prompt Gaussian Splatting reconstruction optimization that converts a single text-generated video into a canonical closed-orbit 3D Gaussian Splatting scene. We use 3D reconstruction as an anchor to improve the 3D consistency of the generated video. We obtain a preliminary 3D reconstruction from a first generated video via Deformable Gaussian Splatting with a robust MedianGS proxy. We render views from a prescribed orbit to detect missing viewpoints. OrbitForge uses the text-to-video model to complete only the missing views, and reconstructs the completed orbit into a final Gaussian Splatting scene. This design requires no task-specific video or multiview fine-tuning, avoids per-prompt score-distillation optimization, and does not progressively generate views one step at a time. We further argue that this setting demands coverage-aware evaluation: local smoothness alone rewards methods that never attempt a full orbit. On a frozen 300-prompt T3Bench-derived audit, OrbitForge reconstruction attains a 359.0-degree measured median span, raises originally unsupported-bin Q10 ImageReward from 8.07 to 16.36 relative to MedianGS-only reconstruction, while remaining competitive with VideoMV on the coverage-quality.

[CV-11] EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence

链接: https://arxiv.org/abs/2606.24797
作者: Linpeng Huang,Weixing Chen,Zexin Chen,Yang Liu,Liang Lin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evaluation protocol in which each QA pair is explicitly annotated with supporting temporal evidence, thereby requiring joint reasoning and precise evidence localization. EG-VQA is comprised of 2,067 videos and 11,838 QA pairs with fine-grained evidence annotations. To evaluate predicted evidence, Evidence-Grounded F1 (EG-F1) is introduced as a unified metric in which temporal alignment and semantic consistency against ground-truth evidence are jointly measured. Experimental evaluation reveals that even strong proprietary models struggle to accurately ground their predictions, exposing a fundamental discrepancy between answer correctness and faithful evidence localization. To bridge this gap, EG-Reasoner, an evidence-grounded reasoning model trained with explicit supervision, is proposed. State-of-the-art performance is achieved among open-source models, with results competitive against proprietary systems, particularly pronounced gains are observed on reasoning-intensive tasks such as counterfactual questions. These findings demonstrate that scaling alone is insufficient for robust video understanding and that structured evidence supervision is essential for the development of more reliable and interpretable VideoQA systems.

[CV-12] Pocket-SLAM: Rendering-Area-Aware Pruning for Memory-Efficient 3DGS-SLAM ICRA

链接: https://arxiv.org/abs/2606.24796
作者: Leshu Li,Jie Peng,Yang Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2026 IEEE International Conference on Robotics and Automation(ICRA)

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has garnered significant attention in Simultaneous Localization and Mapping (SLAM) due to its advances in capturing fine-grained geometry features and synthesizing novel views. For SLAM in large-scale scenes, such as autonomous driving, 3DGS-SLAM faces a critical limitation: memory consumption increases continuously over time as Gaussian points accumulate, leading to poor memory efficiency and limiting its applicability. In this work, we propose a rendering-area-aware pruning strategy that selectively removes Gaussians based on their contribution to the effective rendering area, rather than solely relying on Gaussian-level heuristics such as opacity or gradient magnitude. This perspective directly targets the sources of memory redundancy, effectively reducing the peak memory footprint of 3DGS-SLAM during runtime. Evaluations on the EuRoC and KITTI datasets demonstrate that our method consistently outperforms existing pruning approaches in large-scale outdoor scenes, achieving over 60% memory reduction and more than 2 times FPS improvement while preserving localization and mapping accuracy. These results highlight rendering-area-aware pruning as a promising direction for scaling 3DGS-SLAM to real-world autonomous driving scenarios. Our code is publicly available at this https URL.

[CV-13] Counting Trees from Satellite Imagery with Noisy Supervision

链接: https://arxiv.org/abs/2606.24786
作者: Dimitri Gominski,Maurice Mugabowindekwe,Qiue Xu,Xiaowei Tong,Martin Brandt,Hieu Le,Rasmus Fensholt,Dimitris Samaras,Loic Landrieu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Counting individual trees is a fundamental task for environmental monitoring, yet remains largely unexplored with satellite imagery. At these resolutions, isolated trees may still be identifiable, but crown boundaries become ambiguous in dense forests, making the notion of an individual tree inherently ill-defined. Moreover, large-scale manual annotations of individual trees are prohibitively expensive. While scalable supervision can be derived from airborne LiDAR, the resulting annotations are noisy and difficult to exploit effectively. We address these challenges by formulating tree counting as a spatial density matching problem supervised through Unbalanced Optimal Transport. This formulation naturally accommodates both precise localization of isolate trees and robust density estimation in dense forests. We further introduce a self-correction mechanism that leverages transport residuals to progressively refine noisy supervision during training. We evaluate our approach on TinyTrees, a new benchmark spanning three continents and three satellite sensors, comprising over 215 million tree annotations (including 773K manually verified instances) across 23,000 this http URL. Our method consistently outperforms detection-based, regression-based, and transport-based distribution-matching baselines, demonstrating the effectiveness of unbalanced transport and reliability-aware supervision for large-scale tree counting from satellite imagery. Code, data and models are available at this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.24786 [cs.CV] (or arXiv:2606.24786v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.24786 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-14] AerialFusionMapNet: Online HD Map Construction with Aerial-Onboard BEV Fusion ITSC

链接: https://arxiv.org/abs/2606.24784
作者: Daniel Lengerer,Mathias Pechinger,Klaus Bogenberger,Carsten Markgraf
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026

点击查看摘要

Abstract:High-resolution aerial imagery has recently emerged as a complementary modality for automated driving perception and has shown potential to improve birds-eye-view (BEV) scene understanding when fused with onboard sensors. Prior work demonstrated performance gains for online high-definition (HD) map construction through aerial-onboard fusion; however, conventional end-to-end fusion does not fully exploit the structural information contained in aerial representations. In this work, we introduce AerialFusionMapNet, a fusion-based mapping framework with a structured two-stage training strategy that explicitly enhances the contribution of aerial features within a unified pipeline. The proposed training scheme enables more effective integration of structural aerial priors. On the nuScenes geographic split, AerialFusionMapNet achieves up to 54.7 mAP, improving over prior aerial-onboard fusion baselines from 48.8 mAP by +5.9 absolute and +12.1% relative. The results suggest that structured training design, rather than increased architectural complexity, plays a more decisive role in unlocking the full potential of aerial imagery for online HD map construction. Code and trained models are available at this https URL.

[CV-15] Revealing Training Data Exposure in Vision Language Large Models via Parameter Gradients

链接: https://arxiv.org/abs/2606.24774
作者: Zhihao Zhu,Hongyi Tang,Yi Yang,Ahmed Abbasi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Large Models (VLLMs) trained on massive crawled corpora raise pressing copyright and data-provenance concerns. These concerns are particularly acute in healthcare, where patient medical images paired with clinical reports demand rigorous privacy safeguards. However, existing training data detection methods either fail in cross-modal scenarios or rely on superficial output signals with insufficient discriminative power. We introduce GradAudit, a gradient-based auditing framework that examines internal optimization dynamics rather than treating VLLMs as black boxes. Our approach builds on a key observation: model parameters converge to regions where gradients on training samples become stable and well-aligned, whereas gradients on non-training samples remain noisy and inconsistent. By analyzing these gradient signatures, GradAudit achieves strong separability and detects genuine image-text associations learned during training, not merely individual modality membership. Empirically, across both medical and general-domain datasets, GradAudit substantially outperforms state-of-the-art baselines in both pretraining and fine-tuning VLLMs. In a case study employing copyrighted content, we show that existing training data detection methods not only underestimate the extent of unauthorized data usage, but that this underestimation becomes more pronounced as models become more recent and more advanced.

[CV-16] nsorion: A Tensor-Aware Generalization of the Muon Optimizer

链接: https://arxiv.org/abs/2606.25975
作者: Vladimir Bogachev,Vladimir Aletov,Alexander Molozhavenko,Sergei Kudriashov,Maxim Rakhuba
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Common first-order optimizers, such as Adam, implicitly treat each parameter block as an unstructured vector, which disregards the multilinear weight structure present in many modern machine learning models. Recent work has shown that exploiting matrix structure can improve optimization dynamics. A notable example is Muon, which performs steepest descent under the spectral norm constraint. We take the next step and introduce Tensorion, a tensor-aware optimizer that extends Muon’s constrained optimization perspective from matrices to higher-order tensors. Tensorion is built around a linear minimization oracle (LMO) over a tensor norm ball. The norm is carefully chosen to balance two objectives: tightly bounding the tensor spectral norm, while still keeping the LMO tractable. This LMO becomes computable because it reduces to operations on adaptively selected unfolding matrices. Notably, when restricted to order-2 tensors (i.e., matrices), Tensorion recovers Muon exactly. Experiments on tensor-based computer vision problems suggest that Tensorion can offer improved convergence behavior and more stable gradient updates compared with Adam-based and existing tensor-aware baselines in the evaluated settings.

[CV-17] A Benchmark for Heterogeneous Stereo Deblurring with Physically- and Epipolar-constrained Cross Attention

链接: https://arxiv.org/abs/2606.25962
作者: Hoju Shin,Jiah Kim,Seung-Wook Kim,Seowon Ji
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern stereo-capable smartphones enable immersive XR content capture. However, hardware heterogeneity across camera modules often causes severe asymmetric blur artifacts. Existing methods and benchmarks largely assume homogeneous stereo setups and therefore do not explicitly address such asymmetric degradation. To bridge this gap, we present a dedicated framework for heterogeneous stereo deblurring. First, we introduce the heterogeneous stereo deblurring (HSD) dataset, constructed from real smartphone stereo captures via multi-frame integration. Second, we propose physically- and epipolar-constrained cross attention (PECA), a lightweight module that restricts cross-view matching to an epipolar search window bounded by a optics-derived disparity upper bound. By enforcing physically valid disparity constraints, PECA enables efficient and reliable cross-view feature fusion. Moreover, our confidence-weighted attention with residual fusion emphasizes cross-guided deblurring when correspondences are reliable, while naturally falling back to self-deblurring in occluded or unreliable regions. PECA is architecture-agnostic and consistently improves CNN-, Transformer-, and NAFNet-based baselines. Extensive experiments on HSD show that PECA-enhanced models achieve improved restoration performance with favorable efficiency.

[CV-18] Pulmonary Embolism Risk Stratification from CTPA and Medical Records: Vascular Graphs Are Not All You Need MICCAI2026

链接: https://arxiv.org/abs/2606.25956
作者: Nathan Painchaud,Tristan Habémont,Morgane des Ligneris,Allan Serva,Pierre Croisille,Laurent Bertoletti,Thomas Lampert,Johannes F. Lutzeyer,Odyssée Merveille
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 8 1/2 pages + 2 pages of references. Accepted for MICCAI 2026. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in, and available online at, the external reference provided below

点击查看摘要

Abstract:Risk stratification for pulmonary embolism (PE) is critical for clinical decision-making. Stratification guidelines are based on patient medical records, parameters measured from computed tomography pulmonary angiography (CTPA), and blood tests. However, blood tests are often missing in routine practice. This work studies whether state-of-the-art models can accurately classify risk stratification from only medical records and biomarkers extracted from CTPA images. We benchmark different approaches to combine medical records and cardiac biomarkers with rich pulmonary vascular information; we add vascular biomarkers to tabular models and apply graph neural networks (GNNs) on the vascular tree’s intrinsic graph representation. We use a private dataset (n=353) with uniquely complete data for PE risk stratification. Our results show that, among global features, medical records and cardiac biomarkers are the most significant predictors, while vascular biomarkers do not further improve stratification. Even more surprising, even GNNs on vascular graphs fail to outperform strong tabular baseline on global features. We consider hypotheses, on both models and data, that could explain this suboptimal performance. Our investigation suggests that, counter-intuitively, vascular graphs might hold no discriminative information for PE risk stratification. Code is available from this https URL.

[CV-19] DSP-SLAM: A Unified Framework for Multi-Class High-Fidelity Object SLAM in the Wild

链接: https://arxiv.org/abs/2606.25953
作者: Ahmad Kourani,Ghina Daoud,Daniel Asmar,Imad Elhajj
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 9 figures

点击查看摘要

Abstract:Existing object-aware SLAM systems force a trade-off between real-time performance, multi-class support, and the generation of high-fidelity, semantically coherent object models. To address this trade-off, we present DSP-SLAM++, which extends the DSP-SLAM framework with an asynchronous mapping pipeline for real-time performance and dedicated sensor fusion adaptations for a monocular fisheye-LiDAR suite. Experiments demonstrate that our system generates fine-grained, geometrically-complete shapes for multiple object classes while eliminating severe mapping thread bottlenecks by reducing maximum object processing latency by up to 70% compared to the state-of-the-art baseline, enabling robust, real-time performance on a challenging 25 Hz multi-class datasets. This work makes high-fidelity, multi-class object SLAM more practical for real-world applications like autonomous driving and robotic manipulation by enabling its use on platforms with common fisheye-LiDAR sensor setups. The open-source code is available at: [this http URL].

[CV-20] FunPiQ: A New Benchmark for Pixel-Level Quality Assessment in Fundus Images MICCAI2026

链接: https://arxiv.org/abs/2606.25915
作者: Pengwei Wang,José Morano,Virginia Mares,Hrvoje Bogunović
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2026 main conference. Our code, weights, and dataset are available at this https URL

点击查看摘要

Abstract:Color fundus photography (CFP) is the most common ophthalmic imaging modality for large-scale screening. However, it is highly susceptible to degradations, making robust fundus image quality assessment (FIQA) crucial. The criteria for what constitutes high-quality at the image level vary across clinical tasks, making FIQA dependent on expert knowledge. This motivated the development of automated methods and datasets. While existing datasets aim to standardize image-level quality, their criteria often differ. Furthermore, image-level labels preclude the quantitative evaluation of localized degradations, which is essential for trustworthy FIQA. We argue that pixel-level FIQA based on anatomical visibility represents a more task-agnostic, explainable approach. In this work, we introduce FunPiQ, the first FIQA benchmark to provide pixel-level quality annotations. In addition, we propose EFIQA-CP, an explainable-by-design (EBD) method that uses quality pseudo-labels based on anatomical visibility to train a CNN via Non-Negative Positive-Unlabeled learning. Extensive evaluations of classification methods with post-hoc explanations, anomaly detection methods, and EBD methods demonstrate the superior performance of the last and, particularly, of EFIQA-CP.

[CV-21] In-context Region-based Drag : Drag Any Region to Any Shape ECCV2026

链接: https://arxiv.org/abs/2606.25907
作者: Jiacheng Sui,Tianyu Hao,Bingjie Gao,Li Niu,Guangtao Zhai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026. Dataset, code, and model are available at this https URL

点击查看摘要

Abstract:Diffusion models have shown promise in drag-style editing. Previous works mainly focus on point-based drag, which is inherently ambiguous. This paper focuses on region-based drag and introduces a novel In-Context Region-based Drag (ICRDrag) method. Under the in-context learning framework, ICRDrag consumes a source image, a source region mask, and a target region mask, producing the target dragged image. Built upon the basic in-context learning model, we introduce two novel attention regularization: 1) image-mask attention consistency to ensure that a target region attends to similar source regions for image and mask modalities; 2) source-target attention correspondence to ensure the mutual correspondence between source and target regions. To facilitate region-based drag, we also construct Paired Region Dataset (PRD), a large-scale dataset with paired masks and images. Extensive experiments show that ICRDrag significantly outperforms existing methods in both quantitative metrics and user studies, achieving superior editing accuracy and visual fidelity. The dataset, code, and model are available at this https URL.

[CV-22] OracleAnalyser: Analysing Implicit Semantics of Oracle Bone Scripts through MLLM s with Post-training

链接: https://arxiv.org/abs/2606.25906
作者: Zijia Song,Yelin Wang,Zhengyi Ma,Zitong Yu,Tianheng Wang,Jiahuan Zhang,Taorui Wang,Kaicheng Yu
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:With the advancement of artificial intelligence, research on oracle bone scripts has entered a new era. However, existing methods and benchmarks remain largely confined to recognition tasks, overlooking the equally crucial aspect of oracle bone analysis. To address this gap, we propose OracleAnalyser, a reasoning framework for oracle bone analysis based on post-training techniques. Specifically, we fine-tune Qwen2.5-VL-3B-Instruct through multiple post-training stages and introduce a new preference optimization algorithm, Stable Focal Preference Optimization (SFPO), tailored to the characteristics of oracle bone datasets. In addition, we release both an oracle bone reasoning dataset and an oracle bone preference dataset, and further construct a new benchmark to evaluate models’ analytical capabilities for oracle bone scripts. Extensive experiments validate the superior analytical performance of OracleAnalyser, which achieves remarkable results with only 3B parameters, surpassing models with substantially larger scales.

[CV-23] SurgAtlas: A Large-Scale Surgical Video-Language Dataset with 2391 Hours of Open and Minimally Invasive Surgery

链接: https://arxiv.org/abs/2606.25905
作者: Filippos Bellos,Andre S. Gala-Garza,Miaowei Wang,Alyssa M. Hardin,Ahmad M. Hider,Yayuan Li,Jing Bi,Susan Liang,Chenliang Xu,Donald S. Likosky,Jason J. Corso
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce SurgAtlas, the largest surgical video-language dataset to date, comprising 15,291 videos (2,391 hours) spanning 18 surgical specialties and over 5,000 procedure types, sourced entirely from publicly available YouTube content. SurgAtlas is also the first surgical video-language dataset to include open surgery at scale, with 6,182 open procedure videos alongside over 9,000 minimally invasive recordings, and the first to establish standardized benchmarks for open-surgery video understanding. We additionally provide an expert-validated subset with verified visual question-answer pairs across diverse open and minimally invasive procedures, serving as a clinically grounded benchmark for surgical reasoning. Compared with existing surgical video-language datasets, SurgAtlas provides one of the most diverse annotation schemas, combining segment-level captions, step- and phase-level descriptions, video-level surgical descriptions, and reasoning-oriented question-answer pairs organized within a hierarchical taxonomy. These annotations are constructed through an automated multi-tier pipeline with LLM-based enrichment and a staged VQA generation framework with explicit groundedness verification. The scale and diversity of SurgAtlas enable training surgical foundation models with broad procedural coverage: we finetune Qwen3-VL-8B through a two-stage captioning-then-instruction pipeline and achieve competitive or state-of-the-art results on multiple established surgical benchmarks, including phase recognition, triplet detection, and reasoning question answering. More broadly, SurgAtlas provides a large native public video corpus that can support future large-scale pretraining of multimodal surgical AI systems and contribute to the development of next-generation foundation models for surgery.

[CV-24] Enhancing Brain MRI Anomaly Detection and Reasoning with ROI Rethink and Synthetic Data

链接: https://arxiv.org/abs/2606.25894
作者: Shangkun Li,Jie Xu,Yi Guo,Zeju Li,Yuanyuan Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical vision-language models typically generate diagnoses through single-pass inference without indicating which image regions support their conclusions. This lack of spatial grounding limits clinical utility: outputs cannot be audited, and models may hallucinate findings on normal scans. We present BrReMark (Brain Rethink via ROI Marking), a framework that introduces explicit region marking into brain MRI diagnosis. The model first generates hypotheses about potential abnormalities and grounds them through explicit bounding box marking, then verifies conclusions by re-examining the marked evidence. Training combines supervised fine-tuning on structured reasoning trajectories with reinforcement learning using a composite reward over localization accuracy and diagnostic reasoning. Furthermore, we integrate a domain randomization-based pathology synthesis augmentation strategy to improve the model’s generalizability to out-of-distribution (OOD) data. On internal benchmark, BrReMark improves mAP50 from 0.74% to 37.54% compared to the base model, while achieving 21.57% Clinical F1 and 45.26% diagnostic accuracy. On NOVA OOD benchmark, it also achieves competitive overall performance with a 45.7% reduction in false positives compared to the state-of-the-art, indicating reduced hallucination on rare pathologies. These findings suggest that explicit hypothesis-verification grounding is a practical path toward trustworthy open-ended brain MRI diagnosis across both in-distribution and OOD settings.

[CV-25] USS: Unified Spatial-Semantic Prompts for Embodied Visual Tracking with Latent Dynamics Learning

链接: https://arxiv.org/abs/2606.25880
作者: Yuchen Xie,Xinyu Zhou,Kuangji Zuo,Yanshuo Lu,Fengrui Huang,Boyu Ma,Jianfei Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Embodied Visual Tracking (EVT) requires an agent to continuously follow a specified target while actively moving through dynamic environments. However, prevailing EVT paradigms predominantly rely on language-based target indication. While language is expressive and convenient, cluttered scenes often contain multiple objects that satisfy the same semantic description, leading to ambiguous target grounding. We therefore propose a paradigm shift, reframing target indication in EVT from text-only specification to unified spatial-semantic prompting. Based on this paradigm, we introduce Unified Spatial-Semantic Prompts for Embodied Visual Tracking with Latent Dynamics Learning, USS, an end-to-end embodied tracking framework that supports text, point, bounding box, and mask prompts within a unified architecture. USS encodes heterogeneous prompts with modality-specific encoders, fuses prompt tokens with visual features through hybrid attention, and decodes compact prompt-conditioned representations into egocentric waypoints. To further improve temporal robustness, USS incorporates a latent world model that predicts future representations through self-supervised alignment. Real-robot experiments demonstrate that explicit spatial target cues yield higher success rates than text-only prompts, particularly in scenarios involving similar distractors and longer-horizon tracking where maintaining instance-level target identity is critical. In the simulation benchmark, USS also achieves state-of-the-art performance among non-MLLM-based methods and competitive results against recent MLLM-based approaches with faster inference speed. Our findings reveal that spatial-semantic prompting provides a more precise and flexible target indication interface for embodied visual tracking. Project site: this https URL.

[CV-26] ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering ECCV2026

链接: https://arxiv.org/abs/2606.24602
作者: Zhentao Guo,Chen Duan,Tongkun Guan,Zining Wang,Kai Zhou,Pengfei Yan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV2026

点击查看摘要

Abstract:Despite remarkable progress in multimodal understanding, current MLLMs still exhibit limitations in video text understanding, particularly when semantics emerge through the integration of temporally distributed textual cues across multiple frames. This perception challenge fundamentally differs from static image text understanding, yet existing datasets fail to capture: the vast majority of questions remain answerable from single frames, inadequately reflecting real-world video text comprehension demands. To address this, we present ViTexQA, a large-scale video-text QA dataset, and FrameThinker for robust multi-frame temporal reasoning. We build ViTexQA via a quality-controlled Chain-of-Thought (CoT) annotation pipeline boosted with temporal constraints; all its QA pairs demand cross-frame text fusion to solve, enforcing true temporal reliance. FrameThinker adopts two-stage training for explicit temporal modeling: CoT-Guided Supervised Fine-Tuning (SFT) generates frame-aware reasoning chains, followed by Temporally-grounded Reinforcement Learning (RL) optimized with multi-frame coherence rewards. Evaluations show our method outperforms SOTA baselines on ViTexQA, lifting ROUGE-L by 6.3%.

[CV-27] EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics

链接: https://arxiv.org/abs/2606.24586
作者: Nahuel Gonzalez,Marta Robledo-Moreno,Ivan DeAndres-Tame,Ruben Vera-Rodriguez,Ruben Tolosana
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate (EER). This paper introduces EERLoss: a subdifferentiable, arbitrarily accurate approximation to EER for training deep biometric models. Furthermore, this framework has the potential to be adapted to optimize any specific operating point on the DET curve, enhancing its generalizability. To validate this approach, EERLoss is evaluated on a particularly demanding behavioral biometric modality: keystroke dynamics verification. This task is characterized by its high intra-class and low inter-class variability. Experiments are conducted on the large-scale KVC-onGoing benchmark, incorporating data from over 185,000 subjects across different scenarios. A comprehensive ablation study initially demonstrates the superiority of EERLoss in comparison to existing state-of-the-art loss functions. It also converges substantially faster compared to other losses, reducing the overall training cost. Additionally, a comparison is made between the proposed loss and the KVC-winning architecture by re-training it with EERLoss, demonstrating that the proposed approach significantly outperforms the original SoTA, achieving a relative EER reduction of up to approx. 30%. This improvement on a challenging, large-scale benchmark validates the effectiveness of EERLoss as a task-aligned training objective specifically suited for high-variance biometric traits.

[CV-28] Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning

链接: https://arxiv.org/abs/2606.24570
作者: Julien Khlaut,Charles Corbière,Baptiste Callard,Amaury Prat,Leo Butsanets,Antoine Saporta,Théo Danielou,Leo Machado,Korentin Le Floch,Tom Boeken,Pierre Manceron,Corentin Dancette
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language contrastive pretraining has become the dominant recipe for 3D medical foundation models, leveraging the large volumes of paired scans and reports produced in clinical practice. However, medical images usually span dozens of organs, and radiological reports are much longer than typical natural image captions and are composed of multiple structured sections. CLIP-style pretraining compresses this structure by encoding each modality into a single global token, at the risk of losing important details. We introduce ConQuer (Concept Queries), an image-text pretraining method that augments CLIP’s global alignment with a set of localized alignments, one per concept. ConQuer splits the report into concept-specific sections and learns cross-attention queries that pool the matching image features without using any segmentation mask or spatial supervision. Contrastive learning is then applied independently for each concept. Concepts can be any unit of semantic localization; here, they are anatomical regions, one query per organ or gross body region. As a byproduct, each query learns attention maps focused on its concept, providing built-in spatial interpretability. We use ConQuer to train Jolia, a 3D CT foundation model on chest and abdominal CT. Jolia consistently outperforms a CLIP baseline on findings classification, report generation, and cross-center transfer, and sets a new state of the art across multiple public benchmarks. Jolia’s weights are available at this https URL

[CV-29] Multilevel Stochastic Plug-and-Play for Sparse-View CT Reconstruction

链接: https://arxiv.org/abs/2606.24567
作者: Antoine De Paepe,Alexandre Bousse,Dimitris Visvikis
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 12 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Sparse-view computed tomography (SVCT) reduces radiation exposure and acquisition time, but the limited number of projection views makes the reconstruction problem severely ill-posed and leads to streak artifacts when analytical methods are used. Plug-and-Play (PnP) methods provide an effective way to combine data fidelity with learned image priors, while stochastic PnP methods further improve robustness by matching the denoiser input distribution through re-noising. However, these methods often require many iterations to converge, which limits their practical efficiency. In this work, we propose a multilevel (ML) stochastic PnP method for SVCT that accelerates stochastic PnP reconstruction. We highlight that, in the stochastic setting, directly enforcing prior coherence across levels would require accurately estimating fine-level prior gradients through multiple denoiser function evaluations, which substantially increases the computational cost. Motivated by this observation, we perform the multilevel steps in multiresolution analysis (MRA) approximation spaces. This choice is supported by the structure of the wavelet decomposition, which causes the prior-coherence correction to vanish in expectation, thereby avoiding costly estimation of fine-level stochastic prior gradients for the coarse-level corrections. Experiments on SVCT reconstruction show that our method, called Multilevel Stochastic Plug-and-Play (ML-SPnP), achieves reconstruction quality comparable to state-of-the-art methods while substantially reducing runtime.

[CV-30] PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments

链接: https://arxiv.org/abs/2606.24564
作者: Zhenyang Li,Lutao Jiang,Yizhou Zhao,Ying-Cong Chen,Xin Wang,Weikai Chen,Yifan Peng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Reconstructing realistic, physically plausible garments from a single image remains a fundamental challenge. Template-free methods capture surface geometry but lack explicit sewing structure for simulation; while programmatic systems are simulation-ready but constrained by predefined templates. This reveals a fundamental representation gap between geometric reconstruction and structured garment construction. We present PatternGSL, a structured garment representation in the form of a template-free and learnable specification language that encodes complete sewing patterns, including panel boundaries, parameterized seams, and explicit stitch topology, in a compact and standardized form. PatternGSL preserves the physical rigor of pattern-based models while removing template dependence, elevating sewing structure as a first-class target for generative modeling. We further propose a vision-language framework that predicts PatternGSL specifications directly from a single image and decodes them into garments using lightweight deterministic validity handling, without optimization-based refinement or manual cleanup. In addition, we introduce PatternGSLData, the first large-scale image-to-GSL paired dataset comprising 300K samples with complete sewing pattern annotations, enabling supervised VLM training for structured garment reconstruction. Experiments demonstrate improved pattern accuracy over prior baselines, explicit sewing-structure recovery, reliable cloth simulation, and pattern-level editing through the same deterministic decoding pipeline. Code and data-processing scripts will be released at this https URL.

[CV-31] Quantum CT via Dynamic Interval Encoding and Prior-Balanced QUBO Reconstruction

链接: https://arxiv.org/abs/2606.24561
作者: Ao Wang,Yikuang Yuluo,Yujie Liu,Shuangyang Zhong,Yuwen Zhang,Zihao Wang,Fenglin Liu,Andreas Maier,Haijun Yu,Yixing Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 10 figures

点击查看摘要

Abstract:Quadratic unconstrained binary optimization (QUBO)-based quantum computed tomography (CT) casts reconstruction as a binary quadratic problem for quantum annealing and hybrid quantum–classical solvers. For grayscale CT, however, image encoding is constrained by the binary-variable budget: fixed global bit-plane encodings increase QUBO size and coupling complexity as gray-level precision improves, whereas low-bit encodings introduce quantization error. We propose a QUBO-based grayscale CT reconstruction framework that combines dynamic interval encoding with prior-balanced optimization. Each refinement round encodes active pixels only within local gray-level intervals around the current estimate, and a boundary-hit-guided update rule adaptively switches between search expansion and local refinement. To improve optimization stability, the method balances projection-domain data consistency and an edge-preserving quadratic prior before forming the final QUBO. Sparse-view and limited-angle fan-beam CT experiments show that the proposed method recovers structures and gray-level distributions more faithfully than the evaluated analytic, iterative, variational, and representation-based baselines. Expressivity analysis and ablation studies further indicate that the improvement mainly arises from effective gray-level representation through dynamic local encoding and more stable data-fidelity–prior coupling. Experiments on the D-Wave hybrid binary quadratic model (BQM) solver further demonstrate that the formulation is executable on a hardware-backed hybrid quantum–classical backend.

[CV-32] Heterogeneous Knowledge Distillation via Geometry Decoupling and Momentum-Aware Gradient Regulation

链接: https://arxiv.org/abs/2606.24557
作者: Wuming Yang,Xiang Zhang,Hongmin Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Heterogeneous Knowledge Distillation (HKD) aims to transfer knowledge across varying architectures (e.g., from Transformer to CNN) but inherently suffers from severe training instability. We reveal that this instability stems from two highly coupled challenges: massive feature norm discrepancies that cause optimization drag, and severe gradient conflicts between the primary and distillation objectives arising from distinct inductive biases. To achieve stable distillation, we propose SPOFA, a framework built upon a novel Feature and Gradient Dual Stabilization mechanism. Specifically, at the feature level, we introduce a LayerNorm-based decoupling projector that explicitly decouples feature magnitude from direction, creating a bounded and stable space for semantic alignment. At the gradient level, we propose a momentum-driven Exponential Moving Average (MEMA) dynamic scaler. By establishing a robust historical baseline of the optimization trajectory, MEMA actively evaluates instantaneous gradient conflicts and adaptively penalizes harmful distillation signals, guaranteeing stable convergence. Importantly, SPOFA achieves this dual stabilization with an extremely lightweight parameter footprint. Extensive experiments on two mainstream benchmarks demonstrate that SPOFA achieves state-of-the-art accuracy, significantly outperforming computationally expensive methods while introducing only minimal computational overhead compared to standard baselines.

[CV-33] Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

链接: https://arxiv.org/abs/2606.24548
作者: Jiayi Lei,Yuandong Pu,Xingyu Han,Rongpeng Zhu,Jing Xu,Jinyao Wang,Zijian Zhou,Bin Fu,Yuewen Cao,Yihao Liu,Yongsheng Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures. Project page: this https URL

点击查看摘要

Abstract:Text-to-image (T2I) generation models have achieved remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspired by Russell’s inductivist turkey, we introduce Counterfactual-World (CF-World), a counterfactual benchmark designed to investigate whether text-to-image models can generate images under rules that systematically contradict real-world priors. CF-World organizes each scenario into three progressive levels: factual generation under ordinary world knowledge, explicit counterfactual generation with direct visual instructions, and implicit counterfactual generation requiring causal deduction from altered rules. We evaluate both open-source and closed-source T2I models using a Vision Language Model (VLM)-based evaluator (CF-Eval). Furthermore, we introduce two metrics: Prior Resistance Rate (PRR), which measures a models’ ability to overcome entrenched real-world priors, and Reasoning Retention Rate (RRR), which assesses whether models can maintain reasoning-dependent counterfactual generation without explicit visual cues. Experiments show that all models exhibit sharp degradation from factual to counterfactual settings. Further analyses suggest that these failures arise because current T2I models encode world knowledge and visual appearances as tightly coupled patterns. Consequently, their heavy reliance on frequent visual co-occurrences within the training data forces them to default to familiar commonsense priors when tasked with rendering counterfactual worlds.

[CV-34] PointVG-R: Internalizing Geometric Reasoning in MLLM s for Precise Pointing Localization via Visual Chain of Thought

链接: https://arxiv.org/abs/2606.24539
作者: Ling Li,Bowen Liu,Zinuo Zhan,Jianhui Zhong,Ziyu Zhu,Bingcai Wei,Kenglun Chang,Zhidong Deng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pointing-based visual grounding requires models to precisely locate target objects by deciphering complex spatial relationships between the visual scene and pointing gestures. Traditional methods typically encode input images into static feature representations and perform reasoning primarily within the linguistic domain, often overlooking the rich perceptual cues and explicit spatial geometry inherent in images. In this study, we aim to mitigate the cognitive vulnerability of models in interpreting gestural spatial relations by proposing PointVG-R, a reasoning-guided Multi-modal Large Language Model (MLLM). PointVG-R introduces geometric-aware reasoning for pointing-based grounding, enabling the model to think with images through the strategic integration of Reinforcement Learning (RL) and cold-start data. Specifically, we design a novel geometric reasoning pipeline that simulates the iterative cognitive process humans employ when interpreting pointing gestures. Furthermore, we construct EgoPoint-CoT, a high-quality visual Chain-of-Thought (CoT) dataset featuring detailed reasoning trajectories to guide the model via Supervised Fine-Tuning (SFT) and RL. To address the varying quality of learning signals encountered during training, we further propose an Adaptive Importance Weighting strategy based on Group Variance, which dynamically adjusts reward signals to optimize the learning process. Experimental results demonstrate that PointVG-R achieves SOTA performance, outperforming the baseline by \textbf15.86 points in mIoU. Extensive ablation studies further validate the efficacy of our proposed modules. Code: this https URL.

[CV-35] ForensicsTok: Forensics-Guided Tokenized Modeling for Image Tampering Localization

链接: https://arxiv.org/abs/2606.24538
作者: Lei Xu,Haowei Wang,Shen Chen,Taiping Yao,Bin Li,Changsheng Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 4 figures, 8 tables

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) offer powerful reasoning for forensic tasks, yet existing approaches utilizing exogenous segmentation decoders often suffer from suboptimal localization. The reliance on stitched pipelines introduces information bottlenecks during backpropagation, which dilutes spatial signals and is limited by semantic priors of the segmentor. To address these limitations, we propose ForensicsTok, which reformulates image manipulation localization as an autoregressive sequence generation task. ForensicsTok directly generates spatially grounded token sequences, enabling precise mask prediction without intermediary supervision. Specifically, we introduce a Token Splatting Decoder (TSD) to map tokens to binary masks via codebook-aware code smoothing, which mitigates sharp gradients from deterministic detokenizers. Furthermore, to capture diverse tampering clues, we propose a Hierarchical Expert Fusion (HEF) module that injects multi-scale features from a forensic expert model. This unified architecture effectively compensates for the lack of forensic priors in standard MLLMs. Extensive experiments on six benchmarks show that ForensicsTok substantially improves over existing MLLM-based baselines and slightly improves over strong forensic expert baselines, while exhibiting stronger robustness to perturbations.

[CV-36] VisCritic: Visual State Comparison as Process Reward for GUI Agents ECCV2026

链接: https://arxiv.org/abs/2606.24525
作者: Jiachen Qian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 4 figures; ECCV 2026 submission; supplementary material uploaded as ancillary file

点击查看摘要

Abstract:GUI agents powered by vision-language models show strong potential for automating digital tasks, yet frequently fail in long-horizon scenarios due to the absence of step-level verification. Existing process reward models verify actions through textual reasoning alone, missing the visual nature of GUI state changes. We introduce VisCritic, a visual process reward framework that verifies agent actions by directly comparing pre-action and post-action screenshots in visual feature space. VisCritic employs a Siamese vision transformer to extract change-aware representations, coupled with an Action-Aware Critic Head that jointly evaluates action success, task progress, and error type. A critic-training data construction pipeline generates weakly supervised samples from existing trajectories without additional human labels for critic training. Experiments and offline analyses across five benchmarks demonstrate that VisCritic serves as a plug-and-play enhancement for diverse GUI agents, generally improving benchmark metrics while providing visual diagnostic cues.

[CV-37] What Do Flow-Based Inverse Solvers Approximate? A Posterior-Transport View

链接: https://arxiv.org/abs/2606.24516
作者: Jian Xu,Delu Zeng,John Paisley,Qibin Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A growing family of training-free solvers – FlowDPS, FLOWER, PnP-Flow and their diffusion ancestors (DPS, DAPS) – repurpose a pretrained flow-matching prior to solve imaging inverse problems by adding a measurement-guidance term to the deterministic probability-flow ODE. Despite strong empirical results, what these per-step corrections actually approximate – and how far the resulting samples are from the true posterior p(x\mid y) – has not been characterized. We give a posterior-transport account of flow-based inverse problem solving. Our starting point is a simple but consequential fact: for a \emphdeterministic flow prior, Bayesian conditioning is realized entirely by a \emphreweighting of the source distribution, not by a drift correction; pushing the reweighted source through the \emphunmodified velocity field yields exact posterior samples. From this we show that trajectory-guidance solvers can be read as the minimum-kinetic-energy \emphcorrection field needed to morph the unconditional source into the posterior, and that FlowDPS / FLOWER / PnP-Flow correspond to distinct zeroth-order / Gaussian / proximal approximations of this single object; we bound the resulting posterior bias in Wasserstein distance. A controlled 2 D study with a closed-form posterior confirms the theory decisively: source reweighting matches the true posterior to the Monte-Carlo floor on every metric, whereas trajectory guidance incurs 200 – 800\times larger error and collapses posterior modes, \emphregardless of guidance strength. Guided by the analysis we propose a cheap, principled velocity-correction solver that is competitive across two in-domain priors (AFHQ, CelebA) and two out-of-distribution settings while, unlike point-estimate source-space optimizers, producing diverse posterior samples with uncertainty that correlates with reconstruction error.

[CV-38] What Does the Brain See? Multiview Neural Representations to Demystify the Brain-Visual Alignment

链接: https://arxiv.org/abs/2606.25718
作者: Salini Yadav,Taveena Lotey,Pravendra Singh,Partha Pratim Roy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot visual decoding from electroencephalography (EEG) aims to infer visual semantics from non-invasive neural recordings, but remains challenging due to the low signal-to-noise ratio, non-stationarity, and limited spatial resolution of EEG. Existing EEG-vision alignment methods often rely on holistic EEG embeddings, which can obscure the complementary temporal, spectral, and spatial structure underlying visual perception. We introduce a unified multiview EEG representation learning framework for aligning brain responses with visual semantic embeddings. Our method builds an EEG encoder that jointly models three complementary views: input-conditioned state-space temporal dynamics, learnable wavelet-based spectral decomposition for sample-adaptive frequency modeling, and attention-modulated graph learning for structured electrode interactions. The resulting multiview EEG embeddings are fused and aligned with pretrained visual representations in a shared semantic space using contrastive learning with EEG-specific regularization, enabling 200-way zero-shot visual classification. Experiments on THINGS-EEG benchmark show that our method achieves state-of-the-art performance, with 54.8% Top-1 and 85.6% Top-5 accuracy in the within-subject setting and 15.3% Top-1 and 45.4% Top-5 accuracy in the cross-subject setting. We further present the first systematic cross-session EEG-image decoding evaluation, achieving 40.8% Top-1 and 78.0% Top-5 accuracy. These results suggest that explicitly modeling multiview neural structure improves both semantic alignment and generalization in EEG-based visual decoding.

[CV-39] Falcon: Functional Assembly and Language for Compositional Reasoning in X-ray ECCV2026

链接: https://arxiv.org/abs/2606.25701
作者: Yonathan Michael,Mohamad Alansari,Natnael Takele,Andreas Henschel,Naoufel Werghi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV2026; Project Page: this https URL

点击查看摘要

Abstract:Conventional vision-language models are largely object-centric, focusing on detecting and describing individual entities. In safety-critical X-ray baggage screening, however, threat often emerges not from a single object but from the functional compatibility of spatially dispersed components, such as batteries, detonators, and explosive charges. We formalize this setting as \emphcompositional threat reasoning, where risk is modeled as a relational property of grounded regions rather than an independent detection outcome. We introduce \textbfFalcon, a multimodal framework that abstracts segmentation-aware region features into a structured safety state capturing component presence, pairwise functional compatibility, and scene-level risk. This structured representation is injected into the language model as an explicit intermediate interface, encouraging relationally consistent and safety-aware reasoning. To evaluate this problem, we present \textbfFalcon-X, a benchmark that unifies dense grounding with structured supervision over component completeness and risk inference in cluttered X-ray imagery. Experiments show that while existing multimodal models adapt to appearance, they struggle with compositional safety reasoning. Falcon improves functional grounding and produces more coherent threat assessments, establishing compositional safety reasoning as a distinct evaluation paradigm for multimodal systems.

[CV-40] owards a Dynamic and Fixed-budget Memory Bank for Efficient Streaming Video Understanding

链接: https://arxiv.org/abs/2606.25658
作者: Baiyang Song,Yuli Lin,Qiong Wu,Tao Chen,Jun Peng,Xiao Chen,Yiyi Zhou,Rongrong Ji
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Currently, streaming video understanding is still a daunting task for existing \emphmultimodal large language models (MLLMs). Its difficulties not only lie in handling the ever-increasing video frames, but also in the unpredictability of future video content and input instructions. In this paper, we study this task from the perspective of constructing a dynamic but fixed-budget memory bank, and propose a novel and training-free approach termed \emph\textbfCausalMem. CausalMem is dedicated to constructing a dynamic visual memory update mechanism, thereby maximizing the amount of information in streaming video within a limited memory space, much like the human brain. In practice, CausalMem estimates the redundancy of visual tokens and updates the memory bank via an online semantic basis, which models the principal semantics of the observed video stream. To validate CausalMem, we apply it to two representative MLLMs, namely LLaVA-OneVision and Qwen2.5-VL respectively, and conduct extensive experiments on both streaming and offline video understanding benchmarks. The experimental results not only show the great advantages than existing methods under both streaming and offline settings, \emphe.g., +3.2% and +3.0% average accuracy gains respectively, but also witness the superior semantic preservation for streaming videos, \emphe.g., using 12 k token budgets to memorize hour-long streaming videos, which achieves more than \textbf20 \times visual token compression ratio and only occupies about \textbf82 MB storage. \textbfOur code is given in \hrefthis https URLCausalMem.

[CV-41] Steering Vision-Language Models with Joint Sparse Autoencoders

链接: https://arxiv.org/abs/2606.25657
作者: Huizhen Shu,Xuying Li,Hongxu Lin,Wenjie Sun,Hui Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19pages,10 figures

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have shown promise for analyzing language models, but applying them to vision-language models (VLMs) often yields representations that are difficult to use as controllable cross-modal steering directions. We introduce the Joint Sparse Autoencoder (JSAE), which uses an explicit alignment constraint to jointly factorize sequence-pooled vision and language activations into shared, interpretable image/caption-level features. Applied to LLaVA, JSAE recovers cross-modal features for recognizable concepts (e.g., food and animals). Through bidirectional interventions (additive steering and suppression), we observe a layer-dependent asymmetry under our protocol: additive steering peaks at mid-to-late (pre-output) layers and weakens at both ends, whereas suppression scores remain within a comparable range across all probed layers within statistical noise. Experiments on three VLMs, namely LLaVA-v1.6-Mistral-7B, Llama3-LLaVA-8B, and the MoE-based Qwen3-VL-30B, show related layer-localized effects across architectures. Together, these results suggest that explicitly aligned sparse representations support more controllable intervention-based analysis of multimodal features, within an identifiable layer range, than the unconstrained alternatives tested here.

[CV-42] Auto-Labelling-Based Domain Transfer for 3D Object Detection on a Bicycle-Mounted LiDAR Platform

链接: https://arxiv.org/abs/2606.25652
作者: Mario Finkbeiner,Max A. Buettner,Kanak Mazumder,Fabian B. Flohr
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable 3D perception of vulnerable road users (VRUs) such as cyclists and pedestrians is essential for their safety in urban traffic and a core requirement for autonomous driving (AD). Alongside advances in vehicle-based perception, research increasingly equips bicycles with sensors to study traffic from a perspective native to VRUs. Such platforms still rely on LiDAR detectors originally trained on vehicle data, yet annotated 3D data from a cyclist’s perspective is scarce. How well these detectors generalise to this setting has not been evaluated. We present a 3D object detection benchmark of 1,027 annotated LiDAR keyframes (over 18,000 3D bounding boxes) from the FUSE-Bike platform in urban Munich. We evaluate four nuScenes-pre-trained detectors against 1,854 human-verified ground-truth (GT) boxes both in their original form and after finetuning on training labels produced by a VRU-dedicated auto-labelling pipeline that requires no manual annotation. The zero-shot domain gap is concentrated on the VRU classes. Finetuning recovers most of it, improving mean average precision (mAP) by up to 23.4 points with the largest gains on pedestrians and cyclists, and the adapted detectors even surpass the quality of the auto-labels they were trained on. The benchmark provides a reproducible baseline for VRU-centric 3D detection and shows that auto-labels are a viable substitute for manual annotation when adapting vehicle-trained detectors to a cyclist platform.

[CV-43] Calousel: Extrinsic Calibration of Non-overlapping Multi-camera Systems from Pure Rotation IROS2026

链接: https://arxiv.org/abs/2606.25646
作者: Gwanhyeong Song,Chaehyeon Song,Ayoung Kim
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IROS 2026. 8 pages, 7 figures

点击查看摘要

Abstract:Extrinsic calibration of multi-camera systems with non-overlapping FOVs has been a challenging problem in the robotics literature. Conventional target-based methods impose substantial target setup overhead, either deploying large calibration targets or requiring pre-measured multi-target poses. Motion-based approaches instead suffer from drift error, scale ambiguity, and motion degeneracy. Securing both accuracy and usability, we propose a novel calibration method that leverages pure rotational motion, requiring only a single static calibration board. The key idea is to make all cameras sequentially observe the same target under a shared geometric reference, even without overlapping views. To integrate these time-separated observations, we formulate the problem using a latent turntable frame and a 3D error on SE(3) within a global optimization framework. We validate the proposed method on both a controlled camera rig and a full-scale vehicle platform with heterogeneous cameras, and analyze robustness under non-ideal turntable motion. Extensive experiments show that our approach maintains competitive accuracy without specialized precision hardware, proving its strong suitability for realistic on-site deployments. Our code is publicly available here.

[CV-44] SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity ECCV

链接: https://arxiv.org/abs/2606.25634
作者: Tianchen Guo,Chen Liu,Ling Chen,Xin Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: European Conference on Computer Vision (ECCV). 32 pages, 10 figures. The code is available at: $ \href{ this https URL }{\text{SSMNBench}} $

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable progress in single-image perception, yet their ability to reason about complex cross-view human-centric scenes remains largely unverified. Current multi-view benchmarks evaluate models using a fixed “bag of frames” and thus conflate a model’s robustness to visual distraction with its genuine ability to fuse fragmented cross-view evidence. To address this issue, we introduce SSMNBench, a diagnostic benchmark comprising 3,300 curated QA pairs for cross-view human and human-object understanding. SSMNBench uniquely categorizes tasks into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN). By systematically perturbing view availability across 17 state-of-the-art MLLMs, critical limitations are revealed: models suffer from severe “distraction degradation” when presented with redundant views (SVS), and fail to integrate fragmented geometric evidence across cameras (MVN). Our evaluations demonstrate that modern MLLMs rely on multiple single-image semantic averaging and view preference rather than genuine cross-view synthesis. By exposing these fundamental vulnerabilities, SSMNBench provides a rigorous diagnostic framework to drive the advancement of future cross-view-aware multimodal architectures. The code is available at: \hrefthis https URL\textSSMNBench

[CV-45] 1000 Rallies: An Event-Camera Dataset and Real-Time Learned Ball-State Estimation for Robotic Table Tennis

链接: https://arxiv.org/abs/2606.25620
作者: Raphaela Kreiser,Asude Aydin,Yin Bi,Claudio Fanconi,Peter Dürr,Naoya Takahashi
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Robotic table tennis has emerged as a compelling benchmark for real-time robotic perception due to its fast ball dynamics and stringent timing requirements. Accurate, high-frequency, and low-latency ball state estimation is critical for reliable trajectory prediction and timely control. Traditional frame-based cameras face an inherent trade-off: low frame rates leave temporal blind spots that miss fast-moving objects and high frame rates raise data and computational cost. Event cameras instead offer microsecond temporal resolution and, under sufficient illumination, remain largely free of motion blur even at high ball speeds. However, the community lacks large-scale datasets to develop and benchmark event-based perception in realistic sports scenarios. We address this gap by introducing the first large-scale event-camera dataset for table tennis, comprising over 1000 rallies from a diverse group of players ranging from amateurs to elite-level athletes. Each recording captures the event stream alongside 14 synchronized high-speed frame-based cameras at 200 FPS, which we use to produce 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. Building on this dataset, we train a convolutional neural network robust to background player motion that jointly estimates the ball’s position and velocity in the image-plane from events. Treating the predicted velocity as an additional measurement in the Kalman filter reduces bounce-point prediction error by 36% relative to a position-only baseline. Finally, we close the perception-action loop by integrating the event-based system with a Stäubli robotic arm, enabling the first real-time human-robot table tennis rallies driven by event-based perception.

[CV-46] ScaleHP: Estimating Hand Pose in Metric Space

链接: https://arxiv.org/abs/2606.25619
作者: Ruitao Jing,Xingyu Chen,Hongyang Li,Qing Jiang,Yukai Shi,Lei Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 8 figures, 6 tables; includes supplementary material

点击查看摘要

Abstract:Accurate metric-space hand pose estimation (HPE) is essential for immersive human-computer interaction and robotics. However, most existing methods predict poses in a root-relative coordinate system and cannot estimate the hand in absolute metric scale. In this work, we observe that the intrinsic proportional relationships among human hand bones encode stable anthropometric priors that implicitly correlate with the overall metric size of the hand. Leveraging this insight, we present ScaleHP, an end-to-end one-stage hand pose estimation framework that bypasses fragile extrinsic depth modules to recover the hand in metric space. ScaleHP employs a transformer-based decoder with a novel scale token to fuse multi-scale morphological and appearance features. By solving for metric coordinates through a perspective-constrained least-squares approach, we achieve high-precision pose estimation in the camera coordinate system. ScaleHP delivers state-of-the-art performance, including 35.8 CS-MPJPE on FreiHand and 4.6/5.9 PA-MPJPE on DexYCB and HO3Dv3. These results demonstrate that internal biological constraints significantly reduce relative geometry and absolute metric errors, offering a robust solution for generalized, real-world hand tracking.

[CV-47] Expresso-AI: Explainable Video-Based Deep Learning Models for Depression Diagnosis

链接: https://arxiv.org/abs/2606.25606
作者: Felipe Moreno,Sharifa Alghowinem,Hae Won Park,Cynthia Breazeal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages. Accepted at the 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII). Code: this https URL

点击查看摘要

Abstract:Given the widespread prevalence of depression and its consequential impact on individuals and society, it is crucial to obtain objective measures for early diagnosis and intervention. As a multidisciplinary topic, these objective measures should be interpretable and accessible to health care professionals, ensuring effective collaboration and treatment planning in the realm of mental health care. Even though current automated depression diagnosis approaches improved over the last decade, a critical gap exists as they often lack affect-specificity and interpretability, limiting their practical application and potential impact on mental health care. In particular, interpretability from temporal activities from videos when deep models are used is not fully explored. In this study, we present a novel framework for analyzing Deep Neural Networks’ decisions when trained on facial videos, specifically focusing on automatic depression severity diagnosis. By fine-tuning Deep Convolutional Neural Networks (DCNN) pre-trained on Action Recognition datasets on depression severity facial videos from AVEC depression dataset, our framework is able to interpret the model’s saliency maps by examining face regions and temporal expression semantics. Our approach generates both visual and quantitative explanations for the model’s decisions, providing greater insight into its reasoning. In addition to this interpretability, our video-based modeling has improved upon previous single-face benchmarks for visual depression diagnosis, resulting in enhanced predictive performance. Overall, our work demonstrates the successful development of a framework capable of generating hypotheses from a facial model’s decisions while simultaneously improving depression’s predictive capabilities.

[CV-48] VPA-Guard: Defending and Benchmarking Image-to-Video Generation Against Visual Prompt Attacks

链接: https://arxiv.org/abs/2606.25592
作者: Yining Sun,Haoyu Kang,Jiajun Wu,Heng Zhang,Danyang Zhang,Zhenjun Zhao,Haochen Han,Fangming Liu,Wai Kin Victor Chan,Alex Jinpeng Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dataset Page: this https URL

点击查看摘要

Abstract:Recent advancements in Image-to-Video (I2V) generation have transformed input images from simple appearance references into interactive control interfaces where visual cues such as arrows, sketches, and emojis orchestrate complex video dynamics with unprecedented controllability. However, these seemingly innocuous static cues can be interpreted by models as executable temporal instructions, unfolding into harmful actions in the generated videos. Despite the severity of this threat, existing safety benchmarks remain predominantly focused on text-based and content-only image-based jailbreaks, leaving implicit visual prompt attacks insufficiently explored. To bridge this gap, we present VVA-Bench, the first systematic benchmark for evaluating video generation safety under categorized vision-centric prompt attacks. Extensive experiments on VVA-Bench demonstrate that state-of-the-art models are highly susceptible to such attacks, with Attack Success Rates (ASR) reaching 100.0% on Wan 2.7 and 74.8% on Veo 3.1. To mitigate these risks, we propose VPA-Guard, a retrieval-augmented and self-evolving defense framework. By leveraging few-shot reasoning to identify latent malicious intents, our method reduces the attack ASR by 44.2% and the harmfulness score by 73.4% on average, while maintaining the model’s utility for legitimate user edits. Our work provides both a rigorous benchmark and an effective defense strategy to advance safe and socially responsible multimodal generation.

[CV-49] FeVOS: Foresight Expression Video Object Segmentation ECCV2026

链接: https://arxiv.org/abs/2606.25585
作者: Kehan Lan,Kaining Ying,Henghui Ding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026. Homepage: this https URL

点击查看摘要

Abstract:Existing Referring Video Object Segmentation tasks focus on referring expressions describing events, actions or appearances of relevant objects within the observed frames, lacking evaluation in scenarios that require pre-decisive spatio-temporal reasoning, thereby limiting their applicability. To address this, we propose Foresight Expression Video Object Segmentation, a task that queries future events in upcoming video segments and requires masks of the objects in the observed frames as visual answers. For example, in ego-centric scenes, the question “What tool will be used?” demands reasoning over spatio-temporal cues to predict the masks of the next tool to be used, which helps with the understanding of future actions and decisions. To support this task, we introduce FeVOS, a dataset with 968 video clips, 14,525 foresight expressions, and 2,904 chain-of-thought annotations to provide explicit and interpretable reasoning steps. We further develop FeVOS-R1, an MLLM-based model trained on our dataset via a two-stage pipeline of supervised fine-tuning and reinforcement learning. FeVOS-R1 not only achieves state-of-the-art performance on FeVOS, but also demonstrates strong generalization to existing RVOS benchmarks. We hope this work can inspire more research on predictive reasoning in video perception.

[CV-50] H-Adapter: Pose-Robust Hairstyle Transfer via Attention-Derived Source-Aligned Hair Masks ECCV2026

链接: https://arxiv.org/abs/2606.25578
作者: Seulgi Jeong,Yunseong Cho,Sanghun Park
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026. Project page: this https URL

点击查看摘要

Abstract:Hairstyle transfer has practical applications such as virtual try-on, yet remains challenging when the source and reference exhibit large head-pose discrepancies. We propose H-Adapter, which improves pose robustness by training with a region-specific loss that disentangles hair and non-hair objectives and thereby induces spatially disentangled cross-attention, from which a source-aligned hair edit mask is derived to guide diffusion-based inpainting. Experiments on pose-agnostic and pose-different subsets demonstrate strong quantitative results, including the best FID, \mathrmFID_\mathrmCLIP , and CLIP-I under pose differences, while maintaining competitive non-hair preservation and improving qualitative fidelity to fine-grained reference hairstyle details. Beyond source-conditioned transfer, H-Adapter supports practical extensions including text-to-image generation, auxiliary prompt-based hair color control, and compatibility with an identity-preserving IP-Adapter variant. We also introduce a VLM-as-a-judge protocol and observe consistent gains in hairstyle faithfulness, non-hair preservation, and artifact quality.

[CV-51] Energy-Efficient CNN Acceleration with MSDF Digit-Serial Arithmetic on FPGA CEC

链接: https://arxiv.org/abs/2606.25562
作者: Muhammad Usman,Yousef Sadegheih,Dorit Merhof
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at 2025 32nd IEEE International Conference on Electronics, Circuits and Systems (ICECS)

点击查看摘要

Abstract:This paper presents an energy-efficient hardware acceleration of the convolutional layers in the U-Net architecture for image segmentation, implemented on FPGA. While digit-serial arithmetic, particularly most-significant-digit-first (MSDF) techniques, offers a compact hardware footprint, it suffers from initial latency before producing the first output digit. This delay accumulates in cascaded operations like multiplication followed by addition, where each unit introduces its own startup overhead. To overcome this, we propose a merged multiply-add (MMA) architecture that fuses these operations into a unified pipeline. Instead of incurring separate delays, the MMA introduces a single streamlined latency per iteration, shorter than the combined latency of conventional cascaded units, resulting in enhanced throughput and efficiency. The MMA units are designed to process spatial input depths in parallel, achieving significantly higher performance than both standalone MSDF-based and conventional designs. We evaluate the proposed design using U-Net as a target application. Despite operating at a lower frequency than a CPU, the FPGA-based accelerator achieves up to an order of magnitude higher energy efficiency, delivering up to 15.14 GOPS/W compared to 1.93 GOPS/W for CPU-based inference. The design also shows approximately 9\times reduction in energy consumption compared to MSDF-based FPGA implementations. These results highlight the efficacy of the merged arithmetic approach for resource-constrained, latency-sensitive edge applications in medical imaging and computer vision.

[CV-52] Concept Removal for Frontier Image Generative Models ICML2026

链接: https://arxiv.org/abs/2606.25548
作者: Aditya Kumar,Pierre Joly,Adam Dziedzic,Franziska Boenisch
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICML2026

点击查看摘要

Abstract:Image generative models are trained on massive, largely uncurated internet-scale datasets that contain undesirable visual concepts. Efficiently removing such concepts from the model generations without degrading the quality of output images remains challenging. We introduce a novel concept removal method for frontier diffusion and image autoregressive models, such as SD3.5, Flux, and Infinity. Our intervention replaces the internal bottleneck layer present in all these modern models with a transcoder that is trained to replicate the original layer while structuring it into distinct activation features. This in-place substitution creates an integrated filter through which concept-specific signals can be selectively disabled while preserving the rest of the model’s behavior. Since the intervention modifies the model backbone rather than attaching an external component, it remains persistent under white-box access. Empirically, the approach achieves state-of-the-art concept removal performance across modern diffusion and autoregressive models, maintains visual generation quality, provides robustness against adversarial prompts, and supports sequential removal of diverse concepts. This positions our method as a practical approach for concept removal in frontier image generative models.

[CV-53] Efficient Cross-Scale Invertible Hiding Network with Spatial-Frequency Collaboration and Non-Invertible Mechanism

链接: https://arxiv.org/abs/2606.25547
作者: Junxue Yang,Xin Liao
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: IEEE TNNLS submitted by Junxue Yang, Xin Liao ( this https URL )

点击查看摘要

Abstract:Image hiding aims to conceal image-level messages within cover images at the same resolution. Invertible neural networks (INN)-based image hiding has emerged as an important branch. It treats concealing and revealing as a pair of inverse problems on image domain transformation and uses INN’s forward and backward processes to address them. Due to architectural constraints, existing INN-based methods suffer from single-scale and single-domain feature extraction and limited nonlinear representation capability, resulting in inferior image quality. To mitigate these limitations, we propose an efficient cross-scale invertible hiding network with the spatial-frequency collaboration and the non-invertible mechanism, termed CrosInv. CrosInv exploits cross-scale and spatial-frequency collaborative features while enhancing nonlinear representation. Specifically, we introduce a cross-scale invertible module that bijectively maps inputs to cross-scale representations. To effectively integrate spatial and frequency information, the cross-scale invertible module employs pixel shuffle, Haar wavelet transformation, and their inverse operations for scale transformation. Furthermore, a non-invertible cross dense module is integrated to enhance the nonlinearity. Comprehensive experiments verify the effectiveness and superiority of the proposed CrosInv.

[CV-54] Disease-Centric Vision-Language Pretraining with Hybrid Visual Encoding for 3D Computed Tomography ICML2026

链接: https://arxiv.org/abs/2606.25546
作者: Bowen Shi,Weiwei Cao,Ruifeng Yuan,Wanxing Chang,Wenrui Dai,Hongkai Xiong,Ling Zhang,Jianpeng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026

点击查看摘要

Abstract:Vision-language pre-training (VLP) holds great promise for general-purpose medical AI by leveraging radiology reports as rich textual supervision, yet existing methods struggle with 3D CT imaging due to inefficient visual backbones and coarse semantic alignment. To address these issues, we propose a tailored VLP framework featuring three key components: (1) a CNN-ViT hybrid encoder that replaces ViT’s patch embedding with a 3D CNN backbone to efficiently capture local anatomical details while preserving global attention and compatibility with pre-trained cross-modal priors; (2) a disease-level contrastive learning mechanism using learnable query tokens to dynamically extract disease-specific semantics from full reports and align them with corresponding visual features, thereby disentangling distinct diseases within the same anatomical region; and (3) a diagnosis-aware prompt strategy that employs real clinical phrases and aggregated disease prototypes to bridge the pre-training-inference gap and enhance zero-shot diagnostic reliability. Our model achieves state-of-the-art performance on CT-RATE (84.4% AUC, +5.1%) and Rad-ChestCT (75.4% AUC, +5.4%), with even larger gains (+9.8% AUC) on a challenging 60-disease benchmark, and demonstrates strong transferability to radiology report generation, underscoring the generality and clinical utility of our approach.

[CV-55] nsorLDM: A Component-Wise Latent Diffusion Model for Volumetric DTI Reconstruction from Sparse DWIs

链接: https://arxiv.org/abs/2606.25545
作者: Junhyeok Lee,Kyu Sung Choi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing diffusion tensors from sparse DWIs is critical for accelerating Diffusion Tensor Imaging (DTI) in clinical settings, yet current deep learning approaches frequently yield anatomically inconsistent or physically implausible tensors. We introduce TensorLDM, a component-wise latent diffusion model that processes the six tensor components through two group-specific encoders (for diagonal and off-diagonal elements) while maintaining anatomical consistency via shared DWI conditioning. TensorLDM uses an Anatomy-Conditioned Autoencoder that encourages the latent to focus on tensor properties rather than re-encoding structural information. A shared Cross-Component Attention (CCA) mechanism, applied in both autoencoder refinement and diffusion fine-tuning, models inter-component dependencies, while a Mixture-of-Experts (MoE) DWI conditioner provides component-adaptive conditioning. On the Human Connectome Project (HCP) dataset under a single-shell, four-volume sparse acquisition, TensorLDM produces the most accurate downstream tractography and tensors with near-ground-truth physical validity (SPD-violation rate 1.54% vs. 1.40%), with the best or comparable voxel-wise reconstruction accuracy. Geodesic tensor error measured by the Log-Euclidean Metric (LEM) corroborates these gains.

[CV-56] SAC2-Net: Semantic Anchoring and Complementary-Consensus Fusion for Multimodal Micro-Expression Recognition

链接: https://arxiv.org/abs/2606.25542
作者: Xuepeng Zheng,Tong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micro-expression recognition (MER) is challenging due to subtle facial movements, limited data, and the ambiguous relationship between Action Units (AUs) and emotion categories. Optical flow and motion magnification have been widely used to describe subtle facial dynamics from different perspectives: the former captures local motion displacement, while the latter amplifies weak appearance changes. In this work, we observe that these two modalities often exhibit asymmetric failure patterns: one modality may become noisy, distorted, or uninformative, while the other still preserves discriminative AU-related evidence. This phenomenon reveals their complementarity, but also raises two key challenges for fusion: cross-modal heterogeneity and spatially varying modality reliability. Motivated by this observation, we propose SAC ^2 -Net, a Semantic Anchoring and Complementary-Consensus Network for multimodal MER, which first aligns visual modalities with semantic anchors and then performs reliability-aware fusion. To reduce cross-modal heterogeneity before fusion, we introduce Semantic Anchoring Soft Alignment (SASA), which converts activated AUs into textual prompts and uses them as stable semantic anchors to align motion-magnified and optical-flow representations. Unlike hard contrastive learning, SASA constructs hierarchical AU-aware soft labels to preserve semantic proximity among samples with overlapping or anatomically related AU patterns. Based on the aligned representations, Complementary-Consensus Fusion (CCF) first repairs unreliable local evidence through complementary exchange and then enforces a shared spatial focus through consensus refinement. Extensive experiments on five MER benchmarks show that SAC ^2 -Net achieves state-of-the-art or highly competitive performance across coarse-grained, fine-grained, large-scale, and cross-dataset evaluation settings.

[CV-57] Spatio-Temporal Mixture-of-Modality-Experts Diffusion for Quantitative DCE-MRI Synthesis from Incomplete MR Sequences

链接: https://arxiv.org/abs/2606.25535
作者: Junhyeok Lee,Kyu Sung Choi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantitative maps from dynamic contrast-enhanced MRI (DCE-MRI) are essential for tumor assessment but are often unavailable due to contrast-agent risks and protocol variability. Prior methods predict these maps from other MRI modalities, yet most assume fixed, fully observed inputs and fail under realistic missingness. We present Spatio-Temporal Mixture-of-Modality-Experts (ST-MoME), a conditional diffusion framework that synthesizes 3D DCE parameter maps from diverse subsets of multimodal MRI. ST-MoME fuses modality-specific expert features through a spatio-temporal gating network that produces voxel-wise, timestep-dependent weights, forming a conditioning tensor that guides denoising. To preserve quantitative fidelity, ST-MoME performs diffusion directly in image space with 3D patch-based training and a Swin-based backbone. On a clinical brain-tumor cohort of 386 patients, we evaluate ST-MoME across 16 controlled modality-availability scenarios. It achieves the lowest mean Normalized Mean Square Error (NMSE) aggregated across all three DCE parameters, with leading performance on v_p and v_e , competitive results on K^\mathrmtrans , and the lowest reconstruction error within the clinically critical tumor region. A post-hoc analysis of the learned gating dynamics shows a structural-early, physiological-late fusion schedule consistent with clinical intuition.

[CV-58] PatchINR: Patch-Based Implicit Neural Representations for Efficient and Scalable Inference

链接: https://arxiv.org/abs/2606.25534
作者: Jiachen Ren,Wenyong Zhou,Taiqiang Wu,Yuxin Cheng,Xincheng Feng,Zhengwu Liu,Ngai Wong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit Neural Representation (INR) provides an effective approach for continuous signal modeling, but classical per-pixel inference results in quadratic growth in inference count, leading to dramatically increased computational costs in high-resolution application scenarios. To address this issue, we propose a patch-based approach that treats non-overlapping patches as fundamental processing units and predicts entire pixel patches in a single forward pass, significantly reducing the number of inference queries required. To validate the effectiveness of our approach, we propose a hardware acceleration architecture on the Field Programmable Gate Array (FPGA) platform for the INR model, which features a configurable pipeline and supports dual-precision computation. Our patch-based INR achieves comparable reconstruction quality to pixel-level INR (34.97 dB PSNR with 2 x 2 patches) while reducing inference latency by 75% with only 0.6% parameter overhead.

[CV-59] ASSCG: Just-Right Gating over Chattering for Fast-Slow LLM Planning in Autonomous Driving

链接: https://arxiv.org/abs/2606.25509
作者: Sining Ang,Yuan Chen,Liu Haiyan,Xuanyao Mao,Jason Bao,Xuliang,Bingchuan Sun,Yan Wang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can improve autonomous driving planning but are costly to query online, and existing fast-slow planners often rely on hand-designed triggering rules that either over-call the slow system or call it at the wrong times. We formulate slow-system invocation as a resource-aware sequential decision problem and propose the Adaptive Slow-System Control Gate (ASSCG), which makes frame-level Query/Cache/Drop decisions to refresh, reuse, or suppress slow guidance. ASSCG uses an RWKV backbone for efficient long-horizon gating and is trained with supervised fine-tuning followed by GRPO-style compute-aware reinforcement fine-tuning. We apply ASSCG to two different fast-slow architectures: (i) AsyncDriver on nuPlan Hard20 closed-loop evaluation, where ASSCG improves score to 67.28 (+2.28) while reducing average end-to-end inference latency by 60%; and (ii) a RecogDrive-based dual system that we build by replacing its original VLM-2B module with a lightweight ViT-based fast planner and adding an LLM slow planner, evaluated on NAVSIM, where ASSCG achieves 91.4 PDMS (+0.6) and increases average speed by 25%. The project page, including video visualizations and additional results, is available at this https URL.

[CV-60] C2RM-Seg: Causal Counterfactual Reasoning with Structural-Semantic Priors for Weakly Supervised Histopathological Tissue Segmentation

链接: https://arxiv.org/abs/2606.25508
作者: Hualong Zhang,Siyang Feng,Zihan Huan,Yi Qian,Zhenbing Liu,Rushi Lan,Xipeng Pan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures. Code is available at this https URL

点击查看摘要

Abstract:Histopathological tissue segmentation is essential for computer-aided diagnosis, yet weakly supervised methods often suffer from noisy pseudo-labels generated by Class Activation Mapping (CAM). Existing CAM approaches tend to focus on staining-driven appearance cues rather than true causal tissue morphology, resulting in spurious localization and poor structural consistency. To address this issue, we propose C ^2 RM-Seg, a two-stage framework that integrates causal pseudo-label refinement with structure-aware semantic enhancement. For classification, we introduce a Causal Counterfactual Reasoning Module (C ^2 RM) that decomposes features into latent factors and performs counterfactual intervention via a learned causal structure matrix, suppressing confounding context and producing morphology-aligned CAMs. For segmentation, we design a Dual-Path Structural-Semantic Architecture that combines fine-grained structural features from ResNeSt with global semantic priors from a frozen DINOV3 foundation model. A cross-path gating mechanism adaptively regulates semantic injection using local structural cues to preserve boundary fidelity. To further mitigate residual pseudo-label noise, we propose an Uncertainty-Gated Margin (UGM) loss, which dynamically balances margin enforcement and confidence learning based on prediction uncertainty. Extensive experiments on two public histopathological tissue datasets show that C ^2 RM-Seg achieves state-of-the-art performance.

[CV-61] AISPO: Enhancing Depth Reliability for Robotic Manipulation of Non-Lambertian Objects via Affine-Invariant Shape Prior

链接: https://arxiv.org/abs/2606.25503
作者: Zhiming Chen,Linfang Zheng,Kun Zhang,Hyung Jin Chang,Wei Zhang,Hongyu Yu,Hua Chen
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in IEEE Robotics and Automation Letters. 8 pages. Accepted April 2026

点击查看摘要

Abstract:Reliable depth perception is critical for robotic manipulation, especially for non-Lambertian objects such as transparent or highly specular surfaces, where raw depth measurements are often corrupted or missing. These failures frequently propagate to motion planning, resulting in invalid grasp poses and execution errors. We propose AISPO, a depth completion framework that improves depth reliability for manipulation in challenging sensing conditions. AISPO combines multi-scale RGB-D feature fusion with an affine-invariant shape prior to enforce geometric consistency and mitigate catastrophic depth failures. Unlike methods that focus primarily on average depth accuracy, our approach emphasizes physical plausibility and structural integrity of the predicted depth maps. Extensive benchmark evaluations demonstrate competitive performance and strong generalization to unseen objects and novel scenes. Real-world grasping experiments further show that enhanced depth reliability significantly improves manipulation success rates, particularly for transparent objects where many existing methods fail to produce physically usable depth estimates.

[CV-62] HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

链接: https://arxiv.org/abs/2606.25491
作者: Chuangxin Zhao,Boyan Shi,Yanling Wang,Yijian LU,Canran Xiao,Jiali Chen,Jun Xia,Yan Wang,Ji Qi,Juanzi Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions. We introduce HG-Bench, a benchmark of 500 human-annotated K-12 homework samples curated from a 1,489,278-image source pool, with question-level and step-level boxes linked by a hierarchical containment constraint. HG-Bench is paired with a page-aware evaluation protocol that separately measures complete-answer localization (FA) and step-level decomposition (FSm), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text. Across frontier closed-source APIs and competitive open-weight VLMs, no zero-shot system exceeds 55.22% on FA or 48.22% on FSm, while a GLM-4.6V 9B reference model fine-tuned on ~10k in-domain examples reaches 74.97/72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.

[CV-63] Cross-View Variance Correlation in Path-Traced Stereo:A Hidden Shortcut in Synthetic Training Data

链接: https://arxiv.org/abs/2606.25483
作者: Po-Ting Lin
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Path-traced synthetic stereo data underlie a large fraction of modern disparity-estimation training pipelines. We report a previously unrecognised property of such data: while the Monte Carlo (MC) noise streams of the two cameras are statistically independent, the underlying \emphvariance fields – deterministic per-pixel functions of the rendering integrand – are highly correlated once aligned by the ground-truth disparity warp. Across 20 scenes rendered with Mitsuba~3, the warped Pearson correlation reaches \rho=0.754\pm0.016 across 20 scenes at \mathrmSPP=512 , and on a representative scene remains essentially invariant ( \rho=0.778\pm0.001 ) over a 16\times range of samples per pixel. The effect is strongest in Lambertian regions ( \rho\approx0.78 ) and substantially weaker in glass ( \rho\approx0.30 ), as predicted by an integrand decomposition into view-independent and view-dependent components. A residual-shuffle intervention that breaks the cross-view alignment while preserving the clean image degrades the GT cost margin by 33% on non-glass and the variance-based winner-take-all accuracy on glass by 4.3\times , confirming the structure functions as a matching cue. This signal is unique to MC-rendered data and constitutes a candidate sim-to-real shortcut whose impact on trained networks remains to be quantified.

[CV-64] ACO: Towards Task-Consistent Open-Vocabulary Adaptation in Video Recognition

链接: https://arxiv.org/abs/2606.25478
作者: Minghao Zhu,Xiao Lin,Mengxian Hu,Xun Zhou,Liuyi Wang,Xiaoyan Qi,Chengju Liu,Qijun Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adapting CLIP for open-vocabulary video recognition necessitates a delicate balance between newly acquired video knowledge and the pretrained generalization. While existing studies pursue this generalization-specialization trade-off with additional regularizations or constraints, we argue that they overlook the deviation of representations beyond the fine-tuning data distribution, resulting in suboptimal adaptation effects. We believe such deviation is inherited from the inconsistency between the fine-tuning and evaluation objectives, where model optimization is restricted to the known training distribution but evaluated on unseen ones. In this paper, we introduce \emphTACO, a simple yet effective framework to mitigate the potential negative effects induced by this inconsistency. Our key insight is that adaptation should preserve OOD-relevant alignment beyond the training distribution. To this end, we propose \emphRelative Structure Distillation, which regularizes the relative geometry of the representation space and suppresses harmful alignment shift during training. We further decouple the representation space from the optimization space with a lightweight specialization projection, allowing task-specific adaptation without directly overspecializing the representations used at test time. \emphTACO establishes state-of-the-art performance on diverse benchmarks under cross-dataset and base-to-novel settings. Code will be released at this https URL.

[CV-65] Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

链接: https://arxiv.org/abs/2606.25473
作者: Kaiwen Zheng,Guande He,Min Zhao,Jintao Zhang,Huayu Chen,Jianfei Chen,Chen-Hsuan Lin,Ming-Yu Liu,Jun Zhu,Qianli Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Technical Report

点击查看摘要

Abstract:Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM lies in the complementarity between forward and reverse divergences, represented by consistency models (CMs) and distribution matching distillation (DMD), respectively, in diffusion distillation. This philosophy naturally carries over to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10 \times faster convergence compared to discrete-time CMs (dCMs) (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model. Comments: Technical Report Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2606.25473 [cs.CV] (or arXiv:2606.25473v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.25473 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-66] EchoStyle: Unlocking High-Fidelity Video Stylization with Reverse Data Synthesis

链接: https://arxiv.org/abs/2606.25465
作者: Huaqiu Li,Jiahao Wang,Sijia Cai,Hualian Sheng,Bing Deng,Jieping Ye,Wenhan Luo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While image stylization has been studied extensively, video stylization remains a critical and largely unsolved challenge in the field of intelligent content creation. Existing methods, usually utilizing a reference image as the style prior, suffer from content leakage, data scarcity and limited adaptability to long videos, leading to suboptimal results with severe style drift and motion distortion. For these issues, we present EchoStyle, a scalable text-driven framework to achieve high-quality stylization of videos with arbitrary lengths. To start with, we construct a video-to-video architecture to appropriately re-fuse the video content and the text style. To address data scarcity, we pioneer an automatic reverse-synthesis pipeline to establish V-Style20k, a large-scale stylization dataset of 20k high-quality video pairs. To facilitate long video stylization, we devise an init-follow-mode mechanism along with a sliding-window inference strategy. Extensive experiments demonstrate EchoStyle’s excellent performance across a wide range of artistic styles, even comparable to leading closed-source solutions.

[CV-67] C3-Bench: A Context-Aware Change Captioning Benchmark ECCV2026

链接: https://arxiv.org/abs/2606.25445
作者: Jae-Woo Kim,Hyeongbeom Kim,Ue-Hwan Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ECCV 2026 Camera-ready version

点击查看摘要

Abstract:While Change Captioning systems have garnered substantial attention to respond to our evolving world, their true performance on diverse real-world change contexts remains largely unexplored due to the lack of comprehensive evaluation frameworks. To fill this gap, we propose C3-Bench, a comprehensive benchmark for evaluating Context-aware Change Captioning. C3-Bench features: (1) 4,996 human-labeled image pairs of 51 real-world change contexts across four domains (e.g., natural scenes, remote sensing imagery, image editing, and anomalies), each with diverse, carefully curated scenarios derived from multiple change-centric communities; and (2) the first LLM-as-Judge evaluation framework in the change captioning task that measure fine-grained dimensions (e.g., correctness, specificity, fluency, and relevance), along with a novel reversibility metric exploring whether models understand changes with symmetric consistency. Based on C3-Bench, we benchmark 32 models – including conventional change captioning models, proprietary Large Multimodal Models (LMMs), and 2B-90B open-source LMMs. We reveal a fundamental blind spot in the prevailing change captioning paradigm: Once the change context departs from training-style regimes, conventional models collapse, and even state-of-the-art LMMs such as GPT-5.2 exhibit systematic domain- and position-dependent errors that distort reliable change understanding. By making these hidden failure modes explicit and measurable, we delineate the next frontier for building generalizable and trustworthy change captioning systems. All codes and datasets are publicly available on the project page.

[CV-68] LinStereo: Linear-Complexity Global Attention for Multi-Scale Iterative Stereo Matching

链接: https://arxiv.org/abs/2606.25437
作者: Yiran Wang,Oliver Turner,Viorela Ila
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Vision Foundation Model (VFM)-based iterative stereo pipelines under-exploit three information pathways: multi-scale backbone features are collapsed into single-level correlations, geometric priors remain untapped at initialization, and context propagates only locally. These gaps widen under degraded photometric cues, making underwater scenes a stringent generalization test. To address this, we propose LinStereo, built upon Depth Anything V3, whose core is a Position-Aware Linear Attention (PALA) module that replaces local recurrence with global aggregation at linear cost, propagating reliable estimates from well-matched regions into degraded areas while preserving disparity structure. PALA is made effective by two enabling components: Hierarchical Semantic Cost Volumes (HSCV), which supply scale-aligned correlations from the VFM feature hierarchy, and a Depth Prior Initialization (DPI) that converts monocular depth into a metrically calibrated warm start. LinStereo achieves state-of-the-art-level accuracy on standard benchmarks and strong cross-domain generalization, particularly on underwater scene where severe photometric degradation makes stereo matching particularly challenging, attaining the best overall accuracy with consistent gains 28% lower AbsRel on TartanAir-UW, 26% on SQUID, a real-world underwater dataset).

[CV-69] Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

链接: https://arxiv.org/abs/2606.25432
作者: DatologyAI:Matthew L. Leavitt,Siddharth Joshi,Haoli Yin,Rishabh Adiga,Haakon Mongstad,Alvin Deng,David Schwab,Bogdan Gaza,Ari Morcos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, see this https URL for more information

点击查看摘要

Abstract:Inference efficiency is typically pursued by shrinking the model: distillation, pruning, quantization, and sparse routing each lower per-token cost while treating token count as fixed. But output length has been inflating, and it is precisely the component the standard toolkit leaves untouched. Here, we argue that brevity is the missing inference-efficiency lever, and that pretraining data curation is a practical way to pull it: a model trained on concise, correct data learns to answer in fewer tokens; i.e. it has a lower Cost-of-Pass. We apply our VLM curation pipeline to the MAmmoTH-VL single-image subset, and compare models trained on our curated data, the standard MAmmoTH-VL data, and external open-weight frontier VLMs. On a controlled 20-evaluation set and 14 VLMs at 1B-4B activated parameters, we hold output length fixed with a per-model regression, separating brevity from quality, and price models in FLOPs per correct answer. Curation buys a 35x Cost-of-Pass advantage over the most verbose 4B comparator (Qwen3.5-4B) within \sim 1 pp of accuracy (0.41 vs 14.58 TFLOPs per correct answer; 0.691 vs 0.704 mean accuracy). Curation also buys a +17.55-percentage-point matched-length accuracy gain over the uncurated baseline that grows with model scale (from +16.7 pp at 1B to +21.2 pp at 4B). This brevity improvement concedes no quality: generic verbosity buys no accuracy at any capability or scale, and the window where reasoning-structured verbosity still earns its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B. Per example, the concise model even reaches correct answers the verbose reasoning model misses, marking reasoning as a distinct curation target rather than something brevity gives up. Inference efficiency in this regime is a tokens-per-correct problem, and brevity is the lever that targets it directly.

[CV-70] PRISM: Feed-Forward Single-Image 3D Reconstruction via Geometric Warp-Residual Modeling

链接: https://arxiv.org/abs/2606.25430
作者: Zhijie Zheng,Xinhao Xiang,Jiawei Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing 3D scenes from a single image is a fundamental challenge in computer vision, with broad applications in virtual reality, robotics, and content creation. Recent methods achieve outstanding performance by leveraging camera-controlled video diffusion models, but rely on iterative diffusion sampling, which greatly limits their practical deployment. We observe that geometric forward warping alone can cover the majority of a target view directly from the input image, with only a compact residual left for the encoder to correct. Motivated by this observation, we propose PRISM, a feed-forward framework that decomposes multi-view latent prediction into a parameter-free geometric prior and a learned residual correction, with no diffusion sampling required at inference. To enable generalization from purely synthetic training data, we devise a two-stage training strategy combining latents supervised distillation for geometric generalization and perceptual fine-tuning for appearance quality optimization. Extensive experiments on three benchmarks demonstrate that PRISM achieves competitive reconstruction quality compared with diffusion-based methods, while reducing inference time dramatically to only 36 seconds per scene.

[CV-71] Gastroendoscopy View Synthesis: A New Real Dataset and Evaluation WWW

链接: https://arxiv.org/abs/2606.25427
作者: Masaki Minai,Yusuke Monno,Masatoshi Okutomi,Sho Suzuki
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for EMBC 2026. Project page: this http URL

点击查看摘要

Abstract:Novel view synthesis (NVS) is an active research topic in computer vision, owing to the success of neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) methods. While NVS opens the door to potential applications in gastroendoscopy, such as extending the field of view of endoscopic images and enabling digital twins for 3D archiving and endoscopist manipulation training, the dataset is insufficient to evaluate NVS for gastroendoscopy. In this paper, we present the first real gastroscopy dataset for NVS, namely the GastroNVS dataset, which contains a set of gastroscopic images, camera poses, and a point cloud for real gastroendoscopy inspection. To assess the suitability of the GastroNVS dataset, we evaluate several 3DGS methods and discuss the challenges for future development. The dataset is available on request from our project page.

[CV-72] ach-to-Reason : Competition-Guided Reasoning with a Self-Improving Teacher

链接: https://arxiv.org/abs/2606.25407
作者: Xiao Han,Hao Liu,Zhimin Bao,Jile Jiao,Yue Wang,Hui Guo,Xiaofeng Mou,Yi Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chest X-ray visual question answering (CXR VQA) requires models not only to predict correct answers, but also to produce reliable medical reasoning. However, existing reinforcement-learning-based training typically relies on answer-level rewards, which are often too coarse to improve chain-of-thought (CoT) quality and can become ineffective when group-level advantages collapse to zero. We propose \textbfTeach-to-Reason (T2R), a framework that introduces comparison-based supervision into CoT optimization through a self-improving \emphTeacher and a competition-guided \emphReasoner. As the Teacher is iteratively strengthened via self-competition, the Reasoner is optimized against progressively stronger Teacher-generated references. We further introduce a case-wise reward design that preserves the original reward-induced positive/negative partition when it is informative, and restores supervision from competition scores when the original reward signal degenerates. Experiments on multiple CXR open-ended VQA benchmarks show that T2R consistently outperforms strong baselines, indicating that comparison-based supervision, when integrated in a controlled and principled manner, provides a more effective training signal for reasoning optimization.

[CV-73] Anatomically-conditioned Latent Diffusion Model for Data-Efficient Few-Shot Cross-Domain 3D Glioma MRI Synthesis

链接: https://arxiv.org/abs/2606.25390
作者: Salman Shaik,Truong Thanh Hung Nguyen,Hung Cao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in Canadian AI 2026

点击查看摘要

Abstract:Accurate classification of diffuse gliomas is often hindered by domain shifts across centers and a lack of large, annotated datasets. We propose the Anatomically-conditioned Latent Diffusion Model (ALDM), a novel framework for data-efficient, few-shot 3D volumetric MRI synthesis. ALDM utilizes a two-stage approach: a 3D variational autoencoder learns anatomical priors from a data-rich source domain, while a conditional latent diffusion model, guided by tumor masks via a ControlNet, generates structurally coherent volumes for a data-scarce target domain. Evaluated in an extreme few-shot setting with only 16 target images, ALDM outperformed GAN and hybrid baselines, achieving a superior Frechet Inception Distance (FID) of 85.40 and a downstream classification AUC of 0.987. Qualitative results confirm that the model preserves sharp pathology boundaries and cross-modal consistency, with visual fidelity improving progressively during training. By capturing essential diagnostic features, ALDM provides a robust tool for clinical data augmentation in low-resource settings. Our implementation is available at this https URL.

[CV-74] ransferable Attack against Face Swapping in an Extended Space

链接: https://arxiv.org/abs/2606.25376
作者: Mingzhi Lyu,Yi Huang,Jun Xie,Zihao Zhao,Hong Xu,Adams Wai-Kin Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although deep Face Swapping (FS) models may benefit the entertainment industry, they pose severe threats to privacy and security. Existing protections, including deepfake detection and adversarial perturbation, are either passive responses or ineffective to unseen subject-agnostic FS models. In this paper, we propose a transferable attack against subject-agnostic FS models named Additive Identity attack based on a Relighting function (AIR). AIR leverages reillumination and additive perturbations to mislead the identity extraction modules in subject-agnostic FS models. By using these two types of perturbations simultaneously, the attack space is extended such that stronger but more visually natural adversarial examples can be identified. To further enhance the visual quality while preserving the effectiveness of the attack, an adaptive translation-invariant operation and an illumination control scheme are designed for AIR. Unlike other methods, AIR does not require a surrogate FS model to achieve high transferability. In addition, a mathematical proof is given for the extension of the attack space. Extensive experiments using 1000 image pairs across various state-of-the-art subject-agnostic FS models, including GAN and diffusion-based FS models, show that AIR surpasses all existing attacks in terms of both attack success rate and image quality.

[CV-75] Beyond Visual Forensics: Auditing Multimodal Robustness for Synthetic Medical Image Detection MICCAI2026

链接: https://arxiv.org/abs/2606.25375
作者: Ching-Hao Chiu,Hao-Wei Chung,Gelei Xu,Xueyang Li,Pin-Yu Chen,John Kheir,Meysam Ghaffari,Carlos Morato,Ahmed Abbasi,Yiyu Shi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at MICCAI 2026

点击查看摘要

Abstract:With the rapid adoption of generative AI, synthetic medical images pose growing risks, including diagnostic deception and insurance fraud. Although prior work has explored vision-language model (VLM)-based synthetic image detection, these evaluations typically consider images in isolation. In clinical practice, however, images are interpreted alongside structured records and metadata, and VLMs are increasingly deployed under joint image-record inputs. We uncover a previously underexamined multimodal vulnerability: when given both modalities, VLMs may overweight record context in authenticity judgments, such that the same image receives different predictions solely due to changes in its accompanying text. This raises concerns about robustness in real-world deployment. To systematically characterize this effect, we reformulate synthetic medical image detection as an audit of multimodal robustness at the image-record interface and introduce a paired benchmark that holds the image fixed while swapping controlled metadata variants. Across multiple imaging modalities, we evaluate diverse open-weight and frontier API VLMs and quantify how metadata alone shifts authenticity predictions. Our benchmark provides a standardized tool for assessing and improving multimodal robustness beyond image-only settings. The code is available at this https URL.

[CV-76] Hypergraph Normal World Models for Logical Visual Anomaly Detection

链接: https://arxiv.org/abs/2606.25368
作者: Weizhi Nie,Zibo Xu,Weijie Wang,Yuting Su
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 10 figures

点击查看摘要

Abstract:Visual anomaly detection is often deployed with only normal training images. Most one-class detectors map test patches or features to a normal reference distribution. This works well for local structural defects. Logical anomalies are different. Each visible part may look normal, while the whole image violates a normal count, co-occurrence, or spatial relation. This paper studies whether a model can learn such a category-specific normal world from nominal images alone. We propose the Hypergraph Normal World Model, a normal-only detector that distills frozen DINOv2 patch tokens into patch, relation, and hypergraph statistics. It builds spatial hyperedges over token groups. It then scores each test image with an information quotient that separates local, relational, hyperedge, and hyperedge-relation evidence. On the available MVTec LOCO breakfast-box validation data, the full hypergraph model improves logical anomaly AUROC from 0.8434 for DINOv2 patch-kNN to 0.9279. It also improves over the non-hypergraph variant, from 0.9013 to 0.9279. Few-shot experiments show that the model remains effective with very limited normal images. We also test whether the score reflects normal-world knowledge rather than a shallow mapping. t-SNE separates logical anomalies in the learned energy space. Relation counterfactuals increase the information quotient by 83.13 on average. Random hypergraphs reduce logical AUROC, and hyperedge attribution is much larger on logical anomalies. Qualitative examples show that high scores are driven by relation-bearing terms. These results suggest that logical visual anomaly detection should model normal relations, not only normal local patches.

[CV-77] Geometry-Anchored Transport Framework for Exemplar-Free Class-Incremental Learning ECCV2026

链接: https://arxiv.org/abs/2606.25347
作者: Hongye Xu,Bartosz Krawczyk
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. 17 pages, 4 figures, 3 tables. Code: this https URL

点击查看摘要

Abstract:Exemplar-free class-incremental learning (EFCIL) requires stable decision boundaries within a shifting feature space. While maintaining class-conditional Gaussian statistics provides a principled classification strategy, these parametric summaries remain sensitive to anisotropic representation drift. Existing methods often transport these statistics across tasks using a decoupled, post-hoc paradigm: optimizing a backbone without explicit geometric constraints can distort the legacy manifold, limiting the precision of retroactive alignment. In this paper, we formulate feature transport as an endogenous training constraint rather than a separate post-task step, presenting the Geometry-Anchored Transport Framework. First, we derive an Analytic Geometric Anchor via Mahalanobis-aligned regression to mitigate macroscopic anisotropic drift. Second, we introduce a Topology-Aware Evolution objective that regularizes localized manifold degradation while calibrating a residual network against the analytic prior. By coupling manifold evolution with transport constraints during the primary training phase, our framework mitigates evaluation errors without requiring decoupled fine-tuning. Experiments across CIFAR-100, TinyImageNet, and ImageNet-100 demonstrate that the proposed framework consistently improves upon existing post-hoc alternatives under strict exemplar-free constraints.

[CV-78] Follow Your Track: Precise Skeleton Animation Controlled by 3D Trajectories

链接: https://arxiv.org/abs/2606.25344
作者: Yueting Liu,Yanqin Jiang,Nian Liu,Jingmen Zhou,Zhengjun Zha,Weiming Hu,Jin Gao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:4D generation aims to animate 3D objects with realistic motion, holding great promise for applications. Existing methods typically decouple 3D asset generation from motion synthesis: acquire a 3D asset, prepare a structural representation like mesh and Gaussians, and synthesize motion from text or video control signals. However, dense mesh and Gaussian representations incur high computational costs and are prone to temporal artifacts, limiting animation quality and duration to only short clips. Meanwhile, text lacks fine-grained spatial and temporal details such as timing and coordination, while video entangles motion with appearance and background. Together, these limitations result in 4D animations that suffer from poor temporal consistency, wrong identification, and limited controllability. We address these issues with \textttACT, a trajectory-conditioned framework for topology-general skeletal animation. ACT uses skeletons as a compact structured and compute-efficient representation and 3D point trajectories from monocular video as explicit motion guidance which provide detailed motion patterns without appearance entanglement. At the core of ACT is a Routed Trajectory Injector, which achieves accurate and robust trajectory-to-joint transfer through three complementary designs: prior-guided hard routing establishes precise skeleton-to-mesh correspondences, global routing enables holistic joint-track interaction for full-body motion awareness, and local windowed cross-attention enforces fine-grained temporal alignment, improving micro-timing and reducing motion misalignment across varying motion rates. Extensive experiments demonstrate that \textttACT significantly outperforms existing methods in fidelity and temporal consistency.

[CV-79] Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity ECCV2026

链接: https://arxiv.org/abs/2606.25343
作者: Heethanjan Kanagalingam,Thenukan Pathmanathan,Mokeeshan Vathanakumar,Basim Azam,Sarah Monazam Erfani
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to presentation at ECCV 2026

点击查看摘要

Abstract:Vision Language Models have achieved near-human performance on single-document Visual Question Answering, yet their effectiveness degrades significantly when retrieving information from large collections of visually homogeneous documents. Existing multi-document benchmarks aggregate diverse document types, creating artificial separation in embedding space that does not reflect enterprise document repositories where thousands of records share identical visual templates. We identify this as embedding collapse and introduce Invoice Haystack, a benchmark with 1,500 anonymized invoice images paired with 200 discriminative question-answer pairs, specifically designed to stress-test retrieval under strong visual homogeneity. Invoice Haystack exhibits a mean pairwise cosine similarity of 0.73, compared to 0.38 (DocHaystack) and 0.31 (InfoHaystack) in existing benchmarks, posing a fundamentally more challenging retrieval problem. Addressing the identified challenge, we propose VL-RAG, a hybrid retrieval-augmented generation framework that jointly leverages text and visual embeddings to harness the complementary strengths of both modalities, followed by a VLM-based verification filter for precise document identification. VL-RAG achieves 60.0% Recall@1 on Invoice Haystack-500, outperforming existing state-of-the-art method by up to an absolute 13.5 percentage points. It further improves retrieval considerably on DocHaystack-1000 (77.1% vs.\ 75.2%) and InfoHaystack-1000 (84.5% vs.\ 80.0%), establishing the proposed dual-stream fusion as a consistently superior retrieval strategy across both homogeneous and heterogeneous document collections.

[CV-80] State Space Models Meet Remote Sensing: A Survey

链接: https://arxiv.org/abs/2606.25329
作者: Qinzhe Yang,Chenyang Liu,Jia Xu,Zhenwei Shi,Zhengxia Zou
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 25 pages, 5 figures, has been published in SCIS SCIQ1 IF=8.1 this https URL

点击查看摘要

Abstract:State Space Models (SSMs), designed for long-range modeling, offer linear computational complexity and strong capabilities in capturing long-range dependencies. In the field of remote sensing, SSMs have gained popularity due to their effectiveness in addressing unique challenges such as dense visual predictions, multi-modal remote sensing data, and temporal remote sensing data, which have also yielded significant advancements in customized architectures. This paper presents a comprehensive review of SSM-based approaches in remote sensing, covering most of the relevant studies since SSMs were first introduced to the field. We offer a multi-dimensional analysis examining SSM applications in remote sensing tasks and discussing advancements in architecture design. This paper not only synthesizes the rapid progress in SSM-based research but also identifies key challenges and future opportunities. By providing a detailed perspective, this paper aims to serve as a foundational resource for remote sensing researchers, offering actionable insights to foster further advancements in this evolving domain. We will keep tracing related works at this https URL.

[CV-81] Efficient Remote Sensing Instance Segmentation with Linear-Time State Space Distilled Visual Foundation Models

链接: https://arxiv.org/abs/2606.25324
作者: Qinzhe Yang,Keyan Chen,Jia Xu,Zhenwei Shi,Zhengxia Zou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures, has been published in IEEE TGRS vol. 64, pp. 5625417-5625417, 2026, Art no. 5625417, doi: https://doi.org/10.1109/TGRS.2026.3696104

点击查看摘要

Abstract:The computational complexity of Transformers scales quadratically with the number of tokens, which significantly constrains the efficiency of vision models, particularly recent ViT-based foundation models in dense prediction tasks. Instance segmentation, a typical dense visual prediction task in the remote sensing field, faces similar challenges. In this paper, inspired by the recent advances of knowledge distillation in large language models, we introduce RS4D - a new remote sensing instance segmentation method with linear computational complexity, which addresses the inefficiency of long sequence modeling through distilled state space modeling (SSM). We propose an adaptive noise and masking knowledge distillation training method for pre-training lightweight SSM backbones, which effectively compresses knowledge from the vast self-attention space into a compact, dense linear state space. We also design a remote sensing image instance segmentation architecture based on this lightweight visual encoder, where we explore variants of three different backbones and two segmentation heads. Extensive experiments are conducted on multiple benchmark datasets, including SSDD, WHU, and NWPU. Compared to ViT-based approaches, our proposed SSM backbone achieves an 8x reduction in parameters and a 9x reduction in FLOPs while maintaining comparable or superior accuracy to both ViT- and CNN-based instance segmentation methods. The implementation codes have been publicly available at this https URL.

[CV-82] V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

链接: https://arxiv.org/abs/2606.25319
作者: Haoxiang Sun,Zhihang Yi,Langxuan Deng,Yuhao Zhou,Peiqi Jia,Jian Zhao,Li Yuan,Jiancheng Lv,Tao Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5 \times faster than previous supervised fine-tuning methods and more than 10 \times faster than reinforcement learning baselines. Code and dataset will be released at this https URL

[CV-83] REViT: Roto-reflection Equivariant Convolutional Vision Transformer ICML2026

链接: https://arxiv.org/abs/2606.25318
作者: Sheir A. Zaheer,Alexander C. Holston,Chan Y. Park
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at ICML 2026

点击查看摘要

Abstract:In this paper, we propose a discrete roto-reflection group equivariant vision transformer with convolutional attention. Roto-reflection equivariant networks preserve the rotational, flip and positional symmetry in feature maps, making them useful for tasks where orientation of the inputs is relevant to the model outputs. In image classification and object detection, most of the studies on roto-reflection equivariant models have focused on using convolutional neural networks rather than vision transformers. In this paper, we examine the challenges involved in achieving equivariance in vision transformers, and we propose a simpler way to implement a discretized roto-reflection group equivariant vision transformer. The experimental results demonstrate that our approach outperforms the existing approaches for developing discrete roto-reflection group equivariant neural networks for image classification.

[CV-84] ESTANet: Efficient Online Error Detection in Procedural Videos via Prediction Inconsistency ECCV

链接: https://arxiv.org/abs/2606.25317
作者: Shih-Po Lee,Reza Ghoddoosian,Faizan Siddiqui,Enna Sachdeva,Behzad Dariush
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures, uses this http URL

点击查看摘要

Abstract:An efficient and accurate system for detecting errors in procedural tasks is crucial for supporting human needs in daily life, as it can provide instant notifications and guide people to correct mistakes. In this work, we study real-time online error detection in procedural videos from a simple but overlooked perspective: the prediction behavior of action detectors themselves. Instead of designing complex architectures or specialized supervision, we observe that action detectors naturally exhibit different prediction characteristics depending on their sensitivity to input dynamics and temporal context. We therefore propose ESTANet (Error-Sensitive and Temporally-vArying Network), a lightweight framework that detects errors by exploiting inconsistencies among action predictions produced by a small set of action detectors. We construct standard and error-sensitive action detectors that behave similarly on correct executions but respond differently when errors occur. Meanwhile, detectors operating with different temporal contexts further amplify prediction inconsistencies when the procedure deviates from the intended sequence. During inference, we detect errors by aggregating mismatches between standard and error-sensitive predictions through majority voting to flag frames that contain errors. Extensive experiments on EgoPER, Assembly-101-O, and EPIC-Tent-O demonstrate that ESTANet achieves state-of-the-art performance in online error detection while maintaining real-time efficiency with a lightweight architecture. Our results highlight that leveraging the intrinsic properties of action detectors can yield a powerful and practical solution for online error detection without increasing architectural design complexity.

[CV-85] LEVIRDet: A Million-Scale 159-Category Dataset and Foundation Model for Universal Remote Sensing Object Detection

链接: https://arxiv.org/abs/2606.25312
作者: Qinzhe Yang,Dongyu Wang,Haohan Niu,Jia Xu,Zhenwei Shi,Zhengxia Zou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 9 figures

点击查看摘要

Abstract:Remote sensing object detection has advanced rapidly with the development of large-scale benchmarks and modern detection architectures. However, existing datasets and detectors remain fragmented. Most benchmarks focus on limited categories, fixed spatial resolutions, or a single sensor, while detectors still struggle to work across different sensors and categorical systems. In this paper, we introduce LEVIRDet-159, the largest and most comprehensive remote sensing object detection dataset to date, with 159 categories, 2.56 million bounding boxes, and 700k fine-grained annotations under a multi-level taxonomy. In each key scale dimension, LEVIRDet-159 exceeds the corresponding largest existing remote sensing object detection dataset, containing approximately (7x) more images, (6x) more object instances, and (4x) more categories. Based on this dataset, we design LEVIRDetNet, a scale-hierarchy-aware detection foundation model for universal remote sensing object detection. LEVIRDetNet couples online visual Ground Sampling Distance (GSD) prediction, GSD-conditioned query modulation and allocation, and a hierarchy-aware detection head for mixed-granularity remote sensing supervision. Under stringent evaluation settings, LEVIRDetNet demonstrates strong cross-domain generalization. Even without target-domain training or fine-tuning, it achieves state-of-the-art detection performance on 9 external benchmarks, improving the strongest fully supervised competing methods by 5.02 mAP on average under each benchmark’s primary metric. We hope this study will facilitate the development of strongly generalizable remote sensing object detection across diverse category systems, spatial resolutions, and sensor platforms. The dataset and trained models will be released at this https URL, accompanying the final paper.

[CV-86] Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation ECCV2026

链接: https://arxiv.org/abs/2606.25306
作者: Atin Pothiraj,Jaemin Cho,Yue Zhang,Elias Stengel-Eskin,Mohit Bansal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ECCV 2026. Code and data: this https URL

点击查看摘要

Abstract:Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies within questions, ensuring that each query is contextually valid. Moreover, PQSG provides granular assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG’s fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.

[CV-87] HiFiVe: High-Fidelity Vehicle Generation Leverag ing Auto-Regressive 2D Generative Priors

链接: https://arxiv.org/abs/2606.25300
作者: Hongli Xiao,Youjian Zhang,Qi Zheng,Zhaohui Hu,Yaohui Jin,Xiaoguang Ren,Wenjing Yang,Long Lan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing 3D vehicle generation methods often suffer from low geometric fidelity and blurry textures, hindering their downstream applications. While recent works adopt multi-view diffusion models for high-fidelity texture, they are often constrained by fixed viewpoints, limited resolution, and a reliance on costly fine-tuning to achieve cross-view consistency. In this paper, we propose HiFiVe, a training-free framework for high-fidelity vehicle modeling through joint texture and geometry enhancement by imposing 3D geometric constraints to anchor 2D generative priors. Specifically, we propose an auto-regressive texture refinement pipeline that progressively synthesizes high-resolution textures from arbitrary viewpoints. To ensure cross-view consistency, the coarse geometry serves as a synchronization prior, conditioning each generation step on previously synthesized frames via depth-based warping and multi-view texture fusion. Moreover, the inherent symmetry of vehicles is exploited to mitigate error accumulation. Finally, high-frequency surface details are recovered by refining the mesh geometry using normal maps estimated from the enhanced textures. Extensive experiments on synthetic and real-world vehicle datasets demonstrate that our method significantly improves both geometric detail and texture quality compared to state-of-the-art baselines.

[CV-88] KidRisk: Benchmark Dataset for Children Dangerous Action Recognition

链接: https://arxiv.org/abs/2606.25298
作者: Minh-Kha Nguyen,Trung-Hieu Do,Kim Anh Phung,Thao Thi Phuong Dao,Minh-Triet Tran,Trung-Nghia Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2024

点击查看摘要

Abstract:Children are naturally energetic, and during their spontaneous activities, they often encounter potentially dangerous situations, especially when lacking parental supervision. Identifying actions that pose risks plays a crucial role in ensuring their safety. This paper build a novel challenging dataset, namely KidRisk, including 2,500 short videos of children’s actions and 10,000 images for dangerous action of children. We also introduce a benchmark on our newly constructs dataset and find that traditional deep learning models demonstrated limited effectiveness on these tasks. Therefore, we develop vision-language based baselines with exceptional context understanding of visual information. Our proposed methods achieved an accuracy of 83.53% in classifying children’s actions and 96.14% in recognizing children’s dangerous actions, significantly outperforming traditional approaches. These results confirm that vision-language models are not only feasible but also highly effective in detecting hazardous actions, contributing positively to safeguarding children’s safety.

[CV-89] Minimalist Preprocessing Approach for Image Synthesis Detection

链接: https://arxiv.org/abs/2606.25297
作者: Hoai-Danh Vo,Trung-Nghia Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2024

点击查看摘要

Abstract:Generative models have significantly advanced image generation, resulting in synthesized images that are increasingly indistinguishable from authentic ones. However, the creation of fake images with malicious intent is a growing concern. Low-configured smart devices have become highly popular, making it easier for deceptive images to reach users. Consequently, the demand for effective detection methods is increasingly urgent. In this paper, we introduce a simple yet efficient method that captures pixel fluctuations between neighboring pixels by calculating the gradient, which highlights variations in grayscale intensity. This approach functions as a high-pass filter, emphasizing key features for accurate image distinction while minimizing color influence. Our experiments on multiple datasets demonstrate that our method achieves accuracy levels comparable to state-of-the-art techniques while requiring minimal computational resources. Therefore, it is suitable for deployment on low-end devices such as smartphones. The code is available at this https URL.

[CV-90] Evaluation Protocols and Validation for Cameras in Indoor Healthcare Monitoring

链接: https://arxiv.org/abs/2606.25284
作者: Amirhossein Dadashzadeh,Jingjing Liu,Qianhui Men,Qiushuo Cheng,Kirsty Scott,Lisa Alcock,Ian Craddock,Majid Mirmehdi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camera-based monitoring systems are increasingly adopted in healthcare settings for the continuous assessment of patient movement and activities. However, their technical performance under real-world indoor conditions remains insufficiently characterised, preventing appropriate camera selection for clinical or home adoption and reproducibility. Existing validation studies typically assess either device metrological performance or algorithm accuracy in isolation, and often do not systematically account for practical deployment factors, such as lighting variability, occlusions, and camera positioning. We present two technical validation protocols: the first evaluates the metrological performance of RGB and RGB-D cameras, and the second assesses their use in supporting human pose estimation, validated using state-of-the-art pose estimators. The proposed protocols systematically assess five cameras, four RGB-D and one RGB, under controlled variations in lighting, camera height, viewing angle, and occlusion level within representative indoor scenarios. The experimental results show that metrological performance varies substantially across cameras, with depth bias at 5 m ranging from 50 mm to over 1400 mm depending on the device. For 2D pose estimation, all cameras achieve broadly comparable accuracy, with mean mAP between approximately 78% and 90% across cameras and estimators, whereas 3D reconstruction error differs markedly across devices, with MPJPE ranging from 104 mm to 365 mm, closely reflecting underlying depth-sensing quality. Environmental factors have a camera- and estimator-dependent effect on 3D performance, while camera mounting height has minimal influence within the evaluated range. This work provides evidence-based guidance for the selection and deployment of cameras in healthcare monitoring applications, addressing an important gap in current technical validation practice.

[CV-91] MRI2Rep: Autoregressive Structured Report Generation for 3D Liver MRI MICCAI2026

链接: https://arxiv.org/abs/2606.25279
作者: Xinran Li,Junlin Yang,Annabella Shewarega,Zongwei Zhou,Julius Chapiro,James S. Duncan,Lawrence H. Staib
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2026

点击查看摘要

Abstract:Manual reporting of 3D MRI studies is time-consuming, yet end-to-end structured report generation for 3D liver MRI remains underexplored due to volumetric complexity and scarce paired data. We propose MRI2Rep, an autoregressive framework for liver MRI report generation. From 3,929 real-world MRI-report pairs acquired over a 10-year single-institution cohort, a Report-to-Label Canonicalization (RLC) module converts free-text reports into structured, closed-vocabulary diagnostic sequences without lesion-level annotations. On a held-out test set, MRI2Rep achieves 76.0% case-level sensitivity, 29.4% lesion-level F1, compared with no more than 8.3% for adapted medical vision-language baselines, and 82.4% liver-level accuracy. In a blinded reader study, two radiologists rated 75% and 70% of AI-generated reports as clinically acceptable, compared with 95% and 100% for original reports. Our automated LLM-based judge, LLM-Eval, rated 61.8% of AI-generated reports as acceptable, applying a stricter standard and supporting its use as a conservative proxy. To our knowledge, this is the first end-to-end LI-RADS-structured reporting system for 3D liver MRI.

[CV-92] Heterogeneous and Adept Snapshot Distillation for 3D Semantic Segmentation

链接: https://arxiv.org/abs/2606.25278
作者: Xiaopei Wu,Yuenan Hou,Junkai Xu,Wenxiao Wang,Binbin Lin,Yu Li,Ping Li,Haifeng Liu,Deng Cai,Wanli Ouyang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Multi-modal fusion and multi-model ensembling are prevalent in enhancing the performance of 3D semantic segmentation. Despite the impressive performance, these methods either rely on auxiliary input signals or suffer from costly computational expense. To efficaciously enhance the segmentation performance without introducing intolerable costs, we propose to transfer the rich knowledge from the multi-modal model (i.e., point clouds and images) and multiple model experts to the point-cloudbased network through knowledge distillation. Specifically, we present Information-oriented Heterogeneous Distillation (IHD) to help the uni-modal model absorb the complementary knowledge from the multi-modal teacher. We design the Information-Oriented Filtering (IOF) strategy to select informative images from the continuous image sequence for multi-modal fusion. This practice can boost the performance of the multi-modal teacher, thus benefiting the learning of the student. Besides, as opposed to vanilla model ensembling that requires the separate training of each expert, we propose Adept Snapshot Distillation (ASD). ASD treats the freely available model snapshots generated during the training phase as multiple experts, which significantly reduces the training cost for model ensembling. For each expert teacher, it only provides supervision to the student in the class where it is adept. The resulting Heterogeneous and Adept Snapshot Knowledge Distillation, dubbed HAS-KD, attains state-of-the-art results on ScanNetV2 and S3DIS datasets. HAS-KD can be seamlessly integrated into contemporary 3D segmentation algorithms and bring considerable gains without introducing extra inference burdens. The code will be made publicly available upon publication.

[CV-93] An Integrated Hardware-Software Design for Low-Data Spatial Defect Detection in Robotic Visual Inspection with Hybrid Optoelectronic Neural Networks

链接: https://arxiv.org/abs/2606.25277
作者: Chaoqing Tang,Jiaxuan Li,Huanze Zhuang,Guiyun Tian,Chao Wang,Yihao Ouyang,Wenzhong Liu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To address data overload and inefficient shape-level annotation in robotic visual inspection, this paper proposes a hardware-software integrated optoelectronic architecture. A non-imaging, low-data paradigm is established to minimize annotation dependency. First, a sensor-in-the-loop strategy reconfigures a Digital Micromirror Device (DMD) as a physical optical convolutional layer, enabling photonic-domain feature extraction that unifies sensing hardware and processing software. To suppress data volume at the source, a block-based compressed sensing strategy encodes spatial information into low-dimensional temporal signals, drastically reducing redundancy. Subsequently, to bypass laborious manual defect shape annotation, natural language descriptions guide the network to align with highly generalizable features from Contrastive Language-Image Pre-training (CLIP), steering the attention maps of the optoelectronic neural network toward defect shapes. Furthermore, a Localization Accuracy for Attention (LAA) metric is proposed to quantify shape-level defect localization performance. Experiments on transparent material defect detection validate the system’s effectiveness. Parametric analysis reveals how measurement matrices, compression ratios, and block sizes affect accuracy. Results show that, compared to traditional imaging, the proposed architecture maintains equivalent accuracy while reducing data volume by 90% for Vision Transformers and computational workload by 60% for Convolutional Neural Networks. This low-data paradigm offers an efficient solution for industrial automation scenarios involving massive data streams, high acquisition costs, or constrained edge resources.

[CV-94] CoGeoAD: Hierarchical Color-Geometric Fusion with Multi-View Attention for Zero-Shot 3D Anomaly Detection ICML2026

链接: https://arxiv.org/abs/2606.25273
作者: Ke Xu,Xinle Wang,Yanning Hou,Xueliang Ma,Juan Xie,Jianfeng Qiu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026

点击查看摘要

Abstract:Zero-shot 3D anomaly detection is essential for industrial quality inspection, where labeled anomaly samples are scarce. Meanwhile, existing methods lack an effective mechanism to fuse complementary 2D color images with 3D geometric structures, limiting their ability to detect both surface and structural defects in a unified framework. To address these issues, we propose CoGeoAD, a unified CLIP-based framework that fuses color and geometric features by constructing pixel-aligned paired multi-view images. The framework introduces a Data-Driven Multi-View Attention (MVA) mechanism to adaptively aggregate 3D features and a Multi-Stage Color-Geometric Fusion (MS-CGF) module to hierarchically integrate multi-level features from both modalities. Extensive experiments on the MVTec3D-AD and Eyecandies benchmarks demonstrate that CoGeoAD achieves state-of-the-art performance, effectively capturing both structural and textural anomalies in complex industrial scenarios. our source code is available at this https URL.

[CV-95] Pre-Warm: Input-Conditioned Weight Initialization for Convolutional Neural Networks

链接: https://arxiv.org/abs/2606.25256
作者: Rowan Martnishn
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Pre-Warm, a simple yet effective zero-training-cost method for data-conditioned initialization of the first convolutional layer. Before the first forward pass, Pre-Warm extracts mean-centered local patches from a single training batch, clusters them with MiniBatchKMeans, applies inverse Manhattan spatial weighting, and uses the resulting centroids to initialize half of the first-layer filters (the remainder retain Kaiming initialization). We derive closed-form rules for all hyperparameters except a single insensitive scale parameter, though we derive a Kaiming parity bound on scale from patch dimensionality. For grayscale datasets we use Otsu’s foreground density; for natural color images we use the mean L2 norm of mean-centered patches. Both rules accurately predict the optimal patch count observed in grid search. Across five standard benchmarks – MNIST, Fashion-MNIST, CIFAR-10, SVHN, and CIFAR-100 – and 8-seed paired experiments, Pre-Warm yields statistically significant accuracy improvements over standard Kaiming initialization (p 0.05 on all datasets, p = 0.0007 on SVHN with 8/8 wins, p = 0.0033 on CIFAR-100 with 7/8 wins). The method adds negligible overhead, requires no architectural changes, and integrates into existing training pipelines with only a few lines of code. Pre-Warm demonstrates that even a lightweight, input-dependent signal can meaningfully improve optimization trajectories in modern convolutional networks. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2606.25256 [cs.CV] (or arXiv:2606.25256v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.25256 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rowan Martnishn [view email] [v1] Wed, 24 Jun 2026 00:27:53 UTC (21 KB)

[CV-96] Cross-Modality Structural Guidance in 3D Latent Diffusion for Robust FLAIR Super-Resolution

链接: https://arxiv.org/abs/2606.25255
作者: Haoyu Lan,Jiazhen Zhang,John Onofrey,Bino Varghese,Nasim Sheikh-Bahaei,Arthur W. Toga,Jeiran Choupan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-resolution (HR) MRI acquisition is often hampered by scan time constraints, resulting in anisotropic or low-resolution scans (e.g., thick-slice FLAIR) that limit diagnostic accuracy. While deep learning-based super-resolution (SR) methods show promise, they often hallucinate anatomical details, which can compromise brain structural integrity. To mitigate this limitation, we introduce MR-DiffuSR, a Multi-Resolution Diffusion-based Super-Resolution framework that incorporates HR T1w structural image priors to guide the restoration of thick-slice FLAIR scans and operates in the 3D latent space. Our architecture introduces cross-modality structural swin-attention, which derives structural attention maps from the HR T1w and applies them to the low-resolution FLAIR latent features. This design disentangles anatomical structure from modality-specific contrast, effectively preventing hallucinations. Furthermore, we employ a mixed-scale degradation strategy, training the model on a continuum of downsampling factors to ensure robustness to varying slice thicknesses, while optimizing with a DINOv3-based perceptual loss to preserve high-frequency semantic details. Evaluated on the ADNI-4 dataset, MR-DiffuSR surpasses both CNN and 2D diffusion approaches, achieving an average PSNR of 32.46dB, SSIM of 0.97, and LPIPS of 0.07 across all downsampling factors. In downstream white matter hyperintensity segmentation, our model demonstrates exceptional robustness. While baseline performance collapses at 10x down-sampling (Dice: 0.51), MR-DiffuSR maintains a Dice score of 0.63, preserving utility even at 7mm equivalent slice thickness.

[CV-97] OrthoTrack: Continuous 6-DoF UAV Trajectory Estimation Anchored in Public Orthophotos ECCV2026

链接: https://arxiv.org/abs/2606.25245
作者: Oussema Dhaouadi,Zuria Bauer,Johannes Michael Meier,Olaf Wysocki,Marc Pollefeys,Daniel Cremers
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026 - Project page: this http URL

点击查看摘要

Abstract:Continuous 6-DoF pose estimation is essential for autonomous UAV operations. Yet, existing visual odometry and SLAM methods accumulate drift and yield only relative, up-to-scale trajectories. Single-frame geo-localization, in turn, discards temporal continuity and remains too slow for real-time use. We present OrthoTrack, a training-free system that estimates continuous 6-DoF UAV trajectories using only publicly available orthophotos and surface models as a map prior. OrthoTrack matches keyframes against the orthophoto and lifts correspondences to metric 3D via the surface model. It then propagates these map-anchored correspondences to intermediate frames with optical flow, producing absolute, metrically scaled poses at every frame without GPS or post-hoc alignment. We also introduce the MovingDrone Dataset, a large-scale benchmark pairing photorealistic UAV sequences with dense 6-DoF ground truth and co-registered multi-modal geodata including multi-temporal orthophotos. On MovingDrone and real-world benchmarks, OrthoTrack runs in real time on a single GPU. It outperforms all baselines by a large margin, even those receiving oracle scale and alignment. By relying on publicly available geodata, OrthoTrack enables deployment to new regions without site-specific adaptation.

[CV-98] Structuring Sparsity: Block-Sparse Featurizers Capture Visual Concept Manifolds

链接: https://arxiv.org/abs/2606.25234
作者: Thomas Fel,Matthew Kowal,Mozes Jacobs,Dron Hazra,Usha Bhalla,Lee Sharkey,Lucius Bushnaq,Satchel Grant,Tal Haklay,Thomas Icard,Can Rager,Michael Pearce,Daniel Wurgaft,Aiden Swann,Fenil Doshi,Siddharth Boppana,Curt Tigges,Nick Cammarata,Thomas Serre,Vasudev Shyam,Owen Lewis,Thomas McGrath,Jack Merullo,Ekdeep Singh Lubana,Atticus Geiger
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:What is the geometry of a visual percept? The most widely used protocols for decomposing neural network representations into interpretable parts treat concepts as isolated directions, yet recent work shows that concepts are often realized as geometric structures in low dimensional regions of activation space. We turn to the literature of Structured sparsity to close this gap, and show that block sparsity, which groups directions into blocks, is the prior matched to a generative model in which a representation is a sparse sum of low-dimensional manifolds: the modern, learned form of a classical idea in visual neuroscience, where a visual feature is carried by a coordinated group of neurons rather than a single tuned one. We implement three variants of block-sparse featurizers (BSFs) and, through a minimum-description-length analysis, show that all three describe activations more compactly than direction-based featurizers, with the recovered concepts typically two- to four-dimensional. We then use BSFs to (i) recontextualize prior work, showing that curve detectors in InceptionV1 actually read from a single continuous curve manifold, (ii) discover novel manifolds including shadows and lighting in DINOv3, and (iii) support interpretable control of image generation in diffusion models (SDXL) via manifold steering.

[CV-99] Semantic Allocation in Ordered Bottlenecks: Predictive Residual Inference for Visual Representation Learning ICANN2026

链接: https://arxiv.org/abs/2606.25232
作者: Erik Ayari,Manuel Traub,Martin V. Butz
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICANN 2026 main proceedings. 12 pages, 5 figures

点击查看摘要

Abstract:Ordered bottlenecks aim to provide utility at flexible budgets by assigning coarse information to early tokens and task-relevant detail to later ones. Prior work, including tail dropping (TD), typically enforces ordering by means of a masking-based ordering pressure (MBOP): Late tokens are masked more frequently than early tokens and are therefore encouraged to store less essential fine details. We introduce predictive residual inference for ordered representations (PRIOR), a framework designed to address inherent weaknesses of MBOP. MBOP is prone to weak late-token utility because it lacks an explicit refinement objective and uses gradient exposure as a proxy for importance. Furthermore, representations may become particularly brittle in optimization-sensitive settings, such as when using discrete or quantized token representations. PRIOR replaces activation-rate control with log2-scaled levels and level-wise predictors. These predictors separate already explained from unexplained information, focusing each level on residual error. We compare PRIOR against MBOP-TD and independent tail-biased dropout (MBOP-ITD) in contrastive learning and image reconstruction tasks. Unlike the baselines, PRIOR learns well-ordered representations across experiments: low budgets provide coarse descriptors, while high budgets add refinements. Simultaneously, full-budget performance with PRIOR is higher in all but one experimental setting, where performance remains comparable. MBOP baselines are severely limited in discrete and quantized settings, while PRIOR approaches the performance of continuous counterparts. Taken together, these findings establish PRIOR as an effective framework for ordered representation learning.

[CV-100] MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

链接: https://arxiv.org/abs/2606.25225
作者: Revant Teotia,Adrien Bardes,Michael Rabbat,Sumit Chopra,Matthew J. Muckley,Nicolas Ballas
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality’s representation benefits from the other. Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.

[CV-101] Cage-based Texture Transfer with Geometric Filtering SIGGRAPH2026

链接: https://arxiv.org/abs/2606.25220
作者: Rose Mei Zhou,Lynnette Hui Xian Ng,Adrian Xuan Wei Lim,Conor Griffin,Faraz Baghernezhad
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to SIGGRAPH 2026

点击查看摘要

Abstract:Real-time texture transfer expands the creative horizon for interactive applications, enabling seamless detail projection in scenarios that range from digital character cosmetics to procedural automotive texturing. Yet, its practical application is governed by inherent trade-offs between processing speed and suppression of artifacts. Low-latency transfer methods frequently fail to suppress artifacts, and robust alternatives rely on large-scale models that are costly in training and memory. Our proposed method bridges the gap between efficiency and robustness by using a cage-based geometric filtering method to identify Non-Cosmetic Zones (NCZs) for artifact suppression. While other models are resource-intensive and require multiple days of training on manually annotated datasets, we are able to successfully suppress artifacts and achieve immediate deployment on consumer-grade hardware. Our framework achieved highly efficient runtimes of ~70ms on mobile devices for a ~4.8k triangle mesh.

[CV-102] Homomorphic Encryptions for Privacy Preserving Vision

链接: https://arxiv.org/abs/2606.25216
作者: Preey Shah,Rohan Virani,Sanjari Srivastava
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Legal requirements might prevent organizations from sharing sensitive data like medical or financial details of consumers which prevents them from leveraging cloud based ML-as-a-service solutions provided by third party providers, which are quickly gaining popularity these days. In this project, we aim to perform inference tasks in Computer Vision in a privacy-preserving manner, i.e, by only looking at encrypted data. Recent advances in fully homomorphic encryption make this possible. A fully homomorphic encryption allows an arbitrary sequence of additive and multiplicative operations to be performed on encrypted data directly. Applying homomorphic encryptions to CNNs requires modifying the conventional CNN layers, so that they adhere to the encryption scheme. Our aim was to explore the best methods to create CNNs which can classify encrypted images directly. We used Microsoft SEAL for performing homomorphic encryption. The performance of these “encryption based CNNs” should be comparable with baseline accuracies of the same CNNs trained on unencrypted data, and the aim was to achieve as low of a hit on inference-time performance as possible. We successfully obtained minimal drop in classification accuracy for various datasets. We used MNIST as our baseline, which is popularly used in related research work and then explored more complex datasets like Kuzushiji MNIST, Fashion-MNIST and CIFAR-10 as a part of our contribution. Additionally, we also added support for more complex operations on top of TenSEAL, like processing colored images (multi-channel input), applying multiple convolutional layers and performing average pooling.

[CV-103] Reflective VLA: In-Context Action Consequences Make VLAs Generalize

链接: https://arxiv.org/abs/2606.25215
作者: Qing Lian,Kent Yu,Lei Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Most vision-language-action (VLA) models are reactive: they predict the next action from the current instruction and observation, implicitly assuming that the current observation fully specifies the action-relevant state. In embodied control, however, embodiment-specific factors such as camera-to-robot geometry, robot calibration, or systematic actuation bias are often hard to identify from a single observation. As a result, reactive policies cannot reliably disambiguate these factors in general, overfitting to training environments and generalizing poorly at deployment. We propose Reflective VLA, which conditions each decision on a context of observation-action-consequence triplets. Each triplet records not only what the robot observed and executed, but also how the scene changed afterward, exposing the deployment-specific mapping from actions to observed effects. Architecturally, Reflective VLA routes all observation modalities through the VLM under shared attention, so the action expert reasons directly over past triplets and the current observation. A block-causal mask enables parallel multi-frame training without leakage and supports KV-cached real-time inference. On standard LIBERO and SimplerEnv-Bridge, Reflective VLA preserves strong in-distribution performance. Under distribution shift on LIBERO-Plus and the harder LIBERO-Plus-Hard, it improves average success rate by 5.4 and 4.2 percentage points over a matched reactive baseline. Ablations with a matched history-only baseline further show that action consequences – rather than additional context length alone – are the key to cross-environment generalization. Project page: this https URL

[CV-104] An iterative energy-based multimodal transformer for joint retrieval of wheat soil moisture leaf area index and plant height from Sentinel-1 and Sentinel-2 time series

链接: https://arxiv.org/abs/2606.25174
作者: Shubham Kumar Singh,Peilei Fan,Suraj A. Yadav,Rajendra Prasad,Prashant K Srivastava
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Field-scale retrieval of surface soil moisture (SM), leaf area index (LAI), and plant height (PH) is essential for precision agriculture, yet it remains an ill-posed inverse problem. Concurrent variations in soil moisture and canopy density generate substantial ambiguities in radar backscatter and spectral responses, which reduces the effectiveness of traditional feedforward regression models in heterogeneous smallholder cropping systems. This study presents the Iterative Energy-Based Transformer (iEBT) for the joint retrieval of coupled soil-canopy states from Sentinel-1 C-band SAR and Sentinel-2 multispectral time series. Instead of direct regression, iEBT embeds multi-modal predictors within a shared sequence, produces an initial state estimate, and iteratively updates the target [SM, LAI, PH] vector through normalized gradient descent to minimize a learned scalar compatibility energy function. Using 700 quality-controlled field measurements from Varanasi, India, iEBT achieved the highest learned-model performance on the random test split, with a four-seed mean R^2 of 0.854 \pm 0.012 (R_SM^2 = 0.841, R_LAI^2 = 0.905, R_PH^2 = 0.821). WCM and PROSAIL were retained as physically interpretable SAR and optical reference models for comparison. Modality ablations confirmed that Sentinel-1 drives SM retrieval, while Sentinel-2 dominates LAI, whereas PH relies on combined structural-phenological signatures. Crucially, the model’s terminal energy functions as an uncalibrated post-retrieval quality diagnostic; screening the 10% highest-energy samples markedly reduced target level root-mean-square errors. While leave-one-campaign-out validation highlights persistent cross-season domain shift challenges due to localized management variations, compatibility-guided multimodal fusion offers a structured self-diagnostic path toward reliable biophysical parameter estimation

[CV-105] oward Low-Latency Vision-Language Models with Doubly-Correct Predictions in Egocentric Visual Understanding IROS

链接: https://arxiv.org/abs/2606.25160
作者: Qitong Wang,Fan Du,Pranav Maneriker,Jihui Jin,Christopher Rasmussen
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Intelligent Robots and Systems (IROS) 2026

点击查看摘要

Abstract:The rapid rise of Vision-Language Models (VLMs) in egocentric visual understanding has made low-latency inference in human-robot collaborative (HRC) tasks increasingly critical. Weight pruning techniques developed for VLMs to shrink model size and computation can be readily applied to satisfy the efficiency demands of on-board processing and real-time interactive robotics. Moreover, safe human-robot interaction demands pruning strategies that preserve doubly-correct predictions; outputs must be both accurate and evidentially grounded to mitigate risks and ensure user trust. In this paper, we present a new study of VLM pruning through the lens of doubly-correct prediction. Our experiments surprisingly show that existing pruning methods often preserve the right evidence localization but undermine correct prediction. To address this, we propose a rationale-informed pruning strategy that better aligns evidence with decisions. Benchmark results on egocentric video datasets demonstrate that our method not only achieves the highest prediction accuracy but also outperforms existing approaches in attaining doubly-correct predictions. We aim to stimulate research on efficient and reliable VLMs, ensuring accuracy-driven advances align with the transparency, auditability, and safety required for responsible human-robot interaction and embodied intelligence.

[CV-106] ADM-Fusion: Adaptive Deep Multi-Sensor Fusion for Robust Ego-Motion Estimation in Diverse Conditions

链接: https://arxiv.org/abs/2606.25111
作者: Hasan Moughnieh,Ibrahim Ghaddar,Hadi Elham,Imad H. Elhajj,Daniel Asmar
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Robust multi-sensor fusion is essential for reliable autonomy in diverse and degraded environments, where sensor reliability can fluctuate rapidly. Because different modalities fail in distinct ways, effective fusion should adaptively balance complementary cues rather than rely on fixed weighting. This adaptability is particularly important for ego-motion estimation, since accurate updates depend on the consistent integration of complementary sensor information. We propose ADM-Fusion, an end-to-end deep learning based multi-sensor fusion method designed to adapt to environmental changes and sensor degradation. ADM-Fusion employs an adaptive sensor mixture-of-experts framework with content-aware routing to dynamically assign weights to sensor inputs in real time. The system further incorporates separate translation and rotation branches, coupled through a cross-task attention mechanism to preserve task-specific specialization while enabling information sharing. ADM-Fusion is trained on the CARLA-LOC simulated dataset and subsequently fine-tuned on KITTI real-world data, demonstrating effective simulation-to-real transfer. Experiments show that ADM-Fusion remains robust under degraded conditions while maintaining competitive performance against existing methods.

[CV-107] Machine Learning Modeling for Real-Time Melt Pool Monitoring in Laser Powder Bed Fusion Additive Manufacturing: A Hybrid Approach

链接: https://arxiv.org/abs/2606.23851
作者: Inioluwa Emmanuel,Zhuo Yang,Ho Yeung,Xinyao Zhang
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work investigates the implementation of artificial intelligence and machine learning (AI/ML) for real-time monitoring in laser powder bed fusion (LPBF) additive manufacturing. We developed a binary image classification framework for distinguishing normal and abnormal melt pool images using a balanced dataset of 1,200 images collected from Nickel superalloy 625 on the NIST AMMT platform. The study evaluates accuracy and inference time based on control requirements and hardware limitations of open-architecture LPBF machines. We benchmark three transfer learning architectures (ResNet50, EfficientNetB0, and MobileNetV2) against two Random Forest approaches: one trained on EfficientNetB0 feature embeddings (hybrid) and one trained on raw pixel features (baseline). Images are stratified into 80/20 train-test splits, with a further 90/10 validation split on the training set, and undergo standardized resizing, normalization, and label-preserving data augmentation to emulate realistic process variability. Each model is evaluated using accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC), along with training time, inference latency, and CPU GPU usage to capture deployability constraints relevant to factory-floor monitoring. The hybrid EfficientNetB0-plus-Random Forest approach achieves the best performance on the held-out test set, with an F1 score of 0.9451, accuracy of 0.9458, and AUC of 0.9904, while maintaining sub-millisecond per-image inference (1.15 ms). In contrast, purely deep learning models exhibit significantly higher inference times with lower accuracy. These results demonstrate that combining pre-trained convolutional features with classical ensemble methods provides a robust, computationally efficient route to real-time melt pool anomaly detection in data-limited additive manufacturing environments.

[CV-108] ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

链接: https://arxiv.org/abs/2606.23835
作者: Anindya Mondal,Sauradip Nag,Anjan Dutta
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Under review, webpage: this https URL

点击查看摘要

Abstract:ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training required. Our model is built on existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations. ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.

[CV-109] From Spatial to Spectral: An Efficient Frequency-Guided Feature Representation Learner for Small Object Detection

链接: https://arxiv.org/abs/2606.23825
作者: Yuhan Rui,Shihan Qiao,Yibin Lou,Mingxi Yu,Yutong Wan,Yanqiao Chen,Dongsheng Hou,Zhen Cao,Athena Zhuoming Zhong,Qi Hao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficient small object detection is bottlenecked by the inherent feature scarcity of tiny targets, which is further aggravated by operations of spatial-domain detectors that indiscriminately discard critical high-frequency details. Recovering these fragile cues within the spatial domain is notoriously difficult, as it often requires computationally expensive architectural upscaling that inadvertently amplifies background noise. To bridge this gap, we propose a paradigm \textbfshift from spatial to spectral feature processing, introducing a holistic solution with the following novelty: (1) A versatile \textbfFrequency-Guided Feature Representation framework that generalizes across diverse detector architectures (both CNN and Transformer-based), offering a robust alternative to spatial-only feature extraction; (2) The unified \textbfDecompose–Enhance–Reconstruct (DER) operator, instantiated via three \textbflightweight, plug-and-play modules – Wavelet-Difference Gate (WDG), Log-Gabor Enhancer (LGE), and Frequency-Driven Head (FDHead) – to systematically inject frequency-aware modulation into the backbone, neck, and head. This mechanism decouples feature modeling from resolution reduction, capturing discriminative high-frequency components to enable accurate localization with significantly reduced parameter redundancy; (3) Extensive validation on multi-domain benchmarks (VisDrone2019, UAVDT, TinyPerson, DOTAv1) demonstrating consistent gains. Notably, our proposed \textbfDERNet series outperforms YOLOv11 models under the same scale while requiring \textbfonly 1/6 of the parameters, backed by rigorous spectral diagnostics and error decomposition analysis.

[CV-110] Listening makes Vision Clear for VLMs

链接: https://arxiv.org/abs/2606.23763
作者: Yiyang Chen,Yixin Tan,Binrui Shen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18pages,3 figures

点击查看摘要

Abstract:Recent work typically assesses vision–language consistency using attention distributions of answer-side tokens. However, we observe that highest attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where language priors from previously generated answer tokens accumulate and mismatch with visual attention. Besides the priors from previous answer tokens, we find that structural tokens, e.g., modality boundary markers, may encompass the entire context and generate high attention to areas unrelated to the target. To avoid these distortions and provide consistency evaluation for large VLMs, we adopt prompt-side semantics and propose Prompt-Vision Token Activation Map (PV-TAM). PV-TAM further incorporates a filter to remove systematic bias induced by modality boundary markers. Unlike traditional methods that evaluate overlap solely through masks while ignoring activation intensity, our metrics leverage the peak distribution of attention to measure the alignment between prompts and visual regions. In experiments, PV-TAM consistently improves both attention-based and IoU-style localization metrics over answer-side baselines on various datasets.

[CV-111] Sol Video Inference Engine: Agent -Native Full-Stack Acceleration Framework for Efficient Video Generation

链接: https://arxiv.org/abs/2606.23743
作者: Yitong Li,Junsong Chen,Haopeng Li,Haozhe Liu,Jincheng Yu,Ligeng Zhu,Ping Luo,Song Han,Enze Xie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective acceleration strategy is highly instance-specific: a recipe that works well for one combination of model, hardware, and inference configuration often does not transfer to another. Different models vary in architecture, numerical sensitivity, and attention concentration patterns. Inference settings differ in spatial and temporal resolution and video duration, while hardware platforms differ in memory hierarchy, supported numerical formats, and kernel throughput. These factors create a large tuning space, making manual performance engineering costly. We present Sol Video Inference Engine, an agentic, native, training-free acceleration framework for video diffusion models. It organizes five broadly applicable techniques, cache, sparse attention, token pruning, quantization, and kernel fusion, into an agentic acceleration stack for instance-specific optimization. For a concrete deployment target defined by a model, hardware platform, and serving configuration, parallel skill agents optimize the implementation of each technique, an agent integrator composes them into a global acceleration stack, and a human validator provides feedback on generation quality. We instantiate this workflow on three video models with different sizes and architectures: 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video. With little human effort, the full stack achieves more than 2x end-to-end acceleration while maintaining near-lossless VBench quality, demonstrating the effectiveness of the agent framework for video diffusion acceleration.

[CV-112] Systematic Exploration of 4-Expert Heterogeneous Mixture-of-Experts via Automated Pipeline Search

链接: https://arxiv.org/abs/2606.23739
作者: Yashkumar R Lukhi,Harsh Rameshbhai Moradiya,Radu Timofte,Dmitry Ignatov
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:We present an automated large-scale search pipeline for heterogeneous 4-Expert Mixture-of-Experts (MoE4) architectures within the LEMUR neural network dataset ecosystem. Building on a hand-crafted heterogeneous MoE reference model, we replace manual design with a deterministic code-assembly generator that systematically combines base architecture families drawn from the LEMUR database into MoE4 ensembles, each governed by a convolutional gating network with temperature scaling, mixup augmentation, and cosine-annealed learning rate scheduling. Over a 28-day campaign on an NVIDIA RTX 4090, the pipeline generated 4,463 candidate models across 197 batches, of which 1,021 were evaluated successfully. A critical finding emerged from the campaign: due to alphabetical enumeration via this http URL, the entire explored search space (4.8% of the theoretical 23,751 possible 4-family combinations) is anchored to a single family, AirNet. We characterise this coverage bias precisely, identify the root cause in the generator, and propose a stratified random sampling fix. Within the AirNet anchored scope, ShuffleNet and MobileNetV3 consistently co-produce the highest-accuracy ensembles (mean accuracy up to 0.632), while FractalNet and MNASNet are identified as low-yield families warranting exclusion in future campaigns. The pipeline, analysis artefacts, and corrected generator are released as part of the open-source NNGPT project at this https URL

[CV-113] Curvature-Guided Mixing for MLLM Adaptation ECCV2026

链接: https://arxiv.org/abs/2606.24963
作者: Jinglong Yang,Jiaxuan He,Wenjian Huang,Zhan Zhuang,Jianguo Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ECCV 2026

点击查看摘要

Abstract:Fine-tuning Multimodal Large Language Models (MLLMs) on specialized tasks often leads to catastrophic forgetting of their general capabilities. Existing model merging methods to combat this are often heuristic or use sub-optimal objectives. We propose CurvatureGuided Mixing (CGM), a theoretically grounded framework that merges pre-trained and fine-tuned models. CGM formulates a joint optimization objective and uses a second-order (Hessian) approximation of the loss landscapes to analytically derive an optimal, closed-form “soft mixing” ratio. This ratio intelligently blends parameters based on their relative task-specific curvatures. We also introduce CGM \dagger , a robust “hard mixing” variant that performs sparse parameter selection guided by a novel, curvature-aware score. Experiments on LLaVA-1.5 and Qwen2.5VL across multiple downstream tasks show that CGM and CGM \dagger consistently improve the trade-off between task specialization and general knowledge retention over existing methods. Code is available at this http URL.

[CV-114] SEMIR: Topology-Preserving Graph Minors for Thin-Structure Segmentation ECCV

链接: https://arxiv.org/abs/2606.24935
作者: Luke James Miller,Yugyung Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the European Conference on Computer Vision (ECCV) 2026

点击查看摘要

Abstract:Thin-structure segmentation–power lines, cracks, lane markings at 1-3 pixel width–requires preserving connectivity that standard representations preclude: patching severs continuous structures and conventional superpixels merge thin targets into background before classification. Topology-aware losses penalize connectivity breaks at the objective level but cannot recover what the representation has already destroyed. We propose SEMIR, a framework that replaces the pixel lattice with a parameterized graph minor whose contraction map preserves thin-structure connectivity under the contraction criterion. The minor collapses millions of pixels into tens or hundreds of boundary-aligned supernodes, enabling full-resolution inference without patching at scales demonstrated up to 21 MP in this paper; a lightweight GNN classifies the reduced graph and an exact map lifts predictions to pixel resolution. One pipeline–identical architecture, features, loss, and GNN hyperparameters across all dataset–matches or exceeds domain-specific baselines on TTPLA (power lines), CrackSeg9k (pavement cracks), and SkyScapes Lane (aerial markings) on Dice, IoU, and Boundary F1 while reducing mask fragmentation by at least 4.6x relative to SLIC at matched inference.

[CV-115] FedReLa: Imbalanced Federated Learning via Re-Labeling

链接: https://arxiv.org/abs/2606.26037
作者: Guangzheng Hu,Patricia Menéndez,Feng Liu,Mingming Gong,Guanghui Wang,Liuhua Peng
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Federated learning has emerged as the foremost approach for decentralized model training with privacy preservation. The global class imbalance and cross-client data heterogeneity naturally coexist, and the mismatch between local and global imbalances exacerbates the performance degradation of the aggregated model. The agnosticism of global class distribution poses significant challenges for data-level methods, especially under extreme conditions with severe class absence across clients. In this paper, we propose FedReLa, a novel data-level approach that tackles the coexistence of data heterogeneity and class imbalance in federated learning. By re-labeling samples with a feature-dependent label re-allocator, FedReLa corrects biased global decision boundaries without requiring knowledge of the global class distribution. This modular, model-agnostic approach can be integrated with algorithmic methods to deliver consistent improvements without additional communication overhead. Through extensive experiments, our method significantly improves the accuracy of minority classes and the overall accuracy on stepwise-imbalanced and long-tailed datasets, outperforming the previous state of the art.

[CV-116] Hybrid deep learning-based phase diversity method for wavefront reconstruction

链接: https://arxiv.org/abs/2606.25855
作者: Y. Rodimkov,A. Kotov,K. Burdonov,S. Perevalov,V. Volokitin,I. Meyerov,A. Soloviev
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
备注: 13 pages, 10 figures. The following article has been submitted to Review of Scientific Instruments. After it is published, it will be found at this https URL

点击查看摘要

Abstract:The efficiency of high-power laser systems is limited by wavefront distortions in the beam, particularly non-common path aberrations, which reduce the peak intensity at the focal plane. Compensating for these aberrations requires the calibration of the adaptive optics system. Conventional calibration methods rely on a time-consuming iterative optimization that is highly sensitive to initial conditions. While deep learning-based models offer high speed, they often demonstrate insufficient accuracy. In this work, we present a hybrid wavefront reconstruction method that combines a convolutional neural network to generate an initial estimate of the wavefront distortions, with the L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) algorithm for its subsequent refinement. In numerical simulations, the method achieved an efficiency of \sim 0.99 in 80% of the cases for a root-mean-square (RMS) of wavefront distortions ranging from 0 to 1.3\lambda . In a physical experiment, for initial wavefront distortions with RMS values from 0.15 to 0.6\lambda , the method achieved an efficiency of \sim 0.75 . As a result, focusing with a Strehl ratio of 0.96 \pm 0.02 was attained within 2 to 4 iterations of the algorithm, confirming the applicability of the method for the fast and accurate calibration of adaptive optics systems under real experimental conditions.

[CV-117] Cross-Attention Multimodal Learning for Predicting Response to Neoadjuvant Imatinib in Gastrointestinal Stromal Tumors: A Multicenter Retrospective Study

链接: https://arxiv.org/abs/2606.25579
作者: Fariba Tohidinezhad,Douwe J. Spaanderman,Natalia Oviedo Acosta,Kaouther Mouheb,Karthik Prathaban,David F. Hanff,Dirk J. Grünhagen,Cornelis Verhoef,Joris M. van Sabben,Evelyne Roets,Jette J. Slettenhaar,Hans Gelderblom,Ingrid M.E. Desar,Anna K.L. Reyners,Neeltje Steeghs,Stefan Klein,Martijn P.A. Starmans
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Response to neoadjuvant imatinib in gastrointestinal stromal tumors (GISTs) is highly variable and cannot be reliably predicted using current clinical or molecular markers. This study developed and evaluated an explainable multimodal deep learning framework integrating computed tomography (CT) imaging and clinical variables to predict treatment response. Methods: Patients from four tertiary centers were retrospectively included between 2000-2023 in independent pretraining (n=935) and prediction (n=213) cohorts. A cross-attention framework integrating clinical variables and tumor-centered CT imaging was developed to predict response to neoadjuvant imatinib. Two training strategies were evaluated: (1) self-supervised pretraining with low-rank adaptation and (2) training from scratch. Hyperparameters were optimized using SMAC3. Performance was assessed through internal cross-validation and external testing. Ablation analyses and attention-based explanations were used to quantify modality contributions. Results: Among 213 patients (54.5% responders), responders had larger tumors (112 vs. 89 mm, P=0.026), higher mitotic index (3 vs. 0, P0.001), and more frequent KIT mutations (69.0% vs. 56.7%, P=0.019). Cross-attention models achieved the highest internal performance (AUC up to 0.99) but lower external performance (AUC 0.60-0.63). Clinical-only performance was moderate (AUC 0.66), whereas imaging-only models showed limited generalizability (AUC 0.56-0.66). Explainability analyses identified significant differences in feature importance between responders and non-responders, including CD117, BRAF, PDGFRA, age, sex, disease status, and comorbidities (FDR-adjusted P=0.036). Conclusion: The cross-attention framework shows potential for improving imatinib response prediction in GIST while providing interpretable insights into multimodal determinants of treatment response.

人工智能

[AI-0] InSight: Self-Guided Skill Acquisition via Steerable VLAs

链接: https://arxiv.org/abs/2606.24884
作者: Maggie Wang,Lars Osterberg,Stephen Tian,Ola Shorinwa,Jiajun Wu,Mac Schwager
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project website: this https URL

点击查看摘要

Abstract:Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., “move gripper to the bowl”, “lift upward”, “pour the bottle”). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: this https URL.

[AI-1] OpenThoughts-Agent : Data Recipes for Agent ic Models

链接: https://arxiv.org/abs/2606.24855
作者: Negin Raoof,Richard Zhuang,Marianna Nezhurina,Etash Guha,Atula Tejaswi,Ryan Marten,Charlie F. Ruan,Tyler Griggs,Alexander Glenn Shaw,Hritik Bansal,E. Kelly Buchanan,Artem Gazizov,Reinhard Heckel,Chinmay Hegde,Sankalp Jajee,Daanish Khazi,Emmanouil Koukoumidis,Xiangyi Li,Hange Liu,Shlok Natarajan,Harsh Raj,Nicholas Roberts,Ethan Shen,Nishad Singhi,Michael Siu,Ashima Suvarna,Hanwen Xing,Patrick Yubeaton,Robert Zhang,Leon Liangyu Chen,Xiaokun Chen,Steven Dillmann,Saadia Gabriel,Xunyi Jiang,Anurag Kashyap,Boxuan Li,Yein Park,Minh Pham,Sujay Sanghavi,Lin Shi,Ke Sun,Yixin Wang,Zhiwei Xu,Erica Zhang,Siyan Zhao,Wanjia Zhao,Jenia Jitsev,Alex Dimakis,Benjamin Feuer,Ludwig Schmidt
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic language models dramatically expand the applications of AI yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at this http URL to support future open research on agentic model training.

[AI-2] World Models in Pieces: Structural Certification for General Agents ICML2026

链接: https://arxiv.org/abs/2606.24842
作者: Yikai Lu,Yifei Wu,Xinyu Lu,Tongxin Li
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, camera-ready version in ICML 2026

点击查看摘要

Abstract:In the big-world regime, agents cannot be universally capable and their ability is inevitably specialized across a world model in pieces. Consequently, standard uniform guarantees fail to distinguish between the understanding of critical bottlenecks and irrelevant failures. We first formalize this limitation by proving that general agents are not universal, rendering standard worst-case analysis uninformative. To overcome this, we introduce structural certification, a transition-local framework that maps bounded goal-conditioned performance to entry-wise guarantees on the agent’s internal world model. Our main contribution is constructive. We provide algorithms that filter specific transitions using deep compositional goals and prove that a general agent on these goals has a structural world model with a \mathcalO(1/n) + \mathcalO(\delta) error bound. Conversely, this bound is tight in the small- \delta regime, whose existence is explicitly guaranteed by our certification. These results enable the certifiable deployment of general agents by localizing the specific transitions where long-horizon planning is reliable.

[AI-3] Grading the Grader: Lessons from Evaluating an Agent ic Data Analysis System

链接: https://arxiv.org/abs/2606.24839
作者: Tian Zheng,Kai-Tai Hsu
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Agentic data analysis systems produce rich outputs, including code, numerical results, and verbal diagnostics. This makes them more challenging to evaluate than single-turn LLM responses. It is therefore necessary to distinguish genuine disagreement between an agent’s output and a ground-truth answer from grading artifacts. We investigate how reliably automated graders assess such a system and what strategies improve grading quality by applying LAMBDA, a multi-agent data-analysis system, on 153 numerical QRData tasks from DSGym. We develop and evaluate a three-layer human-AI grading cascade: strict regex matching, LLM-based lenient grading, and snippet-based human inspection, which combines non-GenAI and GenAI strategies with different failure profiles. Both automated graders achieve 100% observed precision (0/70 false positives). The lenient grader’s recall is 97% against human labels. A keyword-anchored extraction pipeline raises the strict grader’s recall by 60 percentage points over a last-number heuristic; the lenient grader is architecturally parser-independent. An iterative nudge mechanism raises grading run success from 36% to 97% and lenient-pass rates from 16% to 46%; comparing nudging with and without original-question re-injection shows that re-injection offers no benefit, confirming the nudge as an answer template cue. We further observe in this case study that variable type is the task metadata field most consistently associated with grading pipeline dynamics and observed outcome grades.

[AI-4] Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment SIGDIAL2026

链接: https://arxiv.org/abs/2606.24834
作者: Ali Pourghasemi Fatideh,Wilder Baldwin,Maria Dhakal,Collin McMillan,Sepideh Ghanavati
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures. Accepted to SIGDIAL 2026 (27th Annual Meeting of the Special Interest Group on Discourse and Dialogue)

点击查看摘要

Abstract:LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of these conversations when handling Non-Functional Requirements (NFRs), which are inherently vague, context-dependent, and involve many parts of a program. Evaluating how well these systems support collaborative reasoning about NFRs requires methods that go beyond single-turn accuracy to capture both the correctness of the system’s outputs and the quality of the multi-turn interaction. In this paper, we investigate the accuracy and quality of multi-turn conversations between developers and an LLM-based agent in the domain of Health Insurance Portability and Accountability Act (HIPAA) regulatory compliance. We hired 49 programmers to interact with GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase, a system designed to comply with HIPAA regulations, across three dimensions: requirement satisfaction level, reasoning, and code localization. We find that developers tend to agree with LLM assessments, but accuracy against expert ground truth is low. We model user satisfaction and find that longer system responses and more information-providing turns negatively affect user satisfaction, whereas proactive interactions positively affect it. Our findings provide insights for designing LLM-based dialogue systems that support NFR assessment.

[AI-5] Difference-Making without Making a Difference

链接: https://arxiv.org/abs/2606.24832
作者: Sander Beckers
类目: Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Over a series of seven papers, Andreas Günther have introduced seven definitions of actual causation and have classified them as belonging to three different, competing, types of accounts: factual difference-making, counterfactual difference-making, and regularity-based. I show that their most recent - factual difference-making - definition instantiates all three types, thereby proving that these are distinctions without a difference. I further compare their novel account to the other six accounts on several crucial examples, revealing that this undermines all seven of their accounts.

[AI-6] Solving Inverse Problems of Chaotic Systems with Bidirectional Conditional Flow Matching

链接: https://arxiv.org/abs/2606.24824
作者: Peiyan Hu,Jian Zhang,Jiashu Pan,Ruiqi Feng,Tao Zhang,Zhi-Ming Ma,Yuan-Sen Ting,Gongjie Li,Tailin Wu
类目: Artificial Intelligence (cs.AI)
备注: 50 pages, 17 figures

点击查看摘要

Abstract:Modeling chaotic systems is crucial yet challenging. Inverse problems in chaotic dynamics, namely inferring initial conditions from final states, remain largely unsolved because of ill-posedness, non-uniqueness, instability, and potentially chaotic time-reverse dynamics. We address this open problem with Bidirectional Conditional Flow Matching (Bi-CFM), which learns bidirectional mappings between distributions of initial and final states to capture the stochasticity of chaotic evolution and mitigate exponential error accumulation over time. Furthermore, for systems with conservation laws, we extend it to Conservation-constrained Bi-CFM (CBi-CFM). Across the classic Lorenz, Circuit, and high-dimensional Lorenz 96 systems, Bi-CFM improves five distribution-level metrics over baselines while achieving a speedup of more than two orders of magnitude. In the three-body planet-planet scattering problem in planetary dynamics, CBi-CFM better respects conservation laws, with conservation errors comparable to those of the ground truth. Finally, on real observations of globular clusters, collisional million-body systems shaped by \sim 10^10 years (10 Gyr) of evolution, our method represents an advance in accuracy, establishing a scalable route to solving inverse problems of long-timescale real-world chaotic dynamics.

[AI-7] Grad Detect: Gradient-Based Hallucination Detection in LLM s ICML2026

链接: https://arxiv.org/abs/2606.24790
作者: Anand Kamat,Daniel Blake,Brent M. Werness
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 2nd Workshop on Compositional Learning at ICML 2026, Seoul, South Korea. Copyright 2026 by the author(s)

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet they remain prone to generating hallucinations. Detecting these hallucinations is critical for deploying LLMs reliably in high-stakes applications. We present Grad Detect, a gradient-based approach for predicting hallucinations by analyzing layer-wise gradient patterns from a single forward-backward pass during inference. Our method shows that the internal gradient structure of a model carries rich information about the correctness of its output. This information is not accessible through output-level signals alone. We evaluate Grad Detect on several QA benchmarks across both hallucination detection and model abstention prediction, where it consistently outperforms confidence-based and sampling-based baselines. Through comprehensive layer ablation studies across all eleven models from four architectural families, we find that the final five layers concentrate over 97% of the discriminative gradient signal, enabling efficient deployment with minimal performance loss. Grad Detect provides a unified framework for predicting multiple dimensions of LLM reliability, offering strong predictive performance alongside interpretable insights into where and how model failures originate.

[AI-8] BluTrain: A C/CUDA Framework for AI Systems

链接: https://arxiv.org/abs/2606.24780
作者: Adhitya Charan,Adwaid Suresh,Anuj Kumar,Aparna A,Dhanakumar K,Dharun M S,Dinesh G,Goutham Kumar Reddy K,Harshini V M,Jenifa D,Jona Delcy C A,Kathirvel S,Killi Uma Maheswara Rao,Kiruthik Kanna M,Kurra Vishnu Sai,Madhumithaa G K,Navin Kumar V,Ram Charan Golla,Revathi T,Rishikkanth R,Sanjay Krishna M V,Surendra Vendra
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Progress in deep learning is, at scale, more a matter of systems engineering than of modelling: the behaviour of a model in training (its throughput, its memory footprint, and the numerical fidelity of the result) is determined less by the architecture itself than by how that architecture is expressed on the hardware. To achieve absolute control over this hardware expression while abstracting away systems complexity to make modelling seamless and eliminating the need for repetitive orchestration logic, BluTrain was architected from first principles as a robust, lightweight, and architecture-general training framework in standard C++ and the core CUDA programming model. Every layer is implemented natively: a typed tensor module with reverse-mode autograd, a linear-algebra library, a caching allocator, a multi-mode distributed-execution module, and an MLIR-based deep-learning compiler. In formal evaluations training a 124M-parameter GPT-2 baseline in FP32 on an 8-GPU 6000 Ada system, BluTrain outperforms industry-standard baselines in both throughput (sustaining an average of 407K tokens/s versus PyTorch’s 395K tokens/s) and memory efficiency (achieving up to a 22% footprint reduction), while strictly preserving numerical fidelity and converging to a marginally lower final validation loss. With every layer explicitly open to native tuning, the performance ceiling is the framework’s own to raise.

[AI-9] Context-Aware Prediction of Student Quiz Performance with Multimodal Textbook Features

链接: https://arxiv.org/abs/2606.24770
作者: Samin Khan
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Educational platforms often predict student performance from prior interactions, but the assessment content itself also varies in linguistic and visual complexity. This paper studies whether lightweight content features extracted from CourseKata chapter-review questions improve prediction of end-of-chapter quiz scores beyond a student’s average prior exercise performance. The study combines 2023 CourseKata student response data with chapter-level text features from review-question wording and image features from textbook visuals. Across 4,742 student-chapter observations from 562 class-student IDs, adding content features improves student-grouped five-fold quiz prediction performance by 9.1% relative to a prior-performance baseline. In leave-chapter-out validation, text features reduce prediction error relative to the baseline, while image-containing models have higher error. This paper suggests that a context-aware model adds useful signal about the text and visual features of questions to better predict student quiz performance compared with using past student performance alone.

[AI-10] Helpful or Harmful? Evaluating LLM -Assisted Vulnerability Patching via a Human Study

链接: https://arxiv.org/abs/2606.25973
作者: Giulian Biolo,Michael Tezza,Yuanjun Gong,Fabio Massacci
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures

点击查看摘要

Abstract:Software vulnerability remediation is a cognitively demanding task that requires specialized security expertise often lacking in general developers. In the meantime, Large Language Models (LLMs) assisted tools show potential in vulnerability detection, location, and repair tasks. [Hypothesis:] While LLM-assistance is hypothesized to accelerate patching, it also risks introducing hallucinations or insecure code, leading to a higher likelihood of generating superficial repairs that bypass the standard functionality checks but fail the security validation. [Objective:] We aim to present an empirical experiment, unveiling the capability of LLM-assisted vulnerability patching compared to manual debugging on human participants in real-world scenarios. [Method:] We plan to conduct a controlled experiment using a Balanced Crossover design. For that, we have developed a WebApp for code execution and integrated hidden Ghost Tests to verify patch integrity beyond visible functional requirements. The experiment involves training and evaluation scenarios. The remediation speed, remediation efficacy for both standard functionality tests and security tests, and participant perception will be evaluated. [Pilot Study:] A pilot experiment with a small sample of participants has been conducted, providing insights for the following study.

[AI-11] WinDOM: Self-Family Distillation for Small-Model GUI Grounding

链接: https://arxiv.org/abs/2606.25964
作者: Chengheng Li-Chen,Zhiqian Zhou,Hao Chen,Nicolas Chauvin
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Small ( \sim 2B) GUI-grounding agents are attractive for on-device deployment, accessibility tooling, and low-cost iteration, but at this scale they face two open recipe questions: how to obtain bounding-box training data without expensive human annotation, and how to combine supervised fine-tuning with reinforcement learning. We address both, with the explicit goal of pushing small-model performance rather than scaling up. WinDOM is a 54,425 -record grounding corpus harvested by driving an open-source Windows 11 web reimplementation under headless Playwright, with bounding boxes read directly off the DOM and no OCR or human annotation. Self-Family Distillation (SFD) is a single rejection-sampling cold-start parameterised only by the teacher choice: either an EMA of the student (no external model) or a frozen larger same-family teacher. We then treat the saturation depth of the SFD cold-start as an explicit GRPO hyperparameter. On a Qwen3.5-2B student, the under-saturated cold-start is a better GRPO initialiser than the converged one: SFD-4B with Early-init RL gains +5.4 OOD-mean ( +3.5 ScreenSpot-Pro, +7.0 OSWorld-G, +5.8 ScreenSpot-V2) over the base. The same-size EMA mode lands within roughly one OOD-mean point of the cross-size 4 B variant ( 65.2 vs 66.3 ) without an external teacher.

[AI-12] Agent ic System as Compressor: Quantifying System Intelligence in Bits

链接: https://arxiv.org/abs/2606.25960
作者: Zihan Qin,Hongrui Zhang
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are turning from isolated predictors into agentic systems: they call tools, retrieve evidence, obey environment constraints, use verifiers, and complete tasks through search and multi-turn interaction. We adopts an analytical viewpoint based on “compression is intelligence”: under a fixed task distribution, interface, and compute budget, a stronger agentic system lets a target object be reconstructed with fewer bits. We operationalize the measure with arithmetic coding, seed coding, and a fallback, and evaluate it in five settings: reversed text, chess moves, protein sequences, retrieval-augmented question answering, and semantic story compression; in all of them agentic components reduce codelength. These small, controlled experiments cover component types typical of real agentic systems, show that codelength can analyze how components, observers, and budgets change residual uncertainty, and offer guidance for evaluating real agent systems.

[AI-13] AI-Assisted Computational Reproducibility on the FABRIC Testbed

链接: https://arxiv.org/abs/2606.25879
作者: Komal Thareja,Paul Ruth,Berent Aldikacti,Michael Zink
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computational reproducibility remains difficult despite being central to scientific research. In this paper, we show how the international FABRIC testbed, combined with large language model (LLM) coding assistants through LoomAI, can simplify reproducing published experiments across multiple domains. We reproduced three case studies on FABRIC, covering BBR-family congestion-control evaluations, LAMMPS molecular dynamics scaling benchmarks on a CPU-only MPI cluster, and stress protein homeostasis genomics pipelines. Rather than focusing only on matching numerical outputs, we evaluate whether the reproduced experiments support the same scientific conclusions as the original studies. The AI assistant was effective in setting up the environment, adapting code, and debugging, but struggled with the analysis stages that lacked clearly defined workflows, which required human guidance to establish execution order and data dependencies. Across the case studies, the AI-assisted workflow reduced reproduction effort by roughly 4–6 times. We conclude with practical recommendations for improving AI-assisted reproducibility on research testbeds. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.25879 [cs.DC] (or arXiv:2606.25879v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2606.25879 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-14] When CQs Go Wrong: Challenges in CQ Verification with OE-Assist ESWC

链接: https://arxiv.org/abs/2606.24619
作者: Anna Sofia Lippolis,Mohammad Javad Saeedizade,Robin Keskisärkkä,Aldo Gangemi,Eva Blomqvist,Andrea Giovanni Nuzzolese
类目: Artificial Intelligence (cs.AI)
备注: Acceted poster at this https URL 23rd European Semantic Web Conference (Satellite Event)

点击查看摘要

Abstract:Competency Questions (CQs) are the central component of CQ-verification, an established process in which an ontology is evaluated against a set of natural language questions to determine whether the intended purpose of the ontology has been properly modelled. However, CQ-verification is often time-consuming and error-prone, as it requires careful interpretation of linguistic nuances and precise alignment with formal ontology constructs. Ambiguities and complexity in CQs can further complicate this process, leading to inconsistent modelling decisions and verification outcomes. In this paper, we investigate what makes a CQ challenging and possible solutions to enhance the users’ performance in the CQ-verification process. We experimented with the data of 19 participants who performed CQ-verification on 20 tasks using an LLM assistant to support ontology evaluation. The results show the necessity of a tool to refine CQs before publishing them to avoid ambiguity or excessive complexity in later phases of the ontology engineering process.

[AI-15] Abstractions of Queries in Ontology-Based Data Access KR2025

链接: https://arxiv.org/abs/2606.24618
作者: Michel Leclère,Marie-Laure Mugnier,Guillaume Pérution-Kihli
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Extended version of a paper published in the proceedings of KR 2025

点击查看摘要

Abstract:In ontology-based data access (OBDA), multiple data sources are integrated via mappings to an ontology. We consider an OBDA setting based on existential rules and the certain answer semantics. We address the recent issue of query abstraction, which consists of abstracting data queries by translating them to the ontology layer. Since a perfect abstraction may not exist, the notions of minimally complete and maximally sound abstractions have been introduced. We study abstractions within an extension of UCQs with a limited form of inequality and a special predicate marking database constants. While this extension does not lead to an increased complexity of the problems of interest, it is able to express minimally complete abstractions, hence perfect abstractions when they exist. We also characterize maximally sound abstractions by making a new connection with the notion of maximum recovery stemming from data exchange. Comments: Extended version of a paper published in the proceedings of KR 2025 Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB) Cite as: arXiv:2606.24618 [cs.AI] (or arXiv:2606.24618v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.24618 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.24963/kr.2025/43 Focus to learn more DOI(s) linking to related resources

[AI-16] AI Tokenomics: The Economics of Tokens Computation and Pricing in Foundation Models

链接: https://arxiv.org/abs/2606.24616
作者: Quanyan Zhu
类目: Artificial Intelligence (cs.AI); Performance (cs.PF); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Tokens have become the practical accounting unit for modern foundation model services, linking information processing, computation, memory use, energy expenditure, pricing, and economic value. This paper develops a framework for AI tokenomics: the study of how tokens are generated, consumed, priced, allocated, and optimized across AI systems. We connect token-level technical costs to workflow-level production functions, enterprise resource allocation, measurement and instrumentation methods, and emerging market-design questions. The framework shows that token expenditure and economic value are distinct: value depends on marginal productivity, workflow position, hidden reasoning activity, risk, and downstream propagation effects. The paper concludes by identifying open research directions in hidden-token measurement, empirical calibration, token productivity, dynamic allocation, and token-based markets.

[AI-17] ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling

链接: https://arxiv.org/abs/2606.24605
作者: Tianbao Ma,Chang Xi,Yichuan Zou,Chengen Li,Linxun Chen,Zilong Lu,Yanan Niu,Zhaojie Liu,Han Li,Kun Gai
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate user modeling often depends on rich interaction histories, which are unavailable for billions of low-activity users. Large Language Models (LLMs) can infer latent user states from static profiles, but this reasoning becomes unreliable when profiles are sparse, and applying an LLM to billions of users is prohibitively expensive. We present ScaleToT, which learns structured reasoning from a small LLM-processed subset and extends it to the broader low-activity user population. To improve reasoning reliability, ScaleToT constructs typed user-state chains with a bounded entropy-guided Tree-of-Thought (ToT) refinement procedure. To make this structured reasoning usable from sparse profiles, the teacher-curated chains are used to train a student model on static profiles through supervised fine-tuning (SFT) and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO). ScaleToT then transfers the student’s reasoning representations to a lightweight profile encoder, providing shared reasoning signals for the remaining users without LLM inference. We evaluate ScaleToT on lifetime value (LTV) prediction in a billion-scale advertising deployment. A randomized online A/B test increased LT30 by 6.738%, while offline reasoning covered only 7.32% of the potential population, greatly reducing compute cost compared with full-population reasoning.

[AI-18] Uncertainty-Aware Longitudinal Forecasting of Alzheimers Disease Progression Using Deep Learning

链接: https://arxiv.org/abs/2606.24604
作者: Arya Hariharan,Shreyank N Gowda,Anala M R
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Longitudinal modelling of Alzheimer’s disease progression is clinically useful only if it can describe not just the most likely next diagnosis, but how a patient may evolve over time and how reliable that forecast is. Most deep learning approaches reduce this problem to single-step classification, treating cognitively normal, mild cognitive impairment, and dementia as flat categories while providing limited insight into how uncertainty accumulates across future visits. We propose a probabilistic framework that combines ordinal diagnosis prediction, multi-horizon trajectory generation, and decomposed uncertainty estimation. A Temporal Fusion Transformer encoder is adapted with a CORAL ordinal output layer, asymmetric loss weighting, and converter oversampling to respect disease-stage ordering and improve sensitivity to MCI-to-dementia transitions. Conditioned on the learned patient-context representation, an autoregressive Mixture Density Network generates five-year probabilistic trajectories for diagnosis state, CDR Sum of Boxes, MMSE orientation, and hippocampal volume. On ADNI, the model outperforms linear, recurrent, and transformer baselines for next-visit diagnosis prediction, with the strongest gains on MCI-versus-dementia discrimination. Generated trajectories achieve near-nominal 90% credible interval coverage, widening uncertainty across the forecast horizon, and biomarker dynamics consistent with expected Alzheimer’s disease progression. We further separate aleatoric from epistemic uncertainty using analytic mixture variance and a five-member bootstrap ensemble, which provides the strongest encoder diversity and output-level epistemic signal. Epistemic uncertainty is higher for rare progression archetypes, MCI and dementia patients, and under external evaluation on OASIS-3, where it increases alongside prediction error.

[AI-19] ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

链接: https://arxiv.org/abs/2606.24601
作者: Anurag Akula,Satheesh K. Perepu,Abhishek Sarkar,Kaushik Dey
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at RLC 2026 conference

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target domains in MARL; however, the majority of existing approaches impose the constraint that the dimensionalities of the observation space and the global state space must be identical across domains. In this paper, we introduce a method that explicitly accommodates mismatched state-space dimensionalities between source and target domains. The proposed approach, ASALT, incorporates both observation-level and state-level adapters that map the target-domain observations and global states into a shared embedding space, thereby enabling more effective transfer of knowledge across both actors and critics. These adapters can generate embeddings that support efficient strategy transfer across heterogeneous domains. Experimental results on multiple configurations in standard benchmark environments demonstrate that ASALT surpasses existing baselines in terms of sample efficiency and global return in cooperative settings, but its effectiveness depends on the degree of mismatch between source and target domains. Furthermore, our findings indicate that ASALT mitigates negative transfer, which frequently constitutes a major obstacle when transferring policies between domains with differing observation and action spaces.

[AI-20] oward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines

链接: https://arxiv.org/abs/2606.24598
作者: Yimo Lin,Zhen Zhang,Yibin Li
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While expert-validated “LLM + script” workflows deliver significant value, they remain static: they encode hard-won domain knowledge yet fail to adapt execution based on feedback. Existing agent research predominantly targets greenfield agents and synthetic benchmarks, leaving the migration of active legacy workflows unresolved. To bridge this gap, we present a reversible, Strangler-Fig migration path that refactors legacy workflows into composable, typed, and auditable stages. Central to this framework is a three-tier convertibility taxonomy (A/B/C), implemented as a routing stage within the system harness, which diagnoses a workflow’s readiness and routes it accordingly.

[AI-21] LLM s Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLM s in Criminal Legal Context

链接: https://arxiv.org/abs/2606.24585
作者: Anastasiia Kucherenko,François Brouchoud,Dimitri Percia David,Andrei Kucharavy
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While the validity of LLMs’ use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for translation and reformulation. However, even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several modern small LLMs that are most likely to be used as on-device assistants, to assess the impact of overrefusal on legal prompts. Surprisingly, we find that authority-style prefixes (you are acting as an assistant of the national supreme court'', […] defense lawyer’') systematically increase refusal rates by 2–20x over the no-prefix baseline, while a known role-play jailbreak prefix shows mixed effects, sharply increasing refusals in some models and barely shifting them in others. The finding suggests that small on-prem deployable LLMs are unstable under contextual framings that a real institutional user might naturally introduce, and further investigation is essential to minimize opportunities for bias. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.24585 [cs.AI] (or arXiv:2606.24585v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.24585 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-22] Quant Convergence: Bridging Classical Value Investing and Modern Factor Models for Systematic Equity Selection

链接: https://arxiv.org/abs/2606.24575
作者: Augusto Eiji Yamazaki,Hugo Garrido-Lestache Belinchon
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern finance relies heavily on complex machine learning models to find patterns in the stock market. However, as these AI models get more complicated, they often memorize short-term market noise instead of finding companies with real, lasting value. We designed this research to test if Benjamin Graham’s classic value investing rules could act as a mathematical “low-pass filter” to keep these modern models in check. We built three different sets of features - pure Graham rules, modern market factors, and a mix of both - and tested them against highly complex models (XGBoost and AutoGluon) using 20 years of SP 500 data. By applying a strict buy-and-hold strategy over a four-year test period (March 2022 to March 2026), the results showed that more complex algorithms do not always win. While the AutoGluon model captured high returns (222.68%), it suffered a substantial 39.78% drop because it bought volatile tech stocks right before the market crashed. On the other hand, the pure Graham Random Forest achieved the highest overall return (232.13%) with much less risk (1.38 Calmar Ratio). Furthermore, the Combined Random Forest successfully mixed momentum with Graham’s rules, making a 202.91% return while keeping the lowest maximum drop (34.53%) of any model tested. Ultimately, this research proves that Graham’s “margin of safety” isn’t outdated; it is actually a highly effective way to prevent modern AI from taking on too much risk.

[AI-23] GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

链接: https://arxiv.org/abs/2606.24551
作者: Xiao Zhou,Siyue Zhang,Yilun Zhao,Jinbiao Wei,Tingyu Song,Arman Cohan,Chen Zhao
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark of 440 desktop tasks across 18 applications and 12 workflow categories, where screen-only GUI agents and skill-mediated CLI agents receive identical goals, states, and final-state verifiers while being restricted to modality-native actions. In this controlled setting, the strongest GUI agent reaches a 59.1% full pass rate, outperforming the strongest original-skill CLI agent at 48.2%; however, verifier-guided skill augmentation raises CLI success to 69.3%, showing that much of the CLI deficit comes from incomplete skill coverage rather than model capability alone. These results suggest that GUI and CLI expose different execution bottlenecks: GUI agents are limited by reliable grounded interaction over long-horizon workflows, whereas CLI agents are limited by the coverage and scalability of their skill interfaces.

[AI-24] Governed Shared Memory for Multi-Agent LLM Systems

链接: https://arxiv.org/abs/2606.24535
作者: Yanki Margalit,Nurit Cohen-Inger,Erni Avram,Ran Taig,Oded Margalit
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent LLM environments require robust mechanisms for shared knowledge management. This paper formalizes the fleet-memory problem and identifies four foundational failure modes: unauthorized leakage, stale propagation, contradiction persistence, and provenance collapse. To address these, we define explicit systems-level primitives: scoped retrieval, temporal supersession, provenance tracking, and policy-governed memory propagation. These primitives are implemented in MemClaw, a production multi-tenant memory service, and evaluated via ArgusFleet, a reproducible harness testing four governance dimensions. Rather than a baseline comparison, this study measures a live production service, emphasizing real-world architectural insights and negative results. Key Evaluation Results Provenance: Successfully reconstructed 100% of depth-four derivation chains with correct writer identity at sub-second per-hop latency. Propagation: Demonstrated high intra-fleet visibility with zero cross-fleet leakage. Under strong write mode, write-to-visible latency was optimized to a single search round-trip. Production Architectural Issues Discovered Asymmetric Scope Enforcement: Tenant isolation held, but sub-tenant scope was initially bypassed on direct GET-by-id requests for agent-scoped credentials (disclosed and remediated during the study). Pipeline Ordering Conflict: While contradiction supersession works for admitted writes, a synchronous near-duplicate gate can prematurely reject contradictory writes before the asynchronous contradiction detector can evaluate them. Conclusion: Long-context retrieval alone is insufficient for production multi-agent memory. Governed shared memory demands explicit systems-level abstractions, and live evaluation is vital to expose enforcement and pipeline-ordering failures missed by design-only treatments.

[AI-25] A Fair Evaluation of Graph Foundation Models for Node Property Prediction ICML2026

链接: https://arxiv.org/abs/2606.24509
作者: Oleg Platonov,Gleb Bazhenov,Dmitry Eremeev,Liudmila Prokhorenkova
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Accepted at The Workshop on Graph Foundation Models at ICML 2026

点击查看摘要

Abstract:Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called GFMs, particular interest has been paid to GFMs designed for node property prediction tasks, which is one of the most popular settings in Graph ML with lots of real-world applications from fraud detection in financial and social networks to recommendation systems for e-commerce and user-generated content platforms. While a number of GFMs for this task have been recently proposed, the field has not converged to a unified evaluation setting, and different works evaluate their models in widely different ways, preventing reliable comparison of GFMs with each other and with other types of models. In this work, we conduct a fair and rigorous reevaluation of 9 recent GFMs for node property prediction, comparing them to strong Graph Neural Network (GNN) baselines. We find that, among these GFMs, only the most recent ones based on the Prior-data Fitted Networks paradigm outperform well-tuned GNNs in predictive performance, although at a higher inference cost.

[AI-26] Position Spaces and Graphs

链接: https://arxiv.org/abs/2606.25719
作者: Rita-Nathalia Assaf,Tom Davot,Frédéric Lardeux,Frédéric Saubion
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce position graphs, a graph-based reasoning framework based on the formalization of position spaces. This framework utilizes two strict partial orders, representing horizontal and vertical alignment and precedence, to model the relative positions of discrete tokens. Unlike general qualitative spatial calculi, position graphs are constrained by a chain condition and compatibility requirements that focus on rows and columns. We provide a comprehensive theoretical analysis of this representation, beginning with a characterization of graph consistency. Conditions to ensure the consistency of position graphs are established. Furthermore, we investigate the computational complexity of structural pattern discovery, modeled as the induced subgraph isomorphism problem. We demonstrate that this problem remains NP-complete even within the restricted class of position graphs. While initially motivated by document processing, this work focuses on the underlying mathematical properties and algebraic consistency of position-based constraints, providing a formal logical layer that is independent of specific data extraction techniques.

[AI-27] GUI agent : Guided Exploration of User-Sensitive Screens

链接: https://arxiv.org/abs/2606.25705
作者: Aradhana Nayak,Mussadiq Nazeer,Wang Peng,Feng Liu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents are increasingly being used to automate tasks for users within an open GUI environment. They inevitably encounter screens containing user-sensitive information, for which takeover of task execution by the user is highly desirable or even necessary. State-of-the-art LLM-driven agents are usually fine-tuned to complete tasks regardless of the safety implications of their actions. This makes their real-world deployment difficult and adversely affects the reliability. Therefore, it is crucial to identify and categorize user-sensitive states and define user-sensitive queries. This dataset would be to engineers to recognize and request handover to the user in critical scenarios. This short paper develops an explorer agent that systematically explores the query space starting from one demonstrated task to identify queries that, if executed, would lead to user-sensitive states in a GUI environment.

[AI-28] Power-Budgeted Underwater Vehicle Control via Constrained Reinforcement Learning

链接: https://arxiv.org/abs/2606.25680
作者: Yinuo Wang,Gavin Tao,Yuze Liu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注: 10 pages, 10 figures

点击查看摘要

Abstract:Underwater vehicles operate from a fixed onboard energy budget that propulsion rapidly depletes, so a controller that completes its task while drawing less thruster power directly extends mission range and endurance. Reinforcement learning yields capable model-free controllers for station-keeping and trajectory tracking, but optimizing task accuracy alone drives the policy toward oscillatory, energy-wasting actuation. The established remedy subtracts an energy penalty from the reward, yet this sets the task-power trade-off through a single weight with no physical units: a target power level cannot be specified, the weight must be re-tuned for every vehicle and task, and a mismatched weight can even raise power. This paper instead formulates energy-efficient underwater control as a constrained Markov decision process in which average thruster power is subject to an explicit budget, solved with a PPO-Lagrangian algorithm. The power level is set by declaring a budget in physical units, and a single dual variable is updated online to meet it for each vehicle and task, without manual weight search. Across three vehicles and four tasks in the MarineGym simulator, the energy-constrained policy draws the least power in all twelve settings, reducing it by 14–65% (up to 64.9%) over a task-only baseline and below an energy-reward baseline everywhere, while remaining the smoothest in ten settings and preserving task accuracy except in one deliberately power-limited regime. Imposing energy as an explicit constraint thus offers a tuning-free route to energy-efficient underwater control that needs no per-vehicle, per-task weight search.

[AI-29] axonomy of Risks on Automated Fact-Checking Systems Considering its Propagation

链接: https://arxiv.org/abs/2606.25645
作者: Jun Yajima,Tatsuya Oka,Takao Okubo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, preprint

点击查看摘要

Abstract:In recent years, the posting of fake news including disinformation and misinformation on social networking services (SNS) has become a social problem. To combat this fake news, fact-checking that is the process of assessing the veracity of posts on SNS has become increasingly important. While fact-checking is currently performed by fact-checking organizations, it is difficult to fact-check all posts on SNS. Therefore, the use of automated fact-checking systems is effective. Recent automated fact-checking systems utilize artificial intelligence and large language models, so there are risks of incorrect judgments and posting incorrect results on social media which can lead to the spread of misinformation or to engage in defamation. In this paper, as a first step toward enabling the safe use of automated fact-checking systems, we categorize the specific risks on automated fact-checking systems. In this categorizing, we consider a three-stage risk propagation: risk factors, hazardous situations, and harm. Our analysis revealed that 32 specific risks exist in automated fact-checking systems. In this paper, we utilize the categorized risks as analytical cues (guide words) to present the risk assessment of the automated fact-checking system DEFAME. This assessment result indicates that risks that cannot be derived using STRIDE, a conventional IT security risk assessment method can be derived using our guide words.

[AI-30] L: Accuracy and Privacy Preserving Traversal Learning for Distributed Intelligent Systems

链接: https://arxiv.org/abs/2606.25627
作者: Erdenebileg Batbaatar,Young Yoon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 25 pages, 3 figures

点击查看摘要

Abstract:Distributed intelligent systems increasingly need to train across data silos without centralizing raw data. Federated learning keeps data local but can suffer under heterogeneous partitions and requires repeated full-model exchange. Split learning reduces communication through cut-layer activations, but standard protocols generally do not recover centralized mini-batch gradient behavior and may expose activations and gradients in plaintext. We present TL++, a two-mode traversal-learning framework that constructs virtual batches across nodes to recover centralized mini-batch gradient behavior under explicit synchronization assumptions. Base mode exchanges cut-layer activations and gradients rather than full models. Secure mode secret-shares each cut-layer activation and gradient between an orchestrator and a non-colluding helper, preventing either server from observing plaintext cut-layer tensors. This protection is limited to a semi-honest two-server setting; labels and loss-related outputs remain visible to the orchestrator. In the lightweight secure path evaluated here, exactness requires a linear or affine server path, while nonlinear operations require nonlinear MPC or approximation. We formalize TL++, analyze communication and computation costs, and evaluate it against federated and split-learning baselines on CIFAR-10 and BioGPT/PubMedQA using full fine-tuning and LoRA. On CIFAR-10, TL++ base cut 1 and exact secure cut 3 achieve accuracies of 91.41% (SD 0.19) and 90.93% (SD 0.17), respectively, exceeding the strongest measured non-TL++ baseline by more than 12 percentage points. TL++ base cut 1 also reduces per-step communication by 13.1-fold relative to full-model synchronization. PubMedQA results similarly favor TL++. Overall, TL++ approaches centralized-training performance while reducing communication and providing activation-level secret sharing.

[AI-31] Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz

链接: https://arxiv.org/abs/2606.25622
作者: Lea Roxanne Muth,Marian Margraf
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 2026 IEEE International Systems Conference (SysCon), Halifax, NS, Canada, April 6-9, 2026. 8 pages, 1 figure

点击查看摘要

Abstract:The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the German IT-Grundschutz (IT-GS) of the Federal Office for Information Security. However, IT-GS certification is resource-intensive and requires a high level of manual effort for documentation, validation, and revision, making scalable implementation difficult and expensive. Building upon our previous conceptual framework, this paper presents the technical implementation and empirical evaluation of a Multi-Agent System (MAS) architecture combined with Hybrid Retrieval Augmented Generation (HybridRAG) for the partial automation of IT-GS certification. We introduce two novel technical contributions to the MAS architecture to enforce the compliance rigor. The Hypothesis-Verification Loop in the Structural Analysis (SA) phase that cross-references agent-inferred dependencies against the Knowledge Graph to reduce hallucinations, and a Decoupled Reasoning Pipeline that separates agent-driven semantic extraction from the deterministic protection need inheritance. We utilize the BSI’s “RecPlast GmbH” case study as a human expert-generated reference data set for end-to-end evaluation of the architecture and to quantify Precision, Recall, and F1-scores. The performance of the system is investigated across the phases of SA, Protection Needs Assessment (PNA), Modeling, and IT-GS Check. The empirical results reveal noticeable differences throughout the different steps of IT-GS. While the MAS demonstrates high efficacy in semantic tasks (SA and Modeling), significantly reducing manual effort through automated information extraction, quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS. Comments: Accepted for publication at the 2026 IEEE International Systems Conference (SysCon), Halifax, NS, Canada, April 6-9, 2026. 8 pages, 1 figure Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.25622 [cs.CR] (or arXiv:2606.25622v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2606.25622 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1109/SysCon66367.2026.11503560 Focus to learn more DOI(s) linking to related resources

[AI-32] An Approach for a Supporting Multi-LLM System for Automated Certification Based on the German IT-Grundschutz

链接: https://arxiv.org/abs/2606.25608
作者: Lea Roxanne Muth,Marian Margraf
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 2025 IEEE International Conference on Cyber Security and Resilience (IEEE CSR), Chania, Crete, Greece, August 4-6, 2025. 8 pages, 2 figures

点击查看摘要

Abstract:This paper presents a novel approach to perform semi-automated BSI IT-Grundschutz certification using a MultiLarge Language Model system (MLS) with Hybrid RetrievalAugmented Generation (HybridRAG). Facing the challenges of the Network and Information Security Directive 2 (NIS2) directive, a shortage of specialists, and high implementation costs, our MLS architecture aims to increase efficiency, reduce costs, and support certifiers in maintaining the quality of security concepts while meeting the increased demand for certifications of newly affected companies. The system combines Large Language Models (LLMs) and Knowledge Graphs (KGs) to support different phases of the certification process, including protection needs assessment, modeling, IT-Grundschutz check, measure consolidation, and subsequent realization. Our architecture addresses the growing demand for security concepts and offers an approach to handle the digital security challenges introduced by NIS2.

[AI-33] Low-Complexity Policy Tessellations in Structured Markov Decision Processes

链接: https://arxiv.org/abs/2606.25593
作者: Fredy Pokou(CRIStAL)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study optimal-policy geometry in structured Markov decision processes. While approximate dynamic programming and reinforcement learning typically approximate high-dimensional value functions, we show that optimal policies induce simpler decision tessellations. We propose boundary-based policy approximations that learn policy regions directly. A policy-loss decomposition links performance degradation to action margins and explains why errors concentrate near indifference boundaries. Inventory control and queue admission experiments show lower policy error, smaller value gaps, faster error decay, and stability than reinforcement learning baselines.

[AI-34] STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity

链接: https://arxiv.org/abs/2606.25529
作者: Sitong Cheng,Weizhen Bian,Songjun Cao,Jin Li,Bei Liu,Chunyang Jiang,Yike Zhang,Weihao Wu,Yiming Li,Chi-Min Chan,Long Ma,Wei Xue
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speech-to-speech translation (S2ST) should preserve not only lexical meaning, but also expressive attributes: emotion, scenario style (e.g., news reporting vs. dramatic dialogue), and nonverbal vocalizations (NVs). Moreover, collecting cross-lingual target speech that is both translation-faithful and expressively aligned with the source is difficult at scale, making reference-based evaluation impractical. We introduce STEB (Speech-to-Speech Translation Expressiveness Benchmark), a 32.6-hour Chinese–English benchmark that evaluates both standard dimensions (translation fidelity, speaker similarity, duration alignment) and expressiveness dimensions (emotion, scenario style, NV preservation). For expressiveness evaluation, STEB uses a caption-then-summarize framework that converts speech into structured expressive attributes and compares source and hypothesis attributes with an LLM judge. Human validation shows statistically significant correlations with listener judgments across all expressive dimensions. We evaluate six S2ST systems covering cascaded systems, end-to-end models, and speech large language models. Many systems, especially cascaded ones, achieve strong translation fidelity, but they still struggle with emotion preservation (best: 3.82/5) and NV preservation (best: 2.31/5). These results reveal a gap between semantic transfer and expressive transfer, identifying expressiveness preservation as an open challenge for S2ST. Audio samples are available at this https URL.

[AI-35] he impact of artificial intelligence on enterprise software user roles

链接: https://arxiv.org/abs/2606.25525
作者: Isabel Unger,Elizangela Valarini,Martin Schrepp,Nina Hollender,Gabriela Rocha,Erik Bertram
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 18 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Artificial Intelligence (AI) is rapidly reshaping the nature of work in software development, transforming user roles, workflows, and collaboration patterns across enterprise platforms. This qualitative study investigates how AI alters professional responsibilities within the context of SAP’s Business Technology Platform (BTP), combining expert interviews (n=20) and a participatory workshop (n=24). The results reveal substantial shifts in day-to-day tasks and roles in the development domain, characterized by increasing automation of operational tasks, expanding human-AI collaboration, and growing reliance on agentic AI systems. The study further identifies significant implications for existing user-role frameworks, such as the BTP User Type Matrix, which requires adaptation as the workforce is undergoing significant role specific changes. Collectively, these findings highlight a workforce landscape in transition and underscore the need for revised role taxonomies, new governance and oversight functions, and updated design approaches for AI-native enterprise software systems.

[AI-36] Quantization Inflates Reasoning : Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

链接: https://arxiv.org/abs/2606.25519
作者: Xinyu Lian,Walid Krichene,Beichen Huang,Masahiro Tanaka,Olatunji Ruwase,Li Zhang,Minjia Zhang
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Quantization is widely used to reduce the inference cost of large language models, but its effect on reasoning models is not fully captured by final-answer accuracy or per-token latency. We show that low-bit post-training quantization can introduce a hidden test-time compute cost: quantized reasoning models often generate longer chains of thought even when they still answer correctly. Across mathematical reasoning, code generation, scientific question answering, and agentic tool-use benchmarks, we find that INT4/INT3 quantization can preserve accuracy but increase reasoning-token usage, offsetting the expected per-token speedup. To measure this effect, we introduce the CoT Token Inflation Ratio, which compares reasoning length between quantized and full-precision models averaged across all evaluation benchmarks. We further show that token inflation is accompanied by behavioral changes in the reasoning trace, including more intermediate steps and greater semantic repetition. These changes translate into measurable end-to-end real-world serving penalties. Finally, we evaluate mitigation strategies and find that prompting and decoding-time sampling offer inconsistent accuracy-length trade-offs, while quantization-aware training shows more promise in reducing both accuracy degradation and token inflation. Our results suggest that reasoning-token usage should be reported alongside accuracy when evaluating quantized reasoning models.

[AI-37] Learning with a Single Rollout via Monte Carlo Pass@k Critic

链接: https://arxiv.org/abs/2606.25451
作者: Fengdi Che,Yang Liu,Lei Yu,Meng Cao,Tong Che,Rupam Mahmood,Dale Schuurmans
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Estimating token-level advantages in reinforcement learning (RL) for language models remains challenging because scaling up episodic experience collection is expensive. The difficulty intensifies for baseline advantage estimation methods, where repeated sampling causes trajectories to diverge into substantially different reasoning prefixes. In this context, RL algorithms such as GRPO prove limited: an outcome reward is too sparse to be attributed to specific actions like intermediate steps, and comparisons across sampled traces are non-trivial because they are heterogeneous. To mitigate both the computational cost of repeated sampling and the difficulty of credit assignment, we study single-rollout proximal policy optimization (SR-PPO) featuring token-level credit assignment in RL for language models. Instead of estimating advantages by normalizing episodic returns within the candidate group, we train a calibrated token-level credit critic using Monte Carlo outcomes from one rollout per prompt. Specifically, we use the critic to predict the Pass@k success probability at the prompt prefix, which is derived from a Pass@1 attempt. This choice yields a more selective learning signal than Pass@1: it discounts easily solved prefixes while prioritizing hard ones whose success probability remains marginal. We show that as k increases, Pass@k converges to a reachability indicator, reflecting whether a prefix can lead to at least one successful continuation. In an explicit state graph, the limit ( k \rightarrow \infty ) can be computed in O(|V|+|E|) time, offering a promising surrogate for direct credit assignment without the need to sample contrastive traces. As an initial validation, SR-PPO exhibits stable learning dynamics, along with consistent gains in Pass@128 success rates on mathematical reasoning benchmarks such as HMMT26 and AIME24.

[AI-38] opoCast: A Topological Fidelity Framework for Evaluating Transformer-Based Time Series Forecasting

链接: https://arxiv.org/abs/2606.25439
作者: Sandeepa Weerasekara,Sandareka Wickramanayake
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning-based models have achieved state-of-the-art performance in Time Series Forecasting (TSF), yet their evaluation remains dominated by pointwise error metrics such as Mean Squared Error (MSE), which quantify numerical accuracy but overlook structural properties of the forecast signal, including recurrent dynamics, oscillatory behavior, and phase alignment. As a result, forecasts exhibiting over-smoothing, phase shifts, or frequency distortions may achieve favorable error scores despite substantial structural degradation. To address this limitation, we propose TopoCast, a topology-driven framework for evaluating structural fidelity in TSF. TopoCast reconstructs phase-space representations of forecast and ground-truth sequences using Takens delay embedding and applies persistent homology to characterize their intrinsic dynamics. We derive four complementary topological fidelity measures from persistence diagrams and aggregate them into a Topological Fidelity Score (TFS). We further introduce dominant cycle overlap, a novel metric that maps persistent topological features to the temporal domain to assess whether dominant oscillatory patterns occur at the correct time points. Combined with TFS, this yields the Localized Topological Fidelity Score (LTFS), a phase-aware measure that captures temporal localization errors invisible to existing evaluation metrics. Experiments on five Transformer architectures across three real-world benchmark datasets demonstrate that models with similar forecasting errors can exhibit markedly different structural fidelity profiles, revealing failure modes overlooked by conventional evaluation and highlighting the value of topology-aware forecast assessment.

[AI-39] Interpretable Concept-Guided Polynomial Tabular Kolmogorov-Arnold Network for EEG-Based Mild Cognitive Impairment Detection

链接: https://arxiv.org/abs/2606.25434
作者: Yosef Bernardus Wirian,Qiang Cheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early and scalable detection of mild cognitive impairment (MCI) remains an unresolved clinical challenge. Existing EEG-based screening approaches are constrained by handcrafted feature pipelines that discard neurophysiologically meaningful domain structure and deep learning classifiers that sacrifice interpretability for performance. No existing work unifies physiologically organized concept encoders, cross-concept interaction modeling, and nonlinear tabular classification in a sleep EEG-based MCI detection framework. This study proposes Concept-guided Polynomial-transformed Tabular learning using Kolmogorov-Arnold Network (CPTabKAN), which maps heterogeneous EEG-derived features into domain-informed concept representations, expands them via degree-2 polynomial transformation to expose first- and second-order interactions, and applies a Fourier-parameterized TabKAN classifier to learn nonlinear decision boundaries. CPTabKAN was evaluated on the Study of Osteoporotic Fractures cohort (372 subjects, overnight polysomnography), using 1,379 features organized into ten physiologically motivated concept groups. Under 10-fold cross-validation, CPTabKAN-Second Order achieved a weighted F1-score of 0.9038 (SD 0.034), outperforming GradientBoosting by 5.65 percentage points (t(9)=1.934,p=0.043, one-sided paired test), with advantages persisting under SMOTE-based balancing. Ablation analysis confirmed independent contributions from each component. Concept importance analysis revealed that power spectral density, multi-scale entropy, and Hjorth parameters dominated first-order weights, while cross-concept interactions involving Lempel-Ziv-Welch complexity, statistics, demographics, and slow oscillations exceeded all first-order scores. These results demonstrate that concept-structured, interaction-aware tabular learning surfaces physiologically coherent reasoning, supporting clinical trust.

[AI-40] LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models ICML2026

链接: https://arxiv.org/abs/2606.25402
作者: Daniele Cipollone,Sergey Titov,Maliheh Izadi,Egor Bogomolov,Arie van Deursen
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at the DL4Code workshop at ICML 2026

点击查看摘要

Abstract:Large software projects often depend on older versions of libraries, even as APIs continue to evolve across releases. This creates a challenge for LLMs: they must maintain knowledge of multiple API versions, not merely the latest or most common one. However, current LLMs are trained on temporally mixed corpora and lack explicit mechanisms for such version-specific reasoning, leading to anachronistic errors - calling APIs as they exist in a different library version. To systematically evaluate this phenomenon, we introduce LibEvoBench, a multi-task benchmark spanning multiple versions of widely used Python libraries, along with a new metric, the Software Evolution Understanding Score (SEUS), to measure models’ consistency when working with evolving APIs. Our results show that state-of-the-art models are largely version-oblivious: performance degrades for evolving APIs, while for stable APIs it remains the same across versions. Moreover, simply specifying the target version provides no benefit, while relevant documentation significantly boosts models’ accuracy. These findings highlight a systematic limitation of current training paradigms and motivate new approaches for temporally grounded knowledge in code generation.

[AI-41] Lightweight PCGAE-Net: Parallel CrossGate Attention and Bottleneck AutoEncoder for Efficient 5G Channel Prediction

链接: https://arxiv.org/abs/2606.25401
作者: Uma Kishore Godavarti,K. Giridhar,Vanani Prince Dharmendrabhai,Anchit Panday,Madhan Raj Kanagarathinam
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, in review at IEEE GLOBECOM 2026

点击查看摘要

Abstract:Accurate channel state information (CSI) prediction is essential for proactive beamforming and resource management in 5G massive MIMO systems, yet the deployment of high-accuracy transformer-based predictors on base-station hardware remains challenging because the most capable models carry upwards of 30,M parameters. This paper introduces Lightweight PCGAE-Net, which addresses the efficiency problem not by post-hoc compression but by correcting two architectural flaws in the current state of the art. The first is a sequential attention ordering bias: in CS3T-UNet, group-wise temporal attention (GTA) always operates on features that have already been transformed by cross-shaped spatial attention (CSA), distorting what temporal information GTA can capture. We remove this dependency by routing both attention modules to the same layer-normalized input and combining their independent outputs through a learned per-channel sigmoid CrossGate. The second flaw is an uncompressed bottleneck: applying full self-attention at the deepest encoder stage, where channel depth reaches 4C , is quadratically expensive and carries redundant features. A Bottleneck AutoEncoder (BAE) with 1\times1 convolutions halves this depth and uses an auxiliary reconstruction loss to prevent information collapse. Wrapping these components inside a shallower encoder-decoder with frequency-domain dimensionality reduction ( N_f!=!32 , C!=!48 ) produces a model with just 8.54,M parameters – 58% fewer than the CS3T-UNet baseline – that outperforms it by up to 3.26,dB at 5,km/h and 6.0,dB at 9,km/h in single-step prediction on QuaDriGa dataset.

[AI-42] BrainAgent : A Large Language Model-Driven Multi-Agent Framework for Autonomous Brain Signal Understanding

链接: https://arxiv.org/abs/2606.25400
作者: Yangxuan Zhou,Sha Zhao,Jiquan Wang,Shijian Li,Gang Pan
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 11 figures

点击查看摘要

Abstract:Brain-Computer Interfaces (BCIs) and brain signal understanding are pivotal for clinical health and next-generation interactions. Despite this significance, its widespread adoption in real-world scenarios remains restricted, primarily because current analytical paradigms lack sufficient agentic intelligence. First, existing methodologies impose prohibitive technical barriers, requiring extensive specialized expertise. Second, they remain inherently static and task-specific, failing to execute the complex, long-horizon workflows essential for real-world deployment. To accelerate the democratization of brain signal understanding, we draw inspiration from Large Language Models (LLMs) to introduce BrainAgent, an LLM-driven multi-agent framework designed to ground abstract natural language intent into rigorous, executable, and end-to-end processing pipelines. BrainAgent employs a hierarchical architecture where a central supervisor orchestrates specialized sub-agents for adaptive task decomposition and execution. Furthermore, we establish a comprehensive, systematic benchmark for evaluating agentic systems in brain signal analysis. Empirical results demonstrate that BrainAgent effectively automates complex workflows with superior reliability, marking a paradigm shift toward democratized brain signal understanding.

[AI-43] Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

链接: https://arxiv.org/abs/2606.25396
作者: Kaicheng Shen,Lingyu Li,Wen Wu,Yan Teng,Liang He,Yingchun Wang
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 4 figures, 2 tables

点击查看摘要

Abstract:AI companions powered by large language models increasingly interact with cognition-developing users, including children and adolescents, creating risks that may accumulate over time. Existing safety evaluations largely rely on single-turn or short-session tests, which cannot capture risks that emerge only through prolonged interaction. To address this gap, we propose TSJ (Theater-Stage-Judge), a longitudinal framework combining persona-driven user simulation, dynamic psychological-state updating and retrospective evaluation. We evaluate six mainstream models across four developmental stages, twenty-four risk dimensions and three psychological-vulnerability personas, covering 12,960 simulated person-day interactions. TSJ shows that short-horizon testing systematically underestimates developmental risks, for which TSJ yields a stable risk estimate only after 140 turns within prolonged simulated relationships. Applying TSJ further identifies early childhood and emerging adulthood as the most vulnerable stages, with cognitive trust and emotional dependency as the weakest domains. TSJ provides a scalable methodology for longitudinal cognitive developmental risk evaluation in AI companion systems.

[AI-44] FactorLibrary: From Polynomials to Circuits via Recursive Subgoals ICML2026

链接: https://arxiv.org/abs/2606.25394
作者: Rohan Pandey,Michael Ruofan Zeng,Weikun K. Zhang,Kaijie Jin,Naomi Morato,Archit Ganapule,Bhaumik Mehta,Jarod Alper
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures, in 3rd AI for Math Workshop (ICML 2026)

点击查看摘要

Abstract:Finding minimal arithmetic circuits for polynomials over finite fields is a combinatorially hard problem central to algebraic complexity theory. We formulate it as a reinforcement learning problem in two directions, bottom-up and top-down. To address the challenge of a fast-growing combinatorial search space, we introduce FactorLibrary, which stores factorizable subexpressions that serve as reusable subgoals across training episodes. We trained a bottom-up agent with Gumbel-PPO-MCTS and two top-down agents with PPO+MCTS and SAC. The PPO+MCTS top-down agent exhibited the most stable performance, finding certified optimal circuits up to complexity 8 with a success rate of 91.8% .

[AI-45] From Sounds to Scenes: A Benchmark for Evaluating Context-Aware Auditory Scene Understanding in Large Audio Language Models

链接: https://arxiv.org/abs/2606.25391
作者: Pengfei Zhang,Hoang H Nguyen,Kazi Shaharair Sharif,Yutong Song,Wenjun Huang,Henry Peng Zou,Pinxin Liu,Honghui Xu,Amir M. Rahmani
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent Large Audio Language Models (LALMs) have achieved remarkable progress in audio perceptual tasks across individual acoustic layers, including speech, sound, and music. However, existing benchmarks predominantly evaluate these layers in isolation, overlooking the complex contextual relationships that arise when multiple acoustic sources co-occur in real-world auditory scenes. Real-world auditory interpretation requires Context-Aware Auditory Scene Understanding (CASU): the ability to comprehend the holistic scene by integrating sound layers. To evaluate this capability, we introduce the CASU benchmark, which assesses whether Audio LLMs can interpret auditory scenes composed of speech, acoustic events (e.g., announcements), and background environments (e.g., traffic), and reason about the logical relationships between these layers. We propose a scalable pipeline for constructing time-accurate, semi-synthetic audio streams by composing real-world scene sounds with synthetic speech. Building on this data, we design four tasks that probe scene understanding: contextual question answering, entity extraction from the scene, speaker role inference, and counterfactual reasoning where scene is manipulated. Experiments across multiple LALMs demonstrate that effective auditory scene understanding requires integration over all auditory layers, rather than reliance on speech or sound alone, underscoring the necessity of CASU for advancing complex audio understanding in LALMs.

[AI-46] Offline Multi-agent Continual Cooperation via Skill Partition and Reuse ICML2026

链接: https://arxiv.org/abs/2606.25389
作者: Yuchen Xiao,Lei Yuan,Ruiqi Xue,Tieyue Yin,Yang Yu
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 12 figures, ICML 2026

点击查看摘要

Abstract:Extracting skills from multi-agent offline dataset improves learning efficiency via sharing task-invariant coordination skills among tasks. In settings where tasks occur sequentially and the space of skills grows exponentially, existing approaches that rely on heuristically designed and fixed-sized skill libraries struggle to resolve the problem of distributional shift and interference, facing catastrophic forgetting and plasticity loss. To address this problem and endow agents with the ability to continually discover and reuse coordination skills in open-environment, we propose COMAD, a principled framework for Continual Offline Multi-agent Skill Discovery via Skill Partition and Reuse. We first discover skills from mixed multi-agent behavior data with an auto-encoder to transform coordination knowledge into reusable coordination skills. Then we construct a skill-augmented policy learning objective with multi-head architectures, explicitly guiding the advantage function with reusable skills identified via a density-based reusability estimator. Theoretical analysis shows our method approximates the optimum of a continual skill discovery problem. Empirical results across diverse MARL benchmarks show that COMAD continually expands its skill library to mitigate interference, achieving superior forward and backward transfer for task streams compared to multiple baselines.

[AI-47] What Actually Works for Spacecraft Fault-Tolerant Control: An Honest Settled-Gate Benchmark of Learned and Classical Methods

链接: https://arxiv.org/abs/2606.25374
作者: Alireza Shojaei
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent learned fault-tolerant-control (FTC) work reports high success on spacecraft actuator faults, but often in simulation, on narrow fault sets, and with transient metrics that a trajectory need only touch once. We ask what recovers spacecraft pointing when success means holding it on faults never seen in training. We answer with a benchmark built around a settled gate, pointing held within 0.2 deg over a dwell window and scored on the true state, train/test splits disjoint in inertia, gain, sign pattern, and bias, Wilson intervals over n=500 episodes per cell, and one-command reproduction on a 6-DOF Basilisk testbed. Across classical, adaptive, learned end-to-end, and structured controllers, three findings stand out. Fault-unaware PD/PID and from-scratch end-to-end RL score 0%, so learning capacity alone is not the lever. Classical adaptive laws resolve sign faults but handle gain poorly at 55.2%, and a literature-faithful Nussbaum-gain law reaches 45.2% and 3.2%. A structured estimate-then-control design, with a learned recurrent module that infers actuator gain online and feeds an analytic law, wins on sign and gain faults at 97.8% and 94.4%, approaching the privileged oracle while unstructured methods remain at zero. The hard wall is constant additive bias, which is 0% for every controller including the privileged gain oracle, because an integral-free law cannot null a constant disturbance. We close it with a disturbance observer that recovers bias from the dynamics and is self-correcting for gain-estimate error. Composed with the gain estimate, it recovers 59.4% of held-out bias faults with no sign/gain regression, moving that class off zero. We classify sensor-fault regimes similarly, show that sensor bias is unobservable from the corrupted measurement alone and therefore requires fusion rather than an observer, and release the benchmark so the gate is shared.

[AI-48] Conformal Recovery-Deadline Certificates for Runtime Assurance of Adapting Controllers

链接: https://arxiv.org/abs/2606.25371
作者: Alireza Shojaei
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Runtime assurance (RTA) protects a safety-critical system by switching from an advanced controller to a verified safe controller when a monitored condition is violated. The standard latching rule, which trips on the first breach of the safe set and then coasts, is correct for a diverging controller but pathological for a capable online-adapting one. Such a controller is unsafe by design during a bounded recovery transient. It must excite the plant to identify the fault before it can correct it, so a latching shield trips on that transient and suppresses a controller that would have recovered. We introduce the conformal recovery-deadline certificate, a split-conformal, distribution-free, finite-sample upper bound on the adapting controller’s recovery time that licenses delayed fallback with a coverage guarantee, backstopped by a verified monitor at a hard critical limit. The certified deadline discriminates capable from incapable controllers, keeping the recoverer autonomous while catching the diverger. The construction separates autonomy, governed by statistical coverage, from safety, governed by the verified backstop, as an instance of reliability-asymmetric design. We prove marginal coverage, a weighted extension that restores coverage under a known fault-distribution shift, and group-conditional Mondrian coverage. We demonstrate all three on two unrelated Simplex testbeds: a 6-DOF spacecraft attitude controller and a torque-controlled inverted pendulum. Both show the same suppression pathology and the same cure, making the certificate a domain-general mechanism rather than a single-system trick.

[AI-49] Reliability-Asymmetric Spacecraft Autonomy: Co-Designing a Capable Learned GNC Stack with a Verified Adaptation-Aware Runtime Shield

链接: https://arxiv.org/abs/2606.25366
作者: Alireza Shojaei
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep-space missions need onboard autonomy that is both capable and certifiable. Rule-based autonomy is certifiable but brittle, while learned autonomy is capable but hard to verify. We present AMPLE-GNC, a three-tier guidance, navigation, and control stack. Its capability path combines a small foundation-model commander that maps natural language to PDDL+, a constraint-screening verifier, and a fault-adaptive controller. All three are bounded by a runtime shield with nine linear-temporal-logic invariants whose predictor soundness is machine-checked by the Kind 2 model checker. On a 6-DOF Basilisk testbed, we make three contributions. First, we deploy an edge commander. Fine-tuning a pretrained 360M model with grammar-constrained decoding gives a hard output-validity guarantee and 84% planner-executable actions. On a de-leaked test, novel-phrasing generalization is 38% exact and 51% action, rising to 48% exact after phrasing-diversity re-finetuning; we separate syntactic validity from semantic accuracy. Second, we introduce a fault-adaptive controller. Rapid Motor Adaptation infers latent actuator faults online and recovers 97.8% of actuator-sign faults and 94.4% of continuous-gain faults within the training randomization envelope. Fault-unaware PD and from-scratch end-to-end RL both score 0%, while the strongest classical-adaptive baseline reaches 55% on continuous gain. Beyond the envelope, a split-conformant retrain scores 57-67%, and adding 4x more in-regime data worsens performance, showing that randomization breadth, not data volume, drives generalization. Robustness is flat under star-tracker noise to 0.005. Third, we show that a latching safe-hold shield can suppress even a capable controller. A split-conformal recovery-deadline certificate with adaptation-aware engagement reconciles safety and recovery, keeping the controller 94.5% autonomous while still catching non-recovery.

[AI-50] Compositional Behavioral Semantics for State Abstraction in Reinforcement Learning

链接: https://arxiv.org/abs/2606.25357
作者: Yivan Zhang,Ziyan Luo,Manuel Baltieri
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Category Theory (math.CT)
备注: International Conference on Machine Learning 2026

点击查看摘要

Abstract:State abstraction plays a key role in scaling reinforcement learning to complex but structured systems. In studying such systems, a wide range of behavioral structures have been studied in reinforcement learning, including value functions, invariants, bisimulation relations, and behavioral metrics. However, a general principle for determining what structures are provably preserved under state abstraction is still lacking. In this paper, we present a unified framework for defining and analyzing behavioral structures in reinforcement learning. Our framework provides a compositional way to specify behavioral semantics based on local, one-step descriptions of system dynamics. Using this framework, we establish results showing how behavioral structures can be safely transferred between abstract and concrete systems. We further show how to construct quantitative metrics from logical behavioral semantics with soundness guarantees. Together, these results provide a principled foundation for reasoning about behaviors under state abstraction in reinforcement learning and offer reusable definition and proof principles for a broad class of behavioral structures in reinforcement learning.

[AI-51] Decoupling Reconnaissance and Exploitation: Measuring the Capability Boundaries of LLM -Based Web Penetration Testing

链接: https://arxiv.org/abs/2606.25332
作者: Liwei Yu,Shuo Li,Ming Zhou,Ge Chu,Yan Guo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promise for automated penetration testing, yet existing end-to-end black-box evaluations are highly susceptible to error cascading: failures in early reconnaissance can mask an agent’s actual ability to exploit vulnerabilities. To more accurately characterize these capabilities, we propose a two-stage decoupled evaluation framework that separates exploit execution from reconnaissance. Using ground-truth injection and knowledge-driven ablation across 70 high-fidelity web vulnerability testbeds, our framework isolates exploitation performance from reconnaissance noise. We empirically evaluate five open-source penetration-testing agents, covering multiagent, monolithic, and graph-driven architectures, on a strictly aligned subset of 50 representative vulnerabilities. The results reveal a substantial capability gap. With accurate vulnerability context, agents achieve a functional success rate of up to 90.0%, whereas autonomous reconnaissance, measured by targeted vulnerability recall, plateaus at approximately 50.0%, primarily due to failures in parsing unstructured telemetry. Cross-architectural analysis further reveals distinct capability niches: multi-agent isolation is more effective for long-sequence interactions such as de-serialization, while monolithic and graph-driven designs perform better on short-chain injections and cross-session access-control vulnerabilities, respectively. This decoupled evaluation work provides a fine-grained benchmarking protocol and an empirical basis for designing next-generation automated offensive security agents.

[AI-52] Supervised Post-training of Speech Foundation Models for Robust Adaptation in Speech Deepfake Detection

链接: https://arxiv.org/abs/2606.25328
作者: Zihan Pan,Sailor Hardik,Jinyang Wu
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large speech foundation models have shown strong potential for speech deepfake detection, but direct fine-tuning is limited by a mismatch between self-supervised pre-training objectives and spoof-specific artifacts. To address this, we propose a mix-frame post-training strategy to create localized spoof-oriented perturbations and use frame-level supervision to encourage the SSL model to learn local inconsistencies that are critical for robust spoof detection. On ASVspoof5, we achieve state-of-the-art EER 4.50% for a single model without data augmentation. On ASVspoof2021 LA/DF, it further achieves only 0.16% absolute EER gap between LA and DF, indicating strong and balanced robustness across distinct distortion conditions. These results show that supervised post-training provides an effective and practical way to adapt speech foundation models for robust deepfake detection.

[AI-53] Omni-Perception Policy Optimization for Multimodal Emotion Reasoning ICML2026

链接: https://arxiv.org/abs/2606.25325
作者: Zhiyuan Han,Beier Zhu,Wenwen Tong,Pengyang Shao,Peipei Song,Xinyi Wang,Jiangnan Chen,Lewei Lu,Xun Yang
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:We find that current emotion-oriented Omni-MLLMs still lack reliable omni-modal perception: they (i) underutilize multimodal cues in their reasoning trajectories and (ii) exhibit unfaithful behavior, often hallucinating modality-specific statements from other modalities. Building on these insights, we propose OPPO (Omni-Perception Policy Optimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduce MEP-Bench, a diagnostic benchmark that quantifies utilization and faithfulness. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and MME-Emotion, while substantially improving utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni perception for multimodal emotion reasoning.

[AI-54] Communicability-Inspired Positional Encoding (CIPE)

链接: https://arxiv.org/abs/2606.25293
作者: Yipeng Zhang,Zhongtian Sun,Pietro Liò,Kelin Xia
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure, 3 tables; supplementary material includes additional experiments and theoretical proofs

点击查看摘要

Abstract:Positional encodings (PEs) are essential for Transformers. Yet designing effective PEs for non-Euclidean graphs remains challenging. Such encodings should ideally induce an Attention-Compatible Geometry for self-attention: not merely describing graph structure, but defining a geometry whose inner products reflect meaningful structural relatedness. To realize this geometry, we propose Communicability-Inspired Positional Encoding (CIPE), built from communicability, a measure between pairs of nodes that aggregates contributions from paths of all lengths. By construction, CIPE inner products recover communicability, converting global multi-path connectivity into an attention-ready similarity geometry. For practical Transformer training, we introduce dimensionality alignment, mapping graph-size-dependent CIPE representations to prescribed dimensions while faithfully preserving the induced geometry. Empirically, CIPE improves structure-agnostic Transformers by 35.5% on average across seven benchmarks, outperforming representative PEs; it also consistently improves structure-biased graph Transformers, where competing PEs often yield only marginal benefits. These results position CIPE as a principled framework for attention-compatible graph positional encodings.

[AI-55] EPTS: Elastic Post-Training Sparsity for Efficient Large Language Model Compression KDD2026

链接: https://arxiv.org/abs/2606.25285
作者: Ke Xu,Jiaqi Wan,Wenhao Hu,Han Pu,Xiaoyun Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: KDD 2026

点击查看摘要

Abstract:Post-Training Sparsity (PTS) has emerged as a crucial paradigm for compressing Large Language Models to facilitate efficient deployment on resource-constrained devices. However, existing PTS methodologies are typically confined to Single-Sparsity optimization, necessitating a separate, time-consuming optimization session for each specific sparsity level. This rigid paradigm significantly hinders flexible deployment across diverse hardware scenarios, as adapting to a new sparsity requirement mandates a complete re-optimization process. To address these limitations, we propose Elastic Post-Training Sparsity (EPTS), a unified Multi-Sparsity framework that produces a single elastic model capable of maintaining robust performance across diverse sparsity configurations through a one-shot optimization process. Specifically, we design a Multi-Sparsity Hierarchy LoRA (MS-HiLoRA) mechanism that facilitates knowledge inheritance from low- to high-sparsity groups, effectively mitigating the competition for parameter reconstruction. Furthermore, we introduce a Multi-Sparsity Feature Mixer (MSFM), which significantly enhances the model’s adaptability to pruning perturbations by dynamically fusing feature representations of varying sparsity granularities. Extensive experiments on LLaMA and OPT families demonstrate that EPTS achieves competitive performance compared to state-of-the-art methods like SparseGPT and Wanda, while offering significant efficiency gains by enabling multi-scenario deployment from a single optimization. our source code is available at this https URL.

[AI-56] UC-Search: Risk-Aware Test-Time Search for Delayed Constrained Time-Series Control

链接: https://arxiv.org/abs/2606.25274
作者: Xibai Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time-series models are usually scored as forecasters, yet deployed systems often require delayed decisions under uncertainty and hard feasibility constraints. UC-Search is a model-agnostic test-time wrapper: a backbone emits forecasts or action scores, a feasibility automaton rolls candidate paths forward, and bounded search returns the first action of a risk-adjusted feasible trajectory. We instantiate UC-Beam and a UCT-style UC-MCTS diagnostic, using epistemic, aleatoric, and propagated uncertainty mainly as path-risk terms. A myopic-collapse/separation theorem states when search reduces to one-step risk-greedy and when delayed feasible-set coupling can create non-myopic value. Primary evidence comes from a predeclared public 9 -family, 33 -series delayed-control suite with six held-out starts per series: UC-Pareto is positive versus validation-selected CEM, MPPI, and risk-aware random at the normalized threshold ( +3.1675/+2.3328/+2.5038 ), and remains positive in a compute-matched audit ( +2.8466/+2.7418/+2.7429 ). ETT/LTSF delayed-inventory validation supports the same compute-frontier claim. A 48-series raw M4 standard periodic-review lost-sales inventory audit is positive versus the strongest classic base-stock control ( +13556.7547 ), CEM ( +64900.2207 ), and risk-random ( +52881.6042 ), while MPPI remains family-mixed. FI-2010, official-forecast adapters, SB3/FQI controls, direction/capacity/intervention checks, and synthetic mechanism tests are reported as boundary or mechanism evidence rather than broad dominance claims.

[AI-57] FDN: Interpretable Spatiotemporal Forecasting with Future Decomposition Networks

链接: https://arxiv.org/abs/2606.25201
作者: Nicholas Majeske,Ariful Azad
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Spatiotemporal systems comprise a collection of spatially distributed yet interdependent entities each generating unique dynamic signals. Highly sophisticated methods have been proposed in recent years delivering state-of-the-art (SOTA) forecasts but few have focused on interpretability. To address this, we propose the Future Decomposition Network (FDN), a novel forecast model capable of (a) providing interpretable predictions through classification (b) revealing latent activity patterns in the target time-series and © delivering forecasts competitive with SOTA methods at a fraction of their memory and runtime cost. We conduct comprehensive analyses on FDN for multiple datasets from hydrologic, traffic, and energy systems, demonstrating its improved accuracy and interpretability.

[AI-58] A Hybrid CNN-LSTM Intrusion Detection Framework for Cybersecurity in Smart Renewable Energy Grids

链接: https://arxiv.org/abs/2606.25200
作者: Sajib Debnath,Remon Das
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The accelerated digitalization of renewable energy smart grids through IoT sensors, AMI, and SCADA systems has significantly expanded the attack surface for sophisticated cyberattacks, FDI attacks that stealthily distort state estimation and DoS/DDoS attacks that flood communication channels. Current IDS, however, exhibit three inherent limitations: inadequate modeling of the temporal progression of multi-step attacks, degraded scalability under extremely skewed class distributions of standard benchmark datasets, and restricted generalization across heterogeneous network environments. In this study, we present a Hybrid CNN-LSTM IDS that jointly exploits CNN-based spatial feature extraction and LSTM-based temporal sequence modeling, enabling the detection of instantaneous volumetric anomalies and gradually evolving low and slow-attack campaigns in real time. The model was trained using a seven-step preprocessing workflow comprising missing-value imputation, min-max normalization, one-hot encoding, SMOTE class balancing, mutual-information feature selection, causal temporal sequence construction (T=10), and stratified partitioning. LSTM (96.1%), Random Forest (93.5%), SVM (91.2%) and KNN (89.7%); in NSL-KDD, it reaches 98.2% precision versus 96.4% (LSTM), 95.2% (CNN), 92.7% (Random Forest) and 90.8% (SVM), with margins of 2-9 percentage points in all measures. An ablation analysis identified SMOTE balancing as the most influential design choice (-3.7~pp F1 without it). The model achieves a real-time inference throughput of 27,800 flows/s on GPU and 0.082 ms/sample CPU latency in FP32, with INT8 quantization providing an additional 3.1 x speedup at 0.3% accuracy loss, confirming deployment feasibility on resource-constrained IEDs with 128MB memory and establishing a deployable deep-learning framework for securing next-generation renewable energy smart grid infrastructure.

[AI-59] Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality Diversity and Novelty

链接: https://arxiv.org/abs/2606.25198
作者: Antonis Antoniades,Deepak Nathani,Ritam Saha,Alfonso Amayuelas,Ivan Bercovich,Zhaotian Weng,Vignesh Baskaran,Kunal Bhatia,William Yang Wang
类目: Artificial Intelligence (cs.AI)
备注: 14 pages main text, 82 pages total including appendix; 38 figures, 4 tables

点击查看摘要

Abstract:Autonomous AI Research promises to accelerate the scientific progress of machine learning. To realise this goal, current Large Language Model (LLM)-based agents need to go beyond just writing code, to mastering the exploration of simultaneously performant, diverse and novel ideas. To this end, we introduce Heuresis, a framework that abstracts the research pipeline into a set of general and composable primitives, enabling open-ended scientific exploration in machine learning research. We implement six search strategies: a greedy baseline, two archive-based (MAP-Elites, Go-Explore), one evolutionary (Islands), and two divergent (Curiosity, Omni), and evaluate them across three axes (Quality, Diversity, and Novelty) on three domains (LLM Pretraining, On-Policy RL, and Model Unlearning), totalling 3,222 scored runs. We find that completely novel ideas are rare. No idea across our scored runs is rated as “Original”, and only a few achieve only “Minor Similarity” to prior work. Moreover, novel ideas never approach the highest-performing known-recipe scores. Across all six strategies and three domains, only one such idea lands in the top-10 by quality. We also observed agents resorting to a variety of reward-hacking techniques during execution (40 confirmed fabrications across 1,628 scored runs), and detecting them was necessary to keep the search faithful to the task. Our results show that while current search and Quality-Diversity strategies enable us to steer where the generated ideas land on the quality, diversity, and novelty axes, they do not expand the quality-novelty frontier. Bridging this gap is the open challenge towards the ultimate goal of perpetual, autonomous scientific progress. Code is available at this http URL.

[AI-60] SoK: AI Secure Code Generation: Progress Pitfalls and Paths Forward

链接: https://arxiv.org/abs/2606.25195
作者: Rupam Patir,Keyan Guo,Haipeng Cai,Hongxin Hu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing use of AI systems for code generation raises a central security question: what can today’s models and coding agents actually do to produce secure code, where do they still fail, and what would move the field forward? Existing work has explored prompting, fine-tuning, reinforcement learning, and agentic workflows for secure code generation, but the field still lacks a systematic understanding of how these techniques improve security and why substantial failures persist. In this SoK, we systematize the progress, pitfalls, and paths forward for AI secure code generation. We introduce a three-level framework that measures models’ natural-language understanding of secure coding principles, their code-level actuation of those principles during generation, and the knowledge–actuation gaps between the two. We instantiate this framework across models and coding agents on benchmarks covering both isolated function-level security and full web-application security. Our results show that secure-coding-principle understanding is a statistically strong predictor of code-level outcomes, including functional correctness, security, and joint functional-security correctness. Yet substantial knowledge–actuation gaps remain: models can recognize relevant security principles but still fail to translate them into secure and functional code. These findings offer a principle-centered account of where AI secure code generation stands today and identify concrete paths forward through principle-guided generation, evaluation, benchmarking, and agentic workflows.

[AI-61] ransferability for General Reasoning : An Automated Curriculum for Multi-Domain RLVR

链接: https://arxiv.org/abs/2606.25178
作者: Yongjin Yang,Jiarui Liu,Yinghui He,Lezhen Zhang,Bernhard Schölkopf,Zhijing Jin
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, including supplementary material; code available at this https URL

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.

[AI-62] Elo-Disentangled Player-Style Embeddings for Human Chess via Rating-Conditioned Residual Move Model

链接: https://arxiv.org/abs/2606.25176
作者: Jason Carlson
类目: Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:We study representation learning for individual human chess style: a per-player embedding learned from a player’s move history such that inner products measure stylistic similarity, while being approximately disentangled from playing strength (Elo). Our key design is a residual formulation: a rating-conditioned base move model (Maia-3 policy logits plus Stockfish-derived features, scored over Maia-2-proposed candidates) captures what a typical player of a given strength would play, and a frozen copy of it anchors a learned move encoder and a per-player vector z, so that z explains only deviations from rating-typical play. The base model improves move prediction over the strong Maia-3 policy by 27-37% relative NLL across the rating spectrum, with the largest gains at the top (2800+); Stockfish’s marginal value grows monotonically with Elo (negligible at 900-1200, +0.085 nats at 2800+). On a shared Elo-stratified benchmark of 22,620 held-out decisions, top-1 move-matching rises monotonically from Maia-2 to Maia-3 to the Stockfish-augmented base (0.51 - 0.57 - 0.68): the base is +33% relative top-1 over Maia-2 and +19% over Maia-3 (30% lower NLL), with the engine-feature lift largest at high Elo. The player embedding adds little to raw move-matching on top of this base – its marginal top-1 gain falls within the 95% confidence interval – and its value is instead representational: z generalizes to held-out decisions without overfitting, re-identifies players from disjoint games above chance, and a linear probe recovers rating from z with only R^2 = 0.06 (no better nonlinearly), evidence it captures style on an Elo-orthogonal axis. We argue that a strong rating-conditioned base plus a compact, Elo-disentangled embedding – separating typical play from individual deviation – is an economical, interpretable model of individual style, an alternative to per-player preference fine-tuning.

[AI-63] RUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory

链接: https://arxiv.org/abs/2606.25161
作者: Tianyu Yang,Sudipta Paul,Vijay Srinivasan,Vivek Kulkarni,Srinivas Chappidi
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents rely on long-term memory to support extended interactions and personalized assistance beyond finite context windows. Existing memory agents actively update external memory through generated write, revise, and delete operations, but these updates may omit important information, corrupt existing memory, or introduce unsupported hallucinated content. Once stored, such errors become persistent system-state failures that can affect future reasoning and generation. In this paper, we propose TrustMem, a framework designed to improve the trustworthiness of memory consolidation. TrustMem relies on a Memory Transition Verifier to evaluate the transition process of memory updates in terms of coverage, preservation, and faithfulness. It further constructs preference pairs among candidate updates under the same memory state, enabling preference-guided reinforcement learning to directly optimize memory updating behaviors. Extensive experiments demonstrate that TrustMem improves both memory utility and reliability: it achieves state-of-the-art results across MemoryAgentBench, HaluMem, and the Mem-alpha validation set, improves HaluMem memory extraction by 12.14 F1 points, and reduces transition-level omission, corruption, and hallucination by 40.1%, 79.1%, and 50.0%, respectively, compared with the strongest baseline for each error type.

[AI-64] ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

链接: https://arxiv.org/abs/2606.25156
作者: Habibullah Akbar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern large language models based on softmax scaled-dot-product attention are constrained by their training sequence length: as the key-value sequence grows, softmax probability mass can dilute across a wider distribution, inducing activation shift and long-context performance collapse. Moreover, long-context language modeling faces a structural tension: a sliding-window attention core maintains a bounded local representation and low perplexity but is blind to long-range dependencies, while full-context attention preserves global recall but suffers from out-of-distribution perplexity explosion. To resolve these limitations, we introduce ATMA, a hybrid convolutional-attention architecture that integrates a novel three-channel attention mechanism. ATMA factorizes the attention mixing step into: (1) a count-blind, unit-vector direction channel, (2) a bounded magnitude channel driven by the participation ratio of effective matches over an extreme-value-corrected null sink, and (3) a long-term recurrent compression memory optimized via a gated-delta fast-weights rule. Neither the Polar Attention core nor the recurrent memory is sufficient alone; their combination enables monotonic perplexity reduction and high-fidelity long-range retrieval simultaneously. We evaluate ATMA using a 100-run factorial ablation sweep, demonstrating that the combined Polar + memory model maintains induction needle-in-a-haystack retrieval accuracy above 90% out to 64K tokens (32 times the training length of 2K) while its document perplexity improves monotonically, outperforming softmax-based memory baselines which collapse at extreme context lengths. Code: this https URL

[AI-65] Silent Failures in Physics-Informed Neural Networks: Parameter Poisoning and the Limits of Loss-Based Validation

链接: https://arxiv.org/abs/2606.25151
作者: David McShannon,Nicholas Dietrich
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) embed governing equations in their loss function, enabling mesh-free solutions to partial differential equations. Low training loss is treated as evidence that the learned solution is physically correct. This paper shows that assumption breaks down when encoded physics are incorrect. By perturbing PDE parameters before training, a setting we describe as physics parameter poisoning or parameter misspecification, we produce models that train to low loss but give incorrect answers; we treat the perturbation schedule as sensitivity analysis rather than only as a security threat, and none of our claims requires an adversary. Achieving low residual loss does not discriminate accurate from inaccurate solutions: poisoned models reach losses at or below the clean baseline yet differ by large margins, so driving the residual down is not evidence of physical accuracy. Across three PDE systems (Burgers equation, Navier-Stokes cavity, and convection-diffusion), poisoned models match or beat the clean-model training loss while their solutions differ by up to 71% in the fixed sweep and up to 128% under adversarial search; at Cavity Re=400 the poisoned loss falls below the clean baseline. We define a detection difficulty ratio R (solution error divided by training loss) to summarize how invisible the corruption is, though cross-PDE comparison is complicated by differences in loss scale. We test six candidate defenses, none of which reliably detects corruption across all regimes. We propose a post-hoc defense: sweeping the PDE residual loss across parameter values without retraining. The loss minimum recovers the true training parameter without external data, and generalizes across all three PDE systems. The effect holds across five network architectures (8.7K to 133K parameters), is bidirectional, and is confirmed across multiple random seeds.

[AI-66] Reward-Conditioned Attention: How Reward Design Shapes What Autonomous Driving Agents See

链接: https://arxiv.org/abs/2606.25127
作者: Mohamed Benabdelouahad,Ahmed Djalal Hacini,Nadir Farhi,Aissa Boulmerka
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We investigate how reward design shapes the internal attention patterns of reinforcement learning agents trained for autonomous driving. Using three Perceiver-based agents that share identical architectures and training data but differ only in their reward configurations \unicodex2014 ranging from basic violation penalties to continuous proximity penalties \unicodex2014 we analyze cross-attention allocation across 50 real-world scenarios from the Waymo Open Motion Dataset. A central methodological finding is that naïve pooling of timesteps across episodes substantially underestimates the attention \unicodex2013 risk relationship; within-episode correlation with Fisher z-transform aggregation is the appropriate statistic and reveals a robustly positive link between collision risk and agent-directed attention. Building on this validated methodology, we demonstrate two reward-conditioned effects: agents trained with navigation rewards allocate up to 2.0\times more attention to GPS-path tokens than those trained with additional proximity penalties \unicodex2014 and 4.7\times more than agents with no navigation incentive \unicodex2014 revealing that reward content directly determines which scene elements the encoder prioritizes, and continuous time-to-collision penalties create a \textitlearned vigilance prior \unicodex2014 elevated resting agent surveillance maintained throughout collision-free phases. In several scenarios, the complete-reward and minimal-reward models exhibit opposite attention \unicodex2013 risk correlation directions, demonstrating that reward design can qualitatively reverse attentional strategy rather than merely modulating its magnitude. These results suggest that attention analysis is a practical diagnostic for verifying that a reward function produces the intended representational behaviour in safety-critical RL systems.

[AI-67] AeroCast: Probabilistic 3D Trajectory Prediction for Non-Cooperative Aerial Obstacles via Transformer-MDN Architecture

链接: https://arxiv.org/abs/2606.25122
作者: Syed Izzat Ullah,Jose Baca
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autonomous aerial vehicles operating in shared airspace must predict the future positions of non-cooperative obstacles to plan evasive maneuvers before a collision becomes unavoidable. Unlike cooperative systems that share intent, non-cooperative obstacles such as birds, uncontrolled drones, or debris exhibit multi-modal motion that deterministic predictors cannot adequately represent. Existing methods either rely on recurrent encoders that propagate temporal information sequentially, limiting their ability to capture long-range kinematic precursors of maneuver initiation, or produce point forecasts that provide no distributional information to downstream planners. This paper presents AeroCast, a probabilistic trajectory prediction framework that combines a Transformer encoder with a Mixture Density Network output head to predict per-timestep Gaussian mixture distributions over future three-dimensional displacements. A translation-invariant consecutive displacement encoding and a calibration-oriented training objective address the input design and mode-degeneracy challenges specific to mixture-based aerial trajectory prediction. On a hybrid real-and-synthetic quadrotor corpus spanning nine motion categories, AeroCast reduces Average Displacement Error and Final Displacement Error by approximately 50% relative to the baselines over a five-second horizon, and achieves the lowest negative log-likelihood and Continuous Ranked Probability Score among all compared methods. Ablation analysis identifies velocity input and model capacity as the primary contributors to prediction quality, and positional encoding as essential for long-horizon trajectory coherence. AeroCast inference completes in 0.1ms per sample, compatible with real-time onboard deployment at 100Hz.

[AI-68] JupOtter: Cell-Level Bug Detection in Jupyter Notebooks

链接: https://arxiv.org/abs/2606.23877
作者: Lukas Ottenhof,Thibaud Lutellier
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at the 42nd International Conference on Software Maintenance and Evolution - ICSME 2026 (Research Papers Track)

点击查看摘要

Abstract:Jupyter Notebooks are an increasingly popular coding environment used across many domains, especially in Python-based data science and scientific computing. Originally used for prototyping and interactive exploration, notebooks are increasingly used to develop more complex programs, leading to a rapid rise in buggy notebooks on platforms like GitHub. To address this trend, we present JupOtter, a bug detection system designed specifically for Jupyter Notebooks. JupOtter features three novel contributions: (1) a notebook-specific tokenization strategy that preserves cell structure, (2) a cell-level bug prediction technique, and (3) a new labeled dataset, OtterDataset, containing over 21,000 notebooks annotated for fine-grained cell-level bug detection. JupOtter achieves cell-level bug detection F1 scores that surpass static analyzers and large language models in two out of three evaluation datasets.

[AI-69] MGI: Member vs Generated Inference ECCV2026

链接: https://arxiv.org/abs/2606.23872
作者: Bihe Zhao,Michel Meintz,Juangui Xu,Franziska Boenisch,Adam Dziedzic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ECCV 2026

点击查看摘要

Abstract:As generative models increasingly produce samples that are indistinguishable from human-created content, it becomes difficult to determine whether a given data point was part of a model’s natural training set or was generated by the model itself, especially when models memorize and reproduce training data. We formalize this challenge as Member vs Generated Inference (MGI): given a sample and a target generative model, infer whether the sample is a true training member or a generated output of that model. Focusing on image generation, we show that existing membership inference methods systematically misclassify generated samples as training members, while attribution-based methods often misclassify true members as generated. This failure arises because both approaches rely on likelihood-related signals that are similarly elevated for training examples and for the model’s own outputs. To address MGI, we propose Data Circuit Breaker (DCB), a three-stage method that combines complementary signals from a generative model’s autoencoder and latent generator to distinguish training members from generated samples. Across multiple generative models, including image autoregressive and diffusion models, DCB consistently addresses the shortcomings of membership inference and attribution methods, remains effective even when models reproduce near-duplicates of training samples, and generalizes to challenging model derivative settings in which new models are trained on generated data.

[AI-70] Are Safety Guarantees in Neural Networks Safe? How to Compute Trustworthy Robustness Certifications

链接: https://arxiv.org/abs/2606.23858
作者: Merkouris Papamichail,Konstantinos Varsos,Giorgos Flouris,João Marques-Silva
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:A primary challenge in AI safety is the existence of adversarial examples – slightly distorted inputs that cause a neural network (NN) to misclassify. To mitigate this problem, recent research focuses on the computation of robustness certifications, which, for a given input, determine the largest distortion the input may receive without breaking the network’s prediction. Robustness certifications can be interpreted as an axis-aligned hyper-rectangle (multi-dimensional intervals). Most existing approaches focus on maximizing the certification’s volume, but recent intractability results prohibit the computation of volume-optimal certifications in reasonable time. We introduce the apothem measure and show how to compute apothem-optimal certifications in a linear number of calls to a NN verifier (oracle) w.r.t. the input domain’s diameter. Moreover, we prove that we cannot have a volume-optimal, oracle-based algorithm, even if we discard the oracle costs. Also, we introduce dual certifications – an interval including all instances of a class – thus providing apothem-minimum upper bounds to a robustness certification. Further, we present the ParallelepipedoNN system, which we evaluate on the standard MNIST and Fashion MNIST benchmarks. A preliminary comparison with existing work on the same datasets reveals at least two-fold improvement w.r.t. the minimum edge length.

[AI-71] Deciphering Fingerprints of 3D Molecular Surfaces for Accurate Epitope Prediction

链接: https://arxiv.org/abs/2606.23830
作者: Fang Wu,Weihao Xuan,Jure Leskovec,Yejin Choi,Li Erran Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecular surfaces encode the geometric and physicochemical patterns that determine antibody-antigen recognition, central to epitope prediction. However, existing methods rely on sequences or backbone structures and struggle to capture discontinuous, surface-driven epitopes. This study presents SurfBind, a surface-centric learning framework for epitope prediction that operates directly on molecular surface representations. SurfBind integrates geometric and physicochemical cues through a Transformer-based architecture with patch-level surface modeling, binder-aware cross-attention, and a hierarchical coarse-to-fine prediction paradigm. Experiments on challenging epitope identification benchmarks, including SAbDab and DB5.5, demonstrate that SurfBind achieves state-of-the-art performance and strong generalization across unseen antibodies and conformational states, highlighting the value of interaction-aware surface modeling for understanding the crucial mechanisms of protein-protein interactions.

[AI-72] Cryptographic certificates of validity for trustworthy AI

链接: https://arxiv.org/abs/2606.23768
作者: Murdoch J. Gabbay
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We propose cryptographic certificates of validity for agentic AI systems. The core idea is to formally specify a correctness or policy condition as a logical predicate, compile this predicate to a witness-checking problem over polynomial constraints, and use a succinct cryptographic proof system (and optionally zero-knowledge) to certify that the condition holds. This offers a middle ground between formal verification of source code, and cryptographic authentication. An agent’s action can be accompanied by an independently checkable proof that it satisfies an agreed formal policy, without requiring the verifier to trust the agent or to re-execute computation. We outline the approach at a high level, give the core mathematical translation, relate the proposal to proof-carrying code, zkVMs, formal methods, and agent governance, and note the specification, auditing, and deployment questions that a full implementation must answer. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO) MSC classes: 03B70, 68T27, 68T42 ACMclasses: D.2.4; F.4.1 Cite as: arXiv:2606.23768 [cs.CR] (or arXiv:2606.23768v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2606.23768 Focus to learn more arXiv-issued DOI via DataCite

[AI-73] Neuromorphic Speech Enhancement with Dual-Branch Spiking Neural Networks INTERSPEECH2026

链接: https://arxiv.org/abs/2606.23761
作者: Taiyu Meng,Wenbin Jiang,Haoyi Zhang,Yuhan Zhou,Haibing Yin
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 5 pages, 3 figures, 2 tables. Submitted to Interspeech 2026

点击查看摘要

Abstract:Spiking neural network (SNN)-based neuromorphic speech enhancement has emerged as a promising paradigm due to its energy efficiency, yet it still underperforms classical artificial neural network (ANN)-based approaches owing to binary activations and the lack of well-designed network architectures. To overcome this limitation, we propose a novel dual-branch spiking neural network architecture equipped with a gated spiking unit (GSU), termed GSU-DBNet. Specifically, GSU-DBNet simultaneously models the speech magnitude spectrum and complex spectrum, predicting the corresponding magnitude and complex spectral masks. Meanwhile, a dual-path GSU module is adopted to exploit temporal and frequency information for enhanced spatiotemporal feature representation. Experiments on a popular benchmark dataset show that GSU-DBNet achieves a PESQ score of 3.04 with only 394K parameters, outperforming existing SNN-based methods while using only 4.5%–10.6% of the parameters of representative ANN-based models.

[AI-74] VeriPilot: An LLM -Powered Verilog Debugging Framework

链接: https://arxiv.org/abs/2606.23759
作者: Yihan Wang,Cheng Liu,Jiazheng Zhang,Lei Zhang,Long Cheng,Xiaowei Li,Huawei Li
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Verilog debugging remains one of the most time-consuming stages in digital circuit design. Recent advances in Large Language Models (LLMs) have enabled automated debugging; however, most existing approaches rely solely on test outputs and compiler feedback in an end-to-end manner, limiting their effectiveness on complex bugs. A key challenge is that the root cause of an error may be far removed from its observable outputs, making it difficult for LLMs to trace long dependency chains in code. This challenge is further exacerbated in large codebases, where long context lengths hinder efficient reasoning. To address these limitations, we propose VeriPilot, an LLM-powered debugging framework that leverages golden reference models to enable fine-grained bug localization and repair. VeriPilot goes beyond output-level comparison by aligning internal variable semantics between the Verilog design and its corresponding golden model through LLM-based analysis. It then performs step-by-step signal tracing using Control-Data-Flow Graphs (CDFGs) derived from static analysis, identifying a minimal set of suspicious code regions along with their correct counterparts from the golden model. These structured insights are subsequently provided to the LLM to guide reasoning and automated code repair. Experimental results on the Comprehensive Verilog Design Problems (CVDP) benchmark from NVIDIA demonstrate that VeriPilot improves the repair success rate of GPT-4o from 54.3% to 85.71%, significantly enhancing both bug localization accuracy and repair effectiveness for complex Verilog designs. The source code and benchmark are publicly available at Github this https URL.

[AI-75] Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios

链接: https://arxiv.org/abs/2606.23758
作者: Xiran Wang,Jian Zhang,Lei Qi,Yang Gao,Yinghuan Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Domain generalization learns from multiple source domains to generalize to unseen target domains. However, it often neglects the realistic case of label mismatch between source and target. Open set domain generalization is then proposed to recognize unseen classes in unseen domains. A simple approach trains one-vs-all classifiers to separate each class and detect outliers as unknown. Yet, the imbalance between few positive samples and many negative samples skews the decision boundary towards the positive ones, leading the model to over-reject out-of-distribution data, even from known classes in unseen domains. In this paper, we propose a novel meta-learning stategy called dualistic MEta-learning with joint DomaIn-Class matching (MEDIC), which considers implicit gradient matching towards inter-domain and inter-class task splits simultaneously to find optimal boundaries balanced for both domains and classes. Experimental results show that MEDIC not only outperforms prior methods in open set scenarios, but also maintains competitive close set generalization ability.

[AI-76] Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery

链接: https://arxiv.org/abs/2606.23757
作者: Runzhe Liu,Zihao Wang,Wenbo Yang,Shengyang Tao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extracting interpretable governing equations from sparse, noisy chemical time-series data remains difficult because discrete reaction topology and continuous kinetic parameters are tightly coupled. We present PC-MCMC-CIGP, a reproducible gray-box workflow that combines spike-and-slab topology sampling, hard conservation and thermodynamic screening, and a Chemical-Informed Gaussian Process (CIGP) residual model for parameter calibration and experimental design. The methodological contribution is not a new MCMC or GP family in isolation; rather, it is the integration of these components into a physically constrained workflow with explicit uncertainty-aware acquisition choices. On the H2 + Br2 benchmark, the constrained sampler distinguishes elementary radical pathways from deceptive phenomenological fits in our experiments. On styrene epoxidation, the CIGP optimization loop improves final yield by 12.5% over the reported GP-BO baseline. A new 10-seed acquisition study shows that EI, GWU, PC-EI, uncertainty sampling, discrepancy hunting, and random search have different trade-offs: PC-EI substantially reduces low-yield BO suggestions, while EI-style criteria give the strongest final-yield performance.

[AI-77] Low-power analogue neural networks with trainable nonlinear connections for continuous control

链接: https://arxiv.org/abs/2606.23742
作者: Ian T. Vidamour,Fernando Aguirre,Thomas J. Hayward,Matthew O. A. Ellis,Charles Swindells,Alexander McDonnell,Martin Trefzer,Finley Robins,Luca Manneschi,Susan Stepney,Tony Kenyon,Oliver J. Sutton,Jack C. Gartside,Ivan Y. Tyukin,Adnan Mehonic,Eleni Vasilaki
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Preprint. Further verification of all simulations is ongoing. Any resulting corrections will be incorporated in a revised version

点击查看摘要

Abstract:Physical neural networks promise low-power machine learning by computing directly with analogue device physics, but most architectures force nonlinear device responses to act as scalar weights. Inspired by Kolmogorov-Arnold networks, we place trainable nonlinear functions on the connections, making each physical connection a learnable computational element. Realising these functions as analogue band-pass filters on field-programmable analogue arrays, we find that the benefit is task-dependent and follows from the smoothness of the physical basis: the networks represent smooth, continuously valued targets, including robotic kinematics, continuous control, and photovoltaic maximum-power-point tracking, with far fewer nodes and connections than multilayer perceptrons, but offer no parameter-efficiency advantage on classification-like decision boundaries. Trained networks transfer to hardware across approximately 35,000 connections with quantified fidelity, and a dedicated CMOS implementation is projected to operate at approximately 30 microwatts. A memristive realisation reproduces the same behaviour in simulation, indicating that the advantage comes from placing trainable nonlinearity on connections, rather than from a particular device.

[AI-78] A Survey on Federated Causal Discovery and Inference

链接: https://arxiv.org/abs/2606.23741
作者: Xianjie Guo,Yuwei Wang,Guodu Xiang,Xiaoli Tang,Kui Yu,Han Yu,Qiang Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 27 pages, 4 figures, 2 tables, journal

点击查看摘要

Abstract:Causal reasoning, which encompasses the discovery of causal structures and the inference of causal effects, is fundamental to data-driven decision making. In practice, data for reliable causal analysis are often distributed across institutions and cannot be centralized due to privacy regulations or communication constraints. Federated learning (FL) addresses this by enabling collaborative analysis without raw data sharing, giving rise to the rapidly growing field of federated causal discovery (FCD) and inference (FCI). However, the interdisciplinary nature of this field and the absence of a comprehensive survey present barriers to entry for researchers. This paper bridges that gap by providing a systematic review through multi-dimensional taxonomies. Grounded in the three core design decisions underlying any FCD solution, namely how structures are learned, how data are partitioned, and what structural knowledge each party obtains, we organize FCD along three axes: methodological paradigm, federation topology, and structural scope. We further examine key practical dimensions, including temporal dynamics, data heterogeneity, missing data, and non-identical variable sets. For FCI, we categorize methods by target estimand (average versus individualized/conditional treatment effects) and by estimation strategy, from classical weighting methods to modern deep generative architectures. Unlike prior works that treat FCD and FCI separately, we formalize their connection as complementary stages of a unified federated causal reasoning pipeline, where FCD supplies the structural knowledge required for valid effect estimation in FCI. Finally, we highlight their shared concerns regarding privacy, communication efficiency, theoretical guarantees, and application domains, and conclude by identifying open challenges for future research.

[AI-79] Weight-Space Geometry of Offline Reasoning Training ICML2026

链接: https://arxiv.org/abs/2606.23740
作者: Aleksandr Nikolich,Igor Kiselev,Vladimir Platonov,Karina Romanova
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted for ICML 2026 workshop

点击查看摘要

Abstract:Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they are mechanistically distinct or converge to a similar weight update. Training six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from a single base model (Qwen3-4B) with attention-only LoRA, we analyze the resulting deltas via cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA. We observe: (i) SFT, RFT, and RIFT have nearly colinear weight deltas (cosine = 0.97, top-1 principal angle ~7 deg median over 144 modules) and comparable GSM8K accuracy (87-88%, n=1319; pairwise McNemar p = 0.15); (ii) DFT diverges further in direction than any reward-weighted method despite using the same data; (iii) Offline GRPO adds a substantial component orthogonal to the SFT direction (~67% globally, up to ~86% in late layers) while staying in the SFT loss basin; (iv) DPO sits in a near-orthogonal subspace, shows a mode-connectivity barrier, and collapses late-layer CKA to ~0.46. DPO also reaches the highest accuracy in our protocol on both GSM8K (93.5%, McNemar p 10^-9 vs. each other method) and AIME26 (30.0% vs. 3.3-10.0%); its training uses a 10x smaller learning rate than the others (the standard convention), so the update-norm and accuracy gaps reflect loss-function and optimizer choices jointly, and a learning-rate-matched DPO comparison is left for future work.

[AI-80] When Multi-Sensor Fusion Fails to Generalize: Cattle Posture Classification Under Animal-Level and Temporal Distribution Shift

链接: https://arxiv.org/abs/2606.24986
作者: Leutrim Uka,Severino Pinto,Gundula Hoffmann,Marina M.-C. Höhne
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures

点击查看摘要

Abstract:Automated cattle posture-classification systems frequently report near-perfect accuracy, yet their robustness under realistic deployment conditions remains largely unknown. In particular, it is unclear whether multimodal sensor fusion improves generalisation or leads models to rely on context-specific signals that fail under distribution shift. Here, we evaluate the robustness of automated posture classification (lying versus standing) using collar accelerometers, rumen-bolus sensors, and environmental measurements collected from a pasture-based beef cattle herd across two consecutive years (2024-2025). XGBoost served as the primary model, with Logistic Regression, Random Forest, and Long Short-Term Memory networks evaluated as comparative baselines. Model robustness was assessed under progressively more stringent evaluation protocols, ranging from conventional random train-test splits to leave-one-animal-out validation and cross-year evaluation on an independent cohort of previously unseen animals recorded one year later. While multimodal models achieved strong within-year performance (macro-F1 0.94), the performance declined substantially under cross-year evaluation (macro-F1 0.49). Explainability analysis revealed persistent reliance on rumen-bolus activity and environmental variables even when predictive performance deteriorated. Distribution-shift diagnostics further confirmed substantial differences in feature distributions between recording years. Our findings demonstrate that commonly used evaluation protocols can substantially overestimate real-world performance and that multimodal sensor fusion may reduce, rather than improve, robustness under temporal distribution shift. More broadly, the results highlight that benchmark accuracy alone is insufficient to assess deployment readiness and underscore the need for robustness-centred evaluation in livestock-monitoring research.

[AI-81] Retrieval-Augmented Personalization with Foundation Models for Wearable Stress Detection

链接: https://arxiv.org/abs/2606.24985
作者: Louis Simon,Mohamed Chetouani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Personalization in wearable-based stress detection remains challenging due to substantial inter-individual variability in physiological and behavioral responses. While traditional approaches rely on user-specific fine-tuning or costly self-supervised pre-training on large datasets, we propose a lightweight alternative based on retrieval-augmented personalization. Our method leverages frozen, out-of-domain foundation models to retrieve similar patterns from a target user’s history and encode them into a compact personalized embedding that modulates representations extracted by a lightweight transformer network. We evaluate our approach on the WESAD stress detection dataset with N=15 users, comprising wrist-worn physiological (EDA, BVP, temperature) and activity (accelerometer) signals, and report gains of +3.92% in accuracy and +4.76% in macro F1-score over a non-personalized transformer baseline, approaching supervised fine-tuning performance without requiring any labeled user data. We further show that temporal retrieval, where only prior user samples are available, achieves performance close to full intra-user retrieval, demonstrating robustness to limited user history. Finally, we explore personalization in a cross-dataset retrieval setting, leveraging embeddings from the K-Emocon dataset to personalize representations for stress detection on the WESAD dataset.

[AI-82] Quantifying Explainable AI-introduced signal noise on ECG data with Spectral Entropy

链接: https://arxiv.org/abs/2606.24974
作者: David A. Kelly,Nathan Blake
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to EUSIPCO 2026

点击查看摘要

Abstract:Explainability techniques are used to assess the output of various deep learning models. This is especially true in healthcare, where models need to be trusted and decisions justified. Explainability (XAI) tools use heuristics which often add signal noise to the explanation “core”. It is not always obvious what is signal from the model and what is noise from the XAI. We propose the use of spectral entropy as a measure of noise in XAI output. We demonstrate its usefulness in the context of classifying arrhythmias in an ECG dataset with different post hoc explainability techniques.

[AI-83] What Do Language Priors Contribute to Darcy-Flow Inversion? A Mechanistic Audit

链接: https://arxiv.org/abs/2606.24967
作者: Taiga Saito,Yu Otake,Daijiro Mizutani,Sopheakpolin Mom
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In ill-posed inverse problems, the recovered solution depends as much on the prior as on the data, yet much of the engineering knowledge that could serve as that prior is recorded qualitatively rather than in formal mathematical form. Here we test whether sentence embeddings can act as an inference-time interface for injecting geological descriptions into a learned Darcy-flow inverse solver. Across six synthetic geological classes and an exploratory transfer to a benchmark reservoir model (SPE10), we vary only the conditioning representation and find that text conditioning reduces reconstruction error by 81 % relative to a no-text counterfactual. Most of this gain comes from a categorical, class-level constraint whose value concentrates where the hydraulic head leaves the conductivity field underdetermined, while within-class geometric detail is secondary and pattern-dependent. Compared with a discrete class label, sentence embeddings add little dense-observation accuracy but improve training stability and enable paraphrase-based sensitivity analysis and open-vocabulary inputs. These results show that language priors can serve as an engineering-informatics interface for injecting geological knowledge into learned inverse solvers, while clarifying when they help and what signal they actually carry.

[AI-84] Project Auto-World: Towards Automated Benchmarking of Neural Relational Reason ers NEURIPS2026

链接: https://arxiv.org/abs/2606.24965
作者: Anirban Das,Joanne Boisson,Irtaza Khalid,Sumita Garai,Steven Schockaert
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to NeurIPS 2026 ED track. Code is available at this https URL

点击查看摘要

Abstract:Reasoning about relational structures remains a significant challenge for neural models, particularly when they must systematically apply learned knowledge to problem instances that are harder than those seen in training. Progress is hampered by the difficulty of evaluating such generalization, since a priori, it is rarely clear what makes an instance hard. We study how this issue can be addressed by using large language models (LLMs) to automate benchmark generation, learning to produce increasingly challenging instances in an end-to-end manner. Concretely, given a world parametrized by Datalog rules, and an Edge Transformer as the reasoning evaluator, we use LLM-driven evolutionary search (based on FunSearch) and autonomous agentic search to discover sampling functions that yield hard problem instances. We also show that the Edge Transformer can be improved using this data such that it generalizes well to further data perturbations. Finally, we show that the same machinery can be applied to novel worlds proposed by LLMs, opening the door to autonomous research on neural relational reasoning.

[AI-85] Enhancing Clinician Decision-Making via Uncertainty-Aware Multi-Expert Fusion for Stroke Rehabilitation

链接: https://arxiv.org/abs/2606.24960
作者: Tamim Ahmed,Thanassis Rikakis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tailoring stroke rehabilitation requires assessing how movements are organized, not merely if they succeed. Currently, this assessment is a rate-limiting bottleneck. Instruments like the Action Research Arm Test (ARAT) compress rich behavioral observations into single ordinal endpoints, discarding the movement-quality details that distinguish recovery from compensation. Automated alternatives typically chase accuracy on noisy, single-observer labels to output opaque scores - a technology-centric approach that rarely reaches clinical practice. To address this, we present xAARA: an engine designed to augment rather than replace clinical judgment. From multi-view video, xAARA returns ARAT assessments with calibrated uncertainty and explanations across task, movement-phase, and movement-quality levels. Treating clinical scoring as an ill-posed inference problem, xAARA composes 692 calibrated multimodal models via a Dynamic Bayesian Network with entropy-based gating. It qualifies results against clinical validity rules and defers low-confidence cases. In 105 stroke survivors (788 exercises), xAARA achieved 94.2% task accuracy (Cohen’s kappa=0.934) and 81.3% movement-phase accuracy (kappa=0.727), reducing predictive uncertainty by 96.1% compared to single-clinician scoring. For subjective cases, it matched at least one rater 100% of the time and never returned out-of-range scores. Four independent clinicians validated the assessments and indicated willingness to adopt the system. We argue that principled uncertainty quantification and clinician-aligned explainability are the critical bridges moving automated assessment from technical demonstration to a deployable clinical tool.

[AI-86] Reliable Conformal Prediction for Ordinal Classification Using the Ranked Probability Score

链接: https://arxiv.org/abs/2606.24959
作者: Stefan Haas,Luca Killmaier,Alireza Javanmardi,Eyke Hüllermeier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ordinal classification (OC) arises in high-stakes domains such as medicine and finance, where uncertainty quantification must account for the severity of ordinal errors. Conformal prediction (CP) provides distribution-free prediction sets with marginal coverage guarantees; however, its practical effectiveness depends critically on the choice of nonconformity function. We introduce a CP method for ordinal classification based on the ranked probability score (RPS), a proper scoring rule defined over cumulative predictive distributions. Although it reflects ordinal risk quite naturally, it has largely been neglected in conformal ordinal prediction (COP). When used as a measure of nonconformity, RPS yields median-centered contiguous prediction sets by construction. The method is model-agnostic, supports both assessed and grouped ordered categorical outcomes, and permits efficient implementation compared to greedy interval selection procedures. Across multiple ordinal image and tabular datasets, RPS-based CP produces contiguous prediction sets and strikes a favorable balance between prediction set width and the magnitude of ordinal miscoverage relative to existing CP methods.

[AI-87] Convex–Concave Quadratic Spectral Filtering for Graph Neural Networks

链接: https://arxiv.org/abs/2606.24956
作者: Ranhui Yan,Jia Cai,Mengzhu Chen,Haodong Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spectral graph neural networks (GNNs) interpret message passing as frequency-selective filtering. While low-order spectral filters are efficient, their limited selectivity often leads to weak attenuation outside the passband, whereas high-order alternatives introduce optimization challenges. We propose DCQ-GNN, a spectral GNN based on a compact bank of adaptive convex–concave quadratic filters. By restricting the filter order to two while explicitly exploiting complementary curvature, DCQ-GNN improves spectral selectivity as quantified by Dirichlet energy and entropy measures without resorting to high-order polynomial expansions. The model fuses filter outputs through a node-adaptive gating mechanism to enable node-wise structure-aware spectral selection. We provide a formal spectral analysis grounded in Dirichlet energy attenuation, von Neumann entropy, and curvature polarity, and derive explicit characterizations of filter behavior across varying levels of homophily and structural perturbations. Extensive benchmarks on 10 datasets show that DCQ-GNN ties for the top average rank (3.0) on heterophilic graphs and obtains the second-best rank (4.2) on homophilic graphs, remaining competitive with representative high-order polynomial spectral filters. Furthermore, under strong structural perturbations, DCQ-GNN exhibits substantially smaller performance degradation compared to both first-order and high-order baselines. These results demonstrate that curvature-aware quadratic banks provide a robust and efficient alternative to high-order spectral models while preserving optimization stability and computational efficiency.

[AI-88] MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios

链接: https://arxiv.org/abs/2606.24950
作者: Patara Trirat,Jin Myung Kwak,Jay Heo,Heejun Lee,Sung Ju Hwang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 3 figures

点击查看摘要

Abstract:Financial decision-making is contextual: forecasting prices, valuing companies, and assessing event exposure weigh price history, accounting fundamentals, macroeconomic regime, and contemporaneous text. A benchmark over these four signals is hard to build because finance violates four assumptions of time-series evaluation: text must be gated by its publication date to prevent look-ahead, quarterly fundamentals are reported with a one- to ninety-day lag, filing text is partly redundant with the numerical statement fields it accompanies, and macroeconomic regimes leak across calendar splits. No public benchmark addresses all four signals jointly. MacroLens covers 4,416 U.S. small- and micro-cap equities over 2021-2026. Seven tasks share one point-in-time panel of prices, 46.8M XBRL accounting facts, 53 macroeconomic series, 295,860 SEC filings, and 215,882 news articles, plus a scenario layer of 1,130 macroeconomic events across 49 types automatically detected and rendered as natural language. Tasks span contextual forecasting, public and private valuation, statement generation from fundamentals and descriptions, scenario-conditioned returns, and real-estate valuation. We evaluate 19 methods across six families spanning naive heuristics through time-series foundation models, fine-tuned LLM-based time-series models, and zero-shot large language models (LLMs), plus a five-step feature-context ablation on two frontier LLMs and a gradient-boosted baseline. MacroLens is released at this https URL.

[AI-89] What Does a Pathological Speech Assessment Model Know about Acoustic Features? A Case Study on Oral and Oropharyngeal Cancer Patients

链接: https://arxiv.org/abs/2606.24949
作者: Tuan Nguyen(LIA, AU),Corinne Fredouille(AU, LIA),Alain Ghio(LPL),Muriel Lalain(LPL),Virginie Woisard(UT2J, UT3, LNPL)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:This work investigates the interpretability of a Wav2Vec 2.0based speech intelligibility assessment model for oral and oropharyngeal cancer patients through canonical correlation analysis. By measuring the correlation between the model embeddings and eGeMAPS low-level descriptors (LLDs) as an interpretable reference, we analyze how acoustic information is encoded across the model layers. The analysis is conducted at two levels: individual LLDs layer-wise, and group-level: prosodic, spectral, and voice quality. Results show that the learned representations are most strongly correlated with spectral and prosodic features, with the first MFCC coefficient yielding the highest correlations across all layers. At the group level, spectral and prosodic groups achieve correlations of 0.77 and 0.71 respectively, while voice quality reaches 0.65. Beyond model interpretability, this work also offers practical guidance on acoustic feature selection for pathological speech assessment.

[AI-90] Holographic Memory for Zero-Shot Compositional Reasoning in Knowledge Graphs: A Mechanistic Study of Where and Why It Fails

链接: https://arxiv.org/abs/2606.24948
作者: Randhir Kumar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures, 5 tables. Code available at this https URL

点击查看摘要

Abstract:Knowledge graph embedding (KGE) models predict single-hop links well but have no mechanism for zero-shot compositional queries: multi-hop questions whose relation chains never appeared during training. Holographic Reduced Representations (HRR), which bind and unbind symbols via circular convolution, are a theoretically attractive candidate, since binding is approximately invertible and associative. We test whether this promise holds. We study two holographic memory variants, real-valued HRR and phase-only Fourier HRR (FHRR), each with a modern Hopfield cleanup, on FB15k-237 over five seeds. Four findings follow. First, both are competitive single-hop retrievers (filtered MRR 0.358 +/- 0.002 for HRR, 0.350 +/- 0.021 for FHRR). Second, neither composes zero-shot: accuracy stays at chance across all cleanup temperatures. Third, the main contribution, we localise the failure mechanistically. A hop-1 probe shows the memory recovers the correct intermediate entity with high fidelity (MRR 0.896 +/- 0.002 for HRR), yet composition still fails even with a verified-correct intermediate. A second probe shows why: posing the ground-truth second-hop fact as a standalone atomic query, bypassing composition entirely, already recovers it at only 0.26 to 0.48x average atomic accuracy, uniformly across relation fan-out. The bottleneck is not the bind-unbind algebra or the cleanup; it is that facts compositional chains pass through are intrinsically harder for the superposed memory to retrieve, a capacity and interference effect present already at a single hop. Fourth, we prove (Lemma 4.1) that FHRR’s softmax cleanup is not phase-equivariant, compounding the primary failure on the minority of chains where hop-1 itself errs. Fixing zero-shot composition requires improving retrieval capacity under superposition, not just redesigning the cleanup. Comments: 15 pages, 5 figures, 5 tables. Code available at this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.24948 [cs.LG] (or arXiv:2606.24948v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.24948 Focus to learn more arXiv-issued DOI via DataCite

[AI-91] EmotionAI: A Privacy-Preserving Computational Intelligence Pipeline for Speech-Emotion-Grounded Conversational Analysis

链接: https://arxiv.org/abs/2606.24941
作者: Wai Laam Mak,Isibor Kennedy Ihianle,Pedro Machado
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures. Submitted to UK Workshop on Computational Intelligence (UKCI 2026)

点击查看摘要

Abstract:Reviewing recorded interviews for affective cues such as composure, hesitation and agitation is slow and subjective, and cloud services that could automate it require sensitive audio to leave the device. EmotionAI is a fully local Computational Intelligence (CI) pipeline that couples Speech Emotion Recognition (SER) with generative reasoning. Speaker diarisation, Whisper Automatic Speech Recognition (ASR) and a wav2vec2 emotion classifier produce per-segment affective evidence, which is then passed to an adversarial three-model local Large Language Model (LLM) panel for timestamp-grounded and citation-constrained question answering. Zero-shot evaluation on the RAVDESS four-class English subset (n = 672) exposes cross-corpus fragility rather than classifier superiority: the deployed classifier scores 48.8% accuracy, above random (24.9%) and majority (28.6%) baselines but below an in-domain MFCC + logistic-regression comparator (71.0%). The complete pipeline runs in a mean 157 s on CPU (real-time factor approximately 1.33) with zero external calls. The contribution is not state-of-the-art SER but an auditable, privacy-preserving integration of imperfect affective evidence into grounded conversational analysis, together with an honest empirical account of where cross-corpus transfer and human-centred validation still fall short.

[AI-92] Velocity Prediction in Automatic Guitar Transcription

链接: https://arxiv.org/abs/2606.24912
作者: Jackson Loth,Xavier Riley,Simon Dixon,Emmanouil Benetos
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted for publication at the 34th European Signal Processing Conference (EUSIPCO)

点击查看摘要

Abstract:Automatic Music Transcription (AMT) models have achieved a high level of success in polyphonic transcription of various instruments. Velocity, typically a measure of note intensity, is less commonly predicted in these models due to the absence of velocity labels in available datasets and lack of a proper definition for instruments other than piano. We present a methodology and model for velocity prediction in Automatic Guitar Transcription (AGT) which uses virtual instruments to generate synthetic training data with velocity labels. We first pretrain a model on this synthetic data. These weights are then transferred to a different model and trained on real guitar audio, allowing the model to retain the working velocity prediction while also achieving high performance and generalisability from the real training data. The velocity prediction is shown to outperform a baseline model which does not use the pretrained velocity weights, when evaluated on synthetic data. In addition, using the pretrained velocity weights offers a small improvement in note transcription, though the magnitude of this improvement is limited and not always significant depending on the testing data. Overall the model achieves results comparable to the state of the art in guitar transcription, while also successfully predicting velocity.

[AI-93] Attractive and Repulsive Pattern Control in Sequence Generation

链接: https://arxiv.org/abs/2606.24911
作者: Francois Pachet
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures

点击查看摘要

Abstract:Variable-order Markov models preserve local symbolic syntax by adapting context length, but long continuations can enter recurring high-order “tunnels”: repeated suffixes, locally periodic passages, or copied fragments longer than the formal Markov order. This paper introduces signed pattern control for variable-order Markov generation with BP-Regular sampling. A weighted recurrence automaton computes an activation R for a chosen family of target patterns, and belief propagation samples exactly from P_beta(x) proportional to P_0(x) exp(beta R(x)). Negative coupling makes the target patterns costly during sampling; positive coupling rewards the same patterns and turns them into controlled attractors. The target family may be mined online from overactive generated material, supplied by a score or style vocabulary, or designed as an experimental probe. The main experiments use the online homeostatic case, choosing patterns that become overactive in the sampling history. On six duration-bearing monophonic sources, including Bach and Telemann material, the negative branch reduces generated 8-gram self-reuse, increases the effective number of generated 8-grams, and increases coverage of training-supported 4-gram contexts while preserving substantial lower-order support. A pitch-sequence replication on five Weimar Jazz Database solos gives the same anti-reuse signature outside Baroque material. The same signed mechanism also provides a positive branch for probing attractor basins, phase transitions, and hysteresis in the underlying variable-order model. Comments: 16 pages, 6 figures Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.24911 [cs.SD] (or arXiv:2606.24911v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2606.24911 Focus to learn more arXiv-issued DOI via DataCite

[AI-94] Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation

链接: https://arxiv.org/abs/2606.24902
作者: Arnesh Banerjee,Ayushi Bhattacharjee
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The “First Proof” benchmark [1] posed ten research-level mathematics questions to the strongest publicly available LLMs and found them consistently wrong-not silent, but confidently, fluently wrong. This paper asks why. Working from the per-question post-mortems in First Proof’s Appendix A, I identify four failure modes: citation fabrication (F1), premise smuggling (F2), silent problem reformulation (F3), and local-to-global compatibility gaps (F4). I then audit eight one-shot proofs generated by Gemini 2.5 Flash on Questions 1, 2, and 5 of the benchmark, using two instruments built specifically to surface F1 and F2. The central finding is uncomfortable for anyone who sees retrieval-augmented generation (RAG) as the obvious fix: not one of the eight proofs contained a confirmed fabricated citation, yet every single one contained at least one load-bearing claim asserted as a “fundamental result” or “standard argument” with no justification attached. That failure mode-F2, premise smuggling-is invisible to citation verification by design. A premise-audit instrument I introduce flags it at 100% precision (5/5 judge-confirmed flags are true positives) and 50% proof-level recall in this corpus. The taxonomy and the audit together suggest that the right long-term objective is building inference-time pipelines that prevent these failure modes from occurring, not just detecting them after the fact. Index Terms–Large language models, mathematical reasoning, hallucination, premise smuggling, failure-mode taxonomy.

[AI-95] LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning

链接: https://arxiv.org/abs/2606.24901
作者: Hao Jiang,Enneng Yang,Guojie Zhu,Yibin Chen,Yunkun Xu,Zifu Kou,Jiayi Li,Chong Chen,Zhao Cao,Li Shen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual learning capability is critical for Industrial LLMs, as deployed models must be continuously updated to meet evolving requirements and environments, rather than repeatedly retrained from scratch. However, most existing research focuses on improvements on static benchmarks, failing to capture real industrial needs. In this survey, we reformulate Industrial Continual Learning (ICL) for LLMs as a closed-loop update-and-release problem in a versioned ecosystem, where updates propagate hierarchically to industrial, application-specific models and LLM-powered applications, with capability inheritance and transfer across versions and model families. From this ecosystem perspective, we identify three core challenges: repeated adaptation erodes model plasticity, foundation-model upgrades break capability inheritance, and long-term sustainability is constrained by deployment requirements. We then organize the technical landscape of ICL around five lifecycle design principles: preserving plasticity headroom, treating upgrades as capability transfer, enabling trustworthy continual reinforcement learning, making training recipes self-optimizing, and building accountability as a base layer for long-term iteration. For each principle, we synthesize representative technical directions. Finally, we evaluate the maturity of each principle and its technical components via an evidence-based lens, identify key gaps hindering real-world deployment, and outline a practical ICL deployment blueprint and a pathway for feeding industrial realities back into academic research.

[AI-96] On-Device Neural Architecture Search

链接: https://arxiv.org/abs/2606.24900
作者: Andrea Mattia Garavagno,Edoardo Ragusa,Paolo Gastaldo,Antonio Frisoli,Claudio Loconsole
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes a new approach to near-sensor computing, in which a lightweight Neural Architecture Search (NAS) is performed directly on the deployment device to find the best tiny neural architecture for analyzing the real-time data acquired through sensors. This new adaptation capability can be particularly useful in the case of human-machine interfaces for which the neural network analyzing the biometrical data can be re-designed each time the user changes, after a guided data collection procedure, fighting the typical data variations between individuals on a new level. To implement the proposed approach a new NAS has been designed and then validated on the Italian Sign Language dataset (ISL), a collection of surface electromyography (sEMG) signals of the signs of the Italian alphabet, using several embedded systems. Moreover, further validation on the Case Western Reserve University dataset (CWRU), a benchmark for intelligent fault diagnosis, is presented to suggest another possible application of the proposed approach. When run on a Raspberry Pi 4, the proposed NAS performs beyond the state of the art proposing a tiny neural architecture having 0.63 times less RAM occupancy and 5.96 percentage points of more accuracy in the case of the ISL dataset; and 0.44 times less RAM occupancy and 0.2 percentage points of more accuracy in the case of the CWRU dataset.

[AI-97] From Meta Idea to Advanced Mathematical Discovery – Human-AI Co-Discovery of Sign-Embedding Quantum Algorithms

链接: https://arxiv.org/abs/2606.24899
作者: Yanqiao Wang,Jin-Peng Liu,Peng Li,Yang Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 35 pasges, 3 figures

点击查看摘要

Abstract:AI-assisted mathematics is often evaluated on solving predefined problems. In practice, however, many important advances begin earlier, when a vague research intuition is transformed into a concrete problem, a promising route, and a theorem family worth proving. This report studies that stage through a case study that led to sign-embedding quantum algorithms for matrix equations and matrix functions, foundational primitives in quantum linear algebra and operator-output quantum algorithms. The project began with a human-originated intuition that rational approximation is especially effective for jump-type functions such as the sign function, and might therefore serve as a design principle for quantum algorithms. Rather than merely assisting after the problem was fixed, AI-assisted exploration, including workflows later integrated into the agentic AI-mathematician system AIM, played a key role in expanding this intuition into a route map, comparing candidate formulations, and converging toward sign embedding as the central framework. AIM then helped connect a known matrix-sign identity to wider classes of matrix equations and matrix functions, and drafted proof and complexity calculations. The decisive scientific judgments remained human: selecting which human-AI-expanded routes were worth pursuing, rejecting a Cayley-trapezoidal approximation when its validity required a hidden condition, and refining the Sylvester implementation from a coarse quadratic-gap query route to the final factorized and scaled analysis. The report argues that human-AI co-discovery workflows, with systems such as AIM as important components, are most valuable not as standalone theorem provers, but as research partners for problem formation, connection discovery, derivation, and skeptical review inside a human-gated research loop.

[AI-98] Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models

链接: https://arxiv.org/abs/2606.24898
作者: Rituraj Sharma,Tu Vu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Looped language models turn hidden states into runtime state: each state is decoded for prediction and fed back into future computation. This creates a basic supervision question: which state variables does cross-entropy actually control? We show that dense per-loop cross-entropy controls the variables exposed by the readout, not every variable active in the recurrent transition. Hidden-state scale gives a concrete failure mode. Scale-invariant readouts such as RMSNorm and LayerNorm hide radial scale from the immediate cross-entropy loss, while pre-norm residual recurrence continues to carry and update that same scale. Thus per-loop loss can make early exits usable without controlling recurrent scale. In 44M and 129M looped transformers without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts still drives final hidden-state norms into the thousands or tens of thousands. Scale-visible readouts and explicit norm penalties keep norms in the tens, and scale-removing recurrence is the complementary architectural fix. The resulting design rule is simple: dense supervision trains exits; recurrent scale control requires either making scale visible to a loss or removing it from the loop. Consistent with this rule, scale-controlled variants achieve lower perplexity at matched inference-depth operating points in our variable-depth benchmarks.

[AI-99] RWGBench: Evaluating Scholarly Positioning in Related Work Generation

链接: https://arxiv.org/abs/2606.24894
作者: Anzhe Xie,Weihang Su,Jiaxin Mao,Yiqun Liu,Shaoping Ma,Qingyao Ai
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注: 9 pages, code and data available at this https URL

点击查看摘要

Abstract:Large language models have shown strong fluency in scientific writing, yet the evaluation of related work generation (RWG) remains limited. Existing RWG evaluations largely inherit summarization-oriented metrics, using lexical or semantic similarity to reference sections as proxies for quality. However, related work writing is fundamentally a citation-level scholarly positioning task: it requires selecting, organizing, and framing prior work to clarify how a target paper relates to, differs from, and contributes beyond existing this http URL a result, models may generate coherent and semantically-relevant text while exhibiting academically critical failures, such as inappropriate citation selection or misplaced references, that conventional metrics do not this http URL this end, we introduce \textbfRWGBench, a benchmark that evaluates RWG from the perspective of citation decision-making rather than text similarity. RWGBench is constructed from a large-scale collection of 40,108 computer science papers and a retrieval corpus of 1.09 million documents, with a carefully curated test set comprising 100 papers and their corresponding published related work this http URL propose a multi-dimensional evaluation framework that assesses citation selection, contextual appropriateness, organization, and discourse this http URL reveal systematic limitations in current systems that are obscured by standard evaluations, while Oracle studies further disentangle retrieval-level and generation-level bottlenecks. Human evaluation further shows that our citation-centric metrics align substantially better with expert judgment than surface-level text metrics. RWGBench offers a citation-centric testbed for developing and evaluating related work generation systems that are better aligned with scholarly writing practices.

[AI-100] ReviewGuard: Aligning LLM -Assisted Peer Review with Long-Term Scientific Impact

链接: https://arxiv.org/abs/2606.24892
作者: Abdur Rasool,Xiaohui Huang,Yanqing Hu,Linyi Yang
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Peer review is central to scientific quality control, yet it can undervalue papers that later achieve substantial citation impact. While frontier large language models have shown promise in automating aspects of peer review, they primarily mimic human reviewer preferences rather than predict long-term scientific value. We introduce ReviewGuard, a two-stage framework that aligns LLM-generated reviews with citation-based estimates of long-term scientific impact rather than contemporaneous reviewer judgments. On 20,861 AI/ML papers from OpenReview augmented with Semantic Scholar citation data, ReviewGuard achieves a Spearman correlation of \rho = 0.776 with future citations on rejected-then-published papers, outperforming human reviewers (\rho = 0.492) and a supervised Expert model (\rho = 0.681). Under the same decision threshold, ReviewGuard flags 10.2% of high-impact rejected papers, compared with 1.8% for human reviewers, corresponding to a 5.6x improvement. Our results demonstrate that impact-aligned reinforcement learning can provide editors with a complementary signal for identifying high-potential work, without replacing human judgment.

[AI-101] ype Checking Project Haystack Grids using JSON Schema and Pydantic

链接: https://arxiv.org/abs/2606.24891
作者: Thomas Hirsch,Samina Kadkhoda Masoumali,Gerald Schweiger
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Ontologies enable scalable energy services in buildings by supporting interoperability and automation. Project Haystack is a building ontology that is widely adopted due to its flexible, tag-based semantic model, openness, and extensibility, but suffers from ambiguous tag usage and limited automated validation. Although Project Haystack is formally open, its reliance on custom file formats and domain-specific languages that originate from the Haxall ecosystem creates a de facto barrier to integration. In this paper, we address these limitations by introducing a Python-based toolchain for Haystack. We present (i) a parser for Haystack definition files (Trio file format), and (ii) a code generator that derives Pydantic models and JSON Schema definitions from these parsed specifications. The resulting models enable static type checking and enable structural validation of Haystack grids within Python, as well as schema-based validation of JSON representations outside the Python ecosystem. All tools, generated models, and schemas are released publicly under an open-source license, with the goal of strengthening the Haystack ecosystem and opening a practical pathway beyond its current technical boundaries.

[AI-102] SE-AGCNet: An End-to-End Framework for Joint Speech Enhancement and Loudness Control in Meeting Scenarios INTERSPEECH2026

链接: https://arxiv.org/abs/2606.25959
作者: Jinming Zhang,Wei Rao,Xionghu Zhong,Eng Siong Chng
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted by Interspeech 2026

点击查看摘要

Abstract:Conventional audio pipelines typically treat speech enhancement (SE) and automatic gain control (AGC) as discrete modules, which often limits overall performance. For instance, applying AGC before SE may inadvertently amplify background noise, while prioritizing SE tends to over-suppress low-volume speech. To address these limitations, we propose SE-AGCNet, an end-to-end framework that jointly optimizes SE and AGC. Tailored for meeting scenarios with significant volume variations, SE-AGCNet leverages the synergy between the two tasks: SE preserves quiet speech, thereby facilitating effective volume adjustment by the AGC component. Furthermore, we propose a specialized data simulation pipeline, SE-AGC-DataGen, and incorporate standardized loudness evaluation metrics: integrated loudness (LUFS), short-term loudness (St LUFS), and LRA. Experiments show that SE-AGCNet consistently achieves target loudness while improving speech quality and ASR accuracy over competitive baselines.

[AI-103] Measurable Majorities Are Not Finitely Axiomatizable

链接: https://arxiv.org/abs/2606.25954
作者: Lawrence S. Moss,Arthur Paul Pedersen
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Combinatorics (math.CO); Logic (math.LO)
备注:

点击查看摘要

Abstract:This theoretical note studies the finite axiomatizability of strict majority reasoning in finite social decision frames. Moss and Pedersen (2026) doi: https://doi.org/10.48550/arXiv.2606.23853 introduce a coherence criterion that characterizes exactly when qualitative majority judgments are representable by a finitely additive measure. The question addressed here is whether that coherence criterion can be replaced, in the finite setting, by any bounded finite fragment. We prove that it cannot. For every k\ge 1 , we construct a maximal standard frame whose shortest coherence violation has length exactly 2k+2 . Hence there is no uniform finite bound on the incoherence index of social decision frames, resolving Conjecture 5.7 stated by Moss and Pedersen (2026). The construction is geometric, in the sense that it proceeds via orthogonality and dimension in rational vector spaces, and self-contained: it isolates a symmetric family of half-sized voting blocs and extends it to a maximal frame in which every shorter balanced obstruction is excluded. Along the explicit infinite sequence of universe sizes obtained in the construction, this also establishes the middle-layer family predicted by Conjecture B.25 by Moss and Pedersen (2026). Together with the soundness and completeness theorem for the Moss-Pedersen minimal logic for strict majorities, this establishes that measurable social decision frames are not finitely axiomatizable in that language.

机器学习

[LG-0] Real vs. Complex Spectral Bases for Neural Operators: The Role of Greens Function Alignment

链接: https://arxiv.org/abs/2606.24851
作者: Jason Sulskis,Sathya Ravi
类目: Machine Learning (cs.LG)
*备注: Submitted to/in consideration for the 62nd Allerton Conference on Communication, Control, and Computing

点击查看摘要

Abstract:Fourier Neural Operators (FNO) learn solution operators of partial differential equations by parameterizing global convolutions in the complex Fourier domain. For real-valued PDE solutions, the complex FFT carries representational redundancy through conjugate symmetry. We introduce the Hartley Neural Operator (HNO), the exact real-valued mirror of FNO: it replaces the FFT with the purely real Discrete Hartley Transform and learns a single real multiplier per retained spectral mode, with no complex arithmetic. Because the real Hartley spectrum is not halved by conjugate symmetry, HNO retains twice as many frequency corners as FNO but one real weight where FNO carries a complex pair, so the two operators are iso-parametric at equal width and differ only in spectral basis. Our central thesis is that the best basis is a property of the operator. Self-adjoint elliptic operators (Poisson, biharmonic) have real, symmetric Green’s functions that the real Hartley multiplier diagonalizes exactly, and HNO is favored there. Time-dependent operators carry phase, from oscillation in the wave equation to transport in advection, Burgers, and Navier-Stokes, which a real diagonal multiplier cannot represent, so FNO is favored there, and increasingly so with the operator’s phase content, leaving the phaseless heat equation as the borderline case. Training both operators identically and benchmarking across PDE classes, initial-condition families, and boundary conditions, we find an elliptic-versus-time-dependent split that is monotone in operator phase content and matches the Green’s-function theory we develop. Rather than a universal winner, our findings give a predictive rule: match the spectral basis to the symmetry of the solution operator.

[LG-1] Dirac-Frenkel dynamics with inertia for nonlinearly parametrized solutions of evolution problems

链接: https://arxiv.org/abs/2606.24769
作者: Matteo Raviola,Benjamin Peherstorfer
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Even when Dirac-Frenkel dynamics determine a well-defined evolution in function space, the corresponding parameter dynamics can be non-unique or ill-conditioned for redundant nonlinear parametrizations such as neural networks or mixture models. We propose to add inertia to the Dirac-Frenkel dynamics and show that this allows useful parameter velocity information to persist from the past trajectory in directions that are weakly informed, while well-informed parameter velocity directions continue to follow the Dirac-Frenkel dynamics. We prove that the inertial formulation yields well-posed parameter dynamics and provide a posteriori error bounds. After time discretization, the method requires the solution of the same type of regularized linear least-squares problem as standard Dirac-Frenkel dynamics, but with the previous velocity appearing as an anchor. Numerical experiments demonstrate the increased robustness obtained with inertia.

[LG-2] Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

链接: https://arxiv.org/abs/2606.25971
作者: Alexander Hägele,Alejandro Hernández-Cano,Atli Kosson,Martin Jaggi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern neural network training relies on optimizers such as Adam and Muon which act on each weight matrix as a single object. Yet every weight matrix carries two distinct quantities – a \emphmagnitude and a \emphdirection – and all optimizers stepping in the matrix as a whole couple their dynamics: the directional change from an update depends on the current magnitude, while the magnitude drifts as a byproduct of learning the direction, so neither is governed directly by the learning rate. Typical training therefore leans on surrounding recipes such as weight decay and warmup to keep learning stable at scale, though these regulate the coupling only indirectly; other recent methods instead constrain the weight to a fixed-norm sphere, but add no learnable magnitude, leaving scale control to normalization layers alone. We propose \emphMagnitude–Direction (MD) Decoupling, an optimizer modification that factorizes each weight into a fixed-norm direction on a hypersphere and learnable per-row and per-column magnitude gains, updated at separate learning rates, all while the model still sees a single fused weight tensor. The method is agnostic to the base optimizer and removes the need for weight decay and warmup. Across both Adam and Muon, MD Decoupling improves on well-tuned baselines, transfers the optimal LR across model width without retuning, and continues to help at scale on large Mixture-of-Experts (MoE) models. Treating magnitude and direction as separately controlled quantities thus yields more predictable training dynamics and a simple, broadly applicable improvement to modern optimizers.

[LG-3] textDT2: Decision-Targeted Digital Twins

链接: https://arxiv.org/abs/2606.25923
作者: Harry Amad,Mihaela van der Schaar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A digital twin (DT) is a virtual model of a real-world system that can assist decision-making by simulating scenarios induced by different policies. However, typical machine learning-based DTs do not optimise for this use case. We prove that, when model capacity is limited, training DTs to minimise one-step transition errors can produce suboptimal models for ranking sets of policies according to a reward function. We further show that this holds empirically, even with expressive model classes. To address this, we introduce \textDT^2 , a decision-targeted DT training paradigm. Firstly, \textDT^2 uses fitted Q-evaluation to estimate values of candidate policies from offline data. A DT is then trained to generate rollouts that preserve pairwise policy rankings derived from these proxy ground-truth values with an architecture-agnostic loss function. We empirically demonstrate the efficacy of our method across a range of settings and architectures. \textDT^2 consistently improves policy ranking and reduces decision regret during policy selection relative to conventional DT training, both for policies used during training and for unseen policies, while maintaining a good level of raw simulation fidelity.

[LG-4] Variational Autoencoder Layer

链接: https://arxiv.org/abs/2606.25900
作者: Gananath R
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variational Autoencoders (VAEs) belong to a family of autoencoders with probabilistic properties, making them well suited for generating data by producing a smooth and continuous latent space. Despite being introduced over a decade ago, the method continues to be widely adopted in both research and industry for diverse applications. While VAEs are typically used as standalone models, this paper introduces a novel approach to integrate them as a neural network layer. Furthermore, a new training strategy is proposed for models incorporating these layers, and their performance is thoroughly analyzed.

[LG-5] A 3D-Printable Dataset for Fair Testing and Comparisons of Tactile Sensors

链接: https://arxiv.org/abs/2606.25886
作者: Dexter R. Shepherd,Nicolas Herzig,Phil Husbands,Andrew Philippides,Chris Johnson,William Kimbell
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing texture datasets for tactile sensing primarily consist of sensor readings from a specific sensor interacting with available surfaces/objects rather than describing the textures themselves, limiting fair comparison between tactile sensors and hindering reproducible research. In this work, we introduce a 3D-printable dataset of mathematically defined textures designed to be fabricated reliably across different printers and filament types. The dataset consists of six parametrically generated surface patterns derived from combinations of sine-wave and Fourier-based functions, giving controlled variation in spatial frequency, amplitude, and directional structure. We evaluate the reproducibility of these textures across three popular 3D printers and multiple filament types by measuring variance in images captured using an optical TacTip sensor under controlled contact conditions. Our results show that print quality, particularly peak sharpness and stringing, affects tactile variance, with higher-end printers producing significantly more consistent signatures. Classification experiments using neural networks and PCA-based models further demonstrate that high-quality prints support strong within-printer generalisation, while cross-printer generalisation remains challenging due to geometric inconsistencies. This work establishes the first openly available, physically reproducible 3D-printed texture benchmark, providing a foundation for fair comparison of tactile sensors.

[LG-6] An Analysis of Posterior Collapse Parameterization and Initialization in Variational Deep Gaussian Processes

链接: https://arxiv.org/abs/2606.25882
作者: Francisco Javier Sáez-Maldonado,Juan Maroñas,Daniel Hernández-Lobato
类目: Machine Learning (cs.LG)
*备注: Submitted to the Journal of Machine Learning Research

点击查看摘要

Abstract:DGPs are probabilistic models with remarkable prediction performance that concatenate GPs across several layers. Exact inference in DGPs is intractable, and variational inference is often used to approximate the posterior with a parametric distribution tuned by minimizing the Kullback-Leibler divergence. Moreover, finding a good VI approximation is challenging. In particular, a problem of VI is posterior collapse, where VI converges to a variational posterior that matches the prior. In variational DGPs, this implies explaining the data as noise. This work studies posterior collapse in DGPs and identifies its connection to the DSVI algorithm and the widely used linear prior mean function employed in all but the last layer. We show that the benefit of the linear prior mean does not arise from avoiding the non-injective pathology in very deep DGPs, as previously believed, but from improving the conditioning of the optimization problem at initialization. Thus, we propose an alternative initialization of a zero prior mean DGP that mimics a DGP with a linear prior mean at initialization. This enables successful training of DGPs without imposing optimization-driven constraints on the prior, allowing to choose the prior based on modeling assumptions rather than optimization convenience. Our analysis considers three common parameterizations of DGPs and shows that not all of them benefit from a linear prior mean. We also explain why a whitened parameterization of the \DGP provides more stable convergence, something often assumed from experience, but lacking a rigorous analysis. Furthermore, we show that this stability is also beneficial to avoid the posterior collapse problem. Extensive experiments validate our findings: the proposed initialization prevents posterior collapse, improves stability, and achieves performance comparable to (and sometimes better than) DGPs with a linear prior mean.

[LG-7] Reasoning as Attractor Dynamics: Latent Memory Retrieval via Gibbs-Weighted Energy Minimization ICLR

链接: https://arxiv.org/abs/2606.24543
作者: Kanishk Awadhiya
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR Workshop 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are traditionally viewed as autoregressive generators. However, from the perspective of collective computation, they function as high-dimensional Dense Associative Memories that store complex reasoning patterns as latent attractors. In this work, we investigate the energy landscape of mathematical reasoning. We posit that correct reasoning chains correspond to deep, wide attractor basins (“flat minima”) in the model’s output distribution, whereas hallucinations manifest as sharp, unstable local minima. To exploit this geometry, we introduce a retrieval mechanism based on a Gibbs measure of the trajectory’s spectral entropy. By sampling multiple reasoning paths and weighting them by their inverse energy ( P \propto e^-\beta E ), we approximate the equilibrium distribution of the associative memory, effectively ``relaxing’’ the system into a robust solution. Empirically, this physics-inspired mechanism improves Microsoft Phi-3.5 performance on GSM8K by 5.38% (84.7% \to 90.1%), demonstrating that inference is better modeled as a dynamic settling process into an attractor basin rather than greedy next-token prediction.

[LG-8] Cellular Predictions on the Move: What about Data?

链接: https://arxiv.org/abs/2606.25709
作者: Natalia Vesselinova,Pauliina Ilmonen
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 12 pages, 9 figures, 9 tables

点击查看摘要

Abstract:Mobile cellular load forecasting is native to network resource optimization and delivery of services with reliability, latency and quality guarantees. The mainstream of machine learning research in the area is focused primarily on developing powerful learning structures for improved prediction accuracy. The data used for forecasting traditionally belong to the cellular domain and at most contain exogenous information about the surroundings of the base stations. We approach the prediction task from the perspective of data as a vital component of any data learning process. We hypothesize that substantial improvements could be achieved when the data inform on the processes that create the cellular load. Specifically, we propose to characterize the population dynamics – the potential number of cellular traffic sources and their mobility – in addition to employing historical time series of mobile data traffic. We validate our hypothesis for the rarely examined highway scenario. Comprehensive experiments show forecasting improvements on the order of 60% due to the use of these data alone.

[LG-9] Memory-Efficient Policy Libraries with Low-Rank Adaptation in Reinforcement Learning

链接: https://arxiv.org/abs/2606.25700
作者: Samuel Valland Lyngset,Tor Viljen Raanaas,Gard Sveipe,Eirik Møller Nilsen,Jim Torresen,Kai Olav Ellefsen,Tobias Lømo
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:When fine-tuning Large Language Models (LLMs), there has been success in minimizing both memory usage and computation with Parameter-Efficient Fine-Tuning (PEFT), like Low Rank Adaptation (LoRA). In this article, we have explored whether this approach is transferable to the world of robotics and Reinforcement Learning (RL), allowing learning with reduced memory usage and improved computational performance. Specifically, we focused on a version of multi-task robotics, where a library of specialist policies are created. In such a library memory efficiency is especially important. We used a Proximal Policy Optimization (PPO) algorithm and fine-tuned a baseline model to different tasks using LoRA. Our results demonstrate that, depending on the hyperparameters, LoRA can minimize memory usage by a factor of 20-160 compared to full fine-tuning of all layers. This implies a 90-95% storage saving when deploying a library of many (10-50) specialized policies, which can be the differentiating factor between being able to store the entire library in memory or having to use swap-memory in an applied robotics setting. At the same time, our results indicate that there is no significant difference in the success-rate between full fine-tuning and LoRA fine-tuning for the selected tasks.

[LG-10] Learning Subset-Shared Invariances for Domain Generalization with Mixture-of-Experts

链接: https://arxiv.org/abs/2606.25665
作者: Tien-Hung Nguyen,Tien-Dat Tran,M.-Duong Nguyen,Kok-Seng Wong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Domain generalization (DG) aims to learn a model from one or more source domains that generalizes to an unseen target domain without accessing target data during training. A common approach enforces invariance of representations across all source domains, assuming predictive structure is globally shared. However, we demonstrate that enforcing invariance across more domains gradually restricts the feasible representation space, discarding transferable predictive factors that are not universally shared. To address this limitation, we propose subset-shared invariance, where predictive structure is assumed stable only within domain subsets. We implement this principle with a mixture-of-experts architecture, where each expert aligns the specific domains it serves and a routing mechanism composes subset-invariant components for prediction. This creates a routing-conditioned invariance, jointly learned with the representation. To facilitate effective decomposition, we develop training objectives that encourage selective alignment, confident and balanced routing, and diverse expert specialization. Experiments on DomainBed benchmarks demonstrate improved out-of-domain generalization and greater robustness under increasing domain heterogeneity. Our results suggest that DG should move beyond enforcing a single global invariance and instead model invariance through partially shared structure across domain subsets.

[LG-11] Leaking Circuit Secrets: Gradient Leakage Attacks on Graph Neural Networks

链接: https://arxiv.org/abs/2606.25589
作者: Rupesh Raj Karn,Johann Knechtel,Ozgur Sinanoglu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 12 pages

点击查看摘要

Abstract:As graph neural networks (GNNs) become standard tools for critical tasks in circuit design and analysis, their security and privacy risks require careful attention. Here, we present the first comprehensive evaluation of gradient leakage attacks (GLAs) on GNNs in circuit-design and hardware-security tasks, a practical threat that has been largely overlooked. We assess state-of-the-art (SOTA) GNNs, including GraphSAGE, GCN, GIN, and GAT, trained on standard netlist benchmarks (ISCAS’85, EPFL, and TrustHub), for their fundamental vulnerability to GLAs. We find that GLAs can expose sensitive information, such as gate types and distinctive properties of hardware Trojans, which may assist adversaries in analyzing logic locking schemes or evading Trojan detection mechanisms. Our analysis shows that these risks are influenced by architectural features, with attention mechanisms (GAT) exacerbating leakage, while injective aggregation (GIN) provides comparatively stronger resilience. We further evaluate several SOTA defense techniques, including differential privacy, gradient clipping, secure aggregation, model compression with quantization, and adversarial training. We find that these techniques improve resilience only in specific settings and can also compromise model performance. Overall, our work provides key insights toward privacy-preserving GNNs and highlights the need for more robust and efficient defenses. We release our full methodology and artifacts.

[LG-12] Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors

链接: https://arxiv.org/abs/2606.25527
作者: Guozheng Ma,Lu Li,Zilin Wang,Pierre-Luc Bacon,Dacheng Tao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online reinforcement learning (RL) agents increasingly depend on knowledge acquired offline to achieve practical efficiency. Originally studied in offline-to-online RL, this paradigm now spans foundation model post-training and embodied intelligence, with prior types expanding from offline datasets and pre-trained policies to increasingly diverse knowledge sources such as multimodal foundation models and generative world models. Offline priors have become central to how deep RL is developed and deployed. However, this reliance introduces a challenge that the prevailing benchmark-driven paradigm cannot resolve: because prior validity varies across deployments and shifts during training, no single approach to managing it is universally optimal, and benchmark rankings offer limited guidance for real-world deployments. Rather than pursuing universal solutions, we argue that the field should shift to diagnosis-driven tension management, in which deployment-specific evidence guides how the learner relates to its priors throughout training, enabling both flexible and adaptive deployment. We support this position with a framework characterizing how priors reshape online optimization through three functional roles, controlled experiments demonstrating help-or-hurt reversals, cross-domain evidence from foundation model post-training to embodied intelligence, and engagement with five substantive counterarguments.

[LG-13] Distill on a Diet: Efficient Knowledge Distillation via Learnable Data Pruning ECCV2026

链接: https://arxiv.org/abs/2606.25488
作者: Yifan Wu,Yiqi Wang,Xichen Ye,Wenjing Yan,Xiaoqiang Li,Cheng Jin,Xiangyu Yue,Weizhong Zhang
类目: Machine Learning (cs.LG)
*备注: Acceepted by ECCV 2026

点击查看摘要

Abstract:Knowledge Distillation (KD) is widely used to obtain compact models for efficient inference in resource-constrained environments. Yet the computational overhead of the distillation process itself is often overlooked, raising the question of whether a better student model can be obtained with less data and less compute via data pruning. However, existing data pruning methods are not designed for KD: some introduce substantial overhead, such as obtaining training dynamics through retraining, while others rely on heuristic selection rules that fail to capture what KD actually requires, often resulting in suboptimal subsets. To address these issues, we propose IF-Beta, an efficient data pruning framework that combines influence functions with a learnable sampling policy. Empirically, we first demonstrate that influence functions can serve as an effective and efficient estimator of sample impact in KD settings, where only a pretrained teacher is available. Building on this, our sampling policy is specifically parameterized by a Beta distribution, whose highly flexible two-parameter family allows the policy to adapt to diverse pruning regimes rather than being tied to fixed heuristic forms. Next, we formulate KD pruning as optimizing this policy through a bilevel objective, where the inner loop operates in the teacher feature space with a KD-aligned objective, enabling fast proxy training, while the outer loop updates the policy parameters to maximize distillation performance. This design ensures that IF-Beta is both computationally efficient and inherently aligned with the goals of KD. Extensive experiments on CIFAR-10/100 and ImageNet show that IF-Beta consistently outperforms other baselines across a wide range of pruning ratios. Remarkably, IF-Beta enables students trained on less data and less compute to surpass the performance of students distilled on the full dataset.

[LG-14] owards Robust EEG Decoding Based on Riemannian Self-Attention KDD2026

链接: https://arxiv.org/abs/2606.25456
作者: Shaocheng Jin,Tao Zhou,Rui Wang,Ziheng Chen,Xiaoqing Luo,Xiaojun Wu,Josef Kittler
类目: Machine Learning (cs.LG)
*备注: Accepted by KDD 2026

点击查看摘要

Abstract:Brain-Computer Interface (BCI) based on electroencephalography (EEG) enables direct interaction between the brain and external environments and has significant applications in assistive technologies, medical rehabilitation, and entertainment. Recently, EEG decoding methods based on Symmetric Positive Definite (SPD) learning have demonstrated superior performance. However, these methods typically employ basic network architectures and do not explicitly capture local relationships between EEG signals. This limitation is problematic for EEG signals due to their inherently low Signal-to-Noise Ratio (SNR). Moreover, most existing Riemannian manifold-based methods are restricted to specific metrics. The most widely used is the Affine-Invariant Metric (AIM). However, it has a quadratic dependency on the SPD matrices and cannot handle ill-conditioned SPD matrices, which hinders the effectiveness of networks. In contrast, the Bures-Wasserstein Metric (BWM) exhibits linear dependence on SPD matrices and demonstrates superior performance for ill conditioning. To overcome these challenges, we propose a Riemannian self-attention network based on the BWM. Additionally, the recently introduced power-deformed generalized Bures-Wasserstein metric reveals a nonlinear relationship between SPD matrices and matrix power deformation. This metric provides a more nuanced representation of the geometric structure of the SPD manifold. Consequently, we extend our model to a learnable version. For simplicity, we refer to it as GBWAtt. Experimental results on three EEG benchmarking datasets validate the robustness and effectiveness of our proposed method. The code is available at this https URL.

[LG-15] DFMU: Data-Frugal Machine Unlearning

链接: https://arxiv.org/abs/2606.25410
作者: Sajith U,Prateek Keserwani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning is an emerging domain that ensures the safe removal of elements (includes concepts, attributes, entity and class) from the trained model along with least drop in model performance. The domain of machine unlearning brings its own indigenous challenges since the removal of pre-trained elements from model will always degrade the model performance on remaining elements. The existing methods basically rely on retraining for removal of elements from the pre-trained model, which is compute extensive. In this work, we propose a machine unlearning method which helps to reduce the computational requirement for faster retain-dataset accuracy convergence which also does not require extensive retraining of the pre-trained model. The proposed method, Data-Frugal Machine Unlearning (DFMU) requires only a single forward and backward pass for computing the importance score of various computational blocks of a model. The importance score computation is based on knowledge preserving pruning which helps to converge faster and requires far less data as compared to the existing methods. Experimentally, it achieves 40% more retain-accuracy with just 13% of data samples in comparison with SOTA method on various public datasets and also averages 88% faster processing time for forgetting a given class.

[LG-16] Lifelong In-Context Learning with Transformers Requires Parametric Forms of Attention

链接: https://arxiv.org/abs/2606.25342
作者: Luke McDermott,Robert W. Heath jr.,Rahul Parhi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lifelong continual learning remains an obstacle on the path to human-like intelligence. Modern transformers show sparks of intelligence with in-context learning. The quadratic nature of attention, however, prohibits transformers from performing this process on arbitrarily long sequences. In this work, we argue that extending in-context learning to lifelong settings is a practical solution for continual learning in AI agents. In particular, we argue that \emphparametric forms of attention are needed to understand a lifetime of context with transformers on a fixed hardware budget. These attention mechanisms learn the relationship between keys and their associated values at test-time with parametric regression. Our generalization of parametric approaches (linear attention, state-space models, fast weight programmers, and test-time training layers) contrasts with nonparametric counterparts like softmax attention. They replace the ever-growing key-value cache with an online-trainable neural network, maintaining a constant memory footprint. We highlight how parametric attention currently fall short of lifelong learning due to limited memory capacity or costly online updates. To address these issues, we pose a set of open questions with novel insights to guide the field toward long-horizon agents.

[LG-17] Stagnant Neuron: Towards Understanding the Plasticity Loss in Multi-Agent Reinforcement Learning Value Factorization Methods

链接: https://arxiv.org/abs/2606.25335
作者: Zhengzhu Liu,Zeming Gao,Haoyuan Qin,Jiawei Hu,Junhao Wu,Miao Zhu,Haipeng Zhang,Chennan Ma,Siqi Shen,Cheng Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-Agent Reinforcement Learning (MARL) value factorization methods can suffer from a loss of plasticity, gradually failing to adapt when transferring to new task instances. We trace this issue to stagnant neurons, units whose gradient updates become negligibly small relative to their weights, thereby hindering learning. While existing plasticity injection methods exist, they prove ineffective for such neurons. To address this, we propose Knowledge-retentive Neuron-level PlastIcity Focusing InjEction (KNIFE), a novel method that directly targets stagnant neurons. KNIFE replaces each stagnant neuron with a composite unit comprising three specialized components: a frozen knowledge neuron to preserve acquired knowledge, a re-initialized active neuron to restore learning capacity, and a compensation neuron to ensure the combined output matches the original, thus maintaining previous learned cooperation knowledge. Extensive experiments on SMACv2, predator-prey, and matrix games demonstrate that KNIFE significantly outperforms state-of-the-art plasticity injection methods.

[LG-18] Inverse Reinforcement Learning for Interpretable Keystroke Biomarkers in Parkinsons Disease

链接: https://arxiv.org/abs/2606.25270
作者: Navin Bondade
类目: Machine Learning (cs.LG)
*备注: 7 pages, 1 figure

点击查看摘要

Abstract:Keystroke dynamics have been explored extensively as a passive digital biomarker for Parkinson’s disease (PD), typically by extracting summary statistics from typing timing and training a classifier to discriminate PD from healthy controls. We instead apply inverse reinforcement learning (IRL) to keystroke data, modeling each keystroke as a discrete choice over typing speed and recovering, per subject, an interpretable reward function that explains their observed timing behavior. To our knowledge this is the first application of IRL to keystroke dynamics. On the public neuroQWERTY MIT-CSXPD dataset (85 subjects, 42 with PD), an initial four-parameter reward decomposition (speed, effort, smoothness, hand-alternation cost) was found to suffer severe feature collinearity between two terms ( r=1.000 in typical contexts); we diagnose and correct this, yielding an identifiable three-parameter model. The recovered speed-preference weight correlates with UPDRS-III severity at r=-0.607 ( p0.001 , n=42 ), replicates independently across two sub-cohorts, is stable across nine sensitivity configurations, and retains a statistically significant contribution beyond raw typing speed alone (incremental R^2 from 0.194 to 0.338, p=0.006 ). Two other recovered weights (consistency, hand-alternation) did not survive confound checks and are reported as negative results. We document two implementation bugs found during adversarial code review (session-boundary contamination, a rolling-window data leakage) and show the headline result is materially unchanged after fixing both. We discuss this result in the context of a literature where reported accuracies vary widely between studies (pooled AUC 0.85, I^2=94% in a 2022 meta-analysis), and argue that the validation process itself, not only the correlation coefficient, is part of the contribution.

[LG-19] Variational Inference via Entropic Transport Descent

链接: https://arxiv.org/abs/2606.25265
作者: Vincent Pacelli,Akash Ratheesh,Evangelos Theodorou
类目: Machine Learning (cs.LG)
*备注: 28 pages, 1 figure

点击查看摘要

Abstract:Particle-based variational inference (ParVI) methods approximate an intractable target distribution by evolving an ensemble of interacting samples. Existing approaches rely predominantly on kernel-based repulsion (e.g., SVGD), which suffers from variance collapse in high dimensions and mode collapse on multimodal targets – pathologies caused by the absence of global transport structure. We introduce entropic transport descent (ETD), a ParVI family that frames each particle update as an entropy-regularized optimal transport problem. Derived from the JKO proximal scheme by lifting to the space of couplings and relaxing via the KL chain rule, each ETD iteration reduces to a Sinkhorn computation. The resulting transport plan provides global coordination, guiding each particle to nearby high-density proposals and naturally preserving multimodal structure. ETD can operate entirely score-free, requiring only pointwise evaluations of the unnormalized target density. Experiments on variance-collapse diagnostics, Bayesian logistic regression, neural networks, and molecular Boltzmann distributions show that ETD matches or outperforms SVGD, AGF-SVGD, and SGLD, with the largest gains in high-dimensional and multimodal settings.

[LG-20] Efficient Adaptive Data Acquisition via Pretrained Belief Representations

链接: https://arxiv.org/abs/2606.25197
作者: Daolang Huang,Zhuoyue Huang,Conor Hassan,Luigi Acerbi,Samuel Kaski,Tom Rainforth
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:Learning effective policies for adaptive data acquisition remains challenging: posterior-based methods rely on surrogate models and posterior approximations that can be misspecified or biased, while direct policy-learning methods map from historical observations and fail to exploit available model representations, making learning harder. We introduce policy learning with belief representations (POLAR), based on the insight that optimal data acquisition depends on the observation history only through a sufficient belief state. Specifically, POLAR decouples representation learning from policy learning by leveraging pretrained predictive foundation models as belief-state encoders, training a policy head on top of their representations. This yields a simple, unified amortised policy learning framework for Bayesian experimental design, Bayesian optimisation, and active learning, differing only in the task-specific utility used to train the policy. Empirically, we find that POLAR outperforms state-of-the-art amortised methods across diverse tasks while requiring far fewer training samples, demonstrating a significant step in the scalability and efficiency of amortised data acquisition.

[LG-21] Efficient Analytic Uncertainty Quantification for Multi-Modal Regression

链接: https://arxiv.org/abs/2606.25188
作者: Kun Jin,James Harrison,Jiawei Li,Sihan Liu,Jiayi Liu,Randolph Linderman,Yuening Li,Arnab Bhadury,Sourabh Prakash Bansod,Liang Liu,Jasper Snoek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient uncertainty quantification (UQ) is essential for trustworthy large-scale learning. Existing UQ methods for regression tasks mainly operate under the assumption that the conditional label marginal satisfies single-peak parametric models, e.g., Gaussians, where the negative log-likelihood function simplifies to the mean square error. However, such single-peak assumptions fail in regression tasks featuring multi-modal distributions. On the other hand, semi-parametric methods which achieve strong regression performance for multi-modal distributions often lack efficient quantification on their prediction variances. In this work, we extend UQ techniques based on Variational Bayesian Inference (VBI) to two widely used semi-parametric regression models that yield histogram-like reconstructions of the conditional label densities: Quantile Regression (QR) and Classification Restoration (CR). Our approach introduces a unified, distribution-agnostic framework that simultaneously achieves accurate estimation of complex conditional distributions and highly efficient UQ. Theoretically, our method is grounded in novel formulations of QR and CR within the VBI framework, yielding analytic Evidence Lower Bounds (ELBO) to streamline training and a closed-form or analytically approximated predictive density for efficient inference. Empirically, we evaluate our methods on three large-scale regression benchmarks with multi-modal label distributions. Our framework outperforms state-of-the-art multi-modal regression baselines, and even matches predictive performance of computationally expensive ensemble models. Furthermore, by leveraging epistemic uncertainty estimation, our approach enables highly data-efficient active learning strategies.

[LG-22] Neural operator-based digital twins for modeling amyloid-β and tau propagation and treatment optimization in Alzheimers disease

链接: https://arxiv.org/abs/2606.25185
作者: Xiaofeng Xu,Tingting Dan,Zifan Zhou,Bin Li,Guorong Wu,Wenrui Hao
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:

点击查看摘要

Abstract:Accurately predicting the spatiotemporal evolution of amyloid- \beta and tau proteins at the individual level is critical for improving the diagnosis and treatment of Alzheimer’s disease. We consider the problem of constructing patient-specific digital twins that model the propagation of these biomarkers on the cortical surface using reaction–diffusion dynamics. A major challenge is that the underlying nonlinear aggregation mechanisms are unknown and must be inferred from sparse, noisy, and heterogeneous longitudinal PET imaging data. To address this, we develop a data-driven framework that learns biomarker dynamics directly from clinical observations. The approach combines operator learning with reduced-order representations to infer governing equations of disease progression from data. Using this framework, we achieve predictive accuracies of 87% for amyloid- \beta and 81% for tau. Building on the learned dynamics, we further formulate a PDE-constrained optimal control problem to design personalized therapeutic strategies that regulate pathological protein propagation. By integrating data-driven dynamical modeling with treatment optimization, the proposed digital twin framework provides an interpretable and predictive platform for understanding disease progression and enabling precision interventions in neurodegenerative disorders.

[LG-23] he Gentle Collapse: Distributional Metrics for Continual Learning

链接: https://arxiv.org/abs/2606.25165
作者: Ahmed Anwar,Andreas Wagner,Federico Raue,Tobias Nauen,Andreas Dengel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accuracy degradation is the standard metric for Catastrophic Forgetting (CF), however, it records only whether forgetting occurred or not. It saturates at the extremes and collapses discretely at task boundaries, hiding the internal structure of what is being forgotten. We introduce six softmax-derived metrics spanning true-label rank (TLR), predictive confidence, and distributional divergence that characterize forgetting continuously, each normalized to [0, 1] with no modification to training. On CIFAR-100, these metrics carry information where accuracy does not: at 0% accuracy, the Confusion Margin spans an IQR of [0.32, 0.50] across classes that accuracy treats identically. We demonstrate that this richer signal is actionable in mitigating catastrophic forgetting. Per-sample metric scores used as loss weights reduce forgetting by 1.3 percentage points over uniform experience replay (ER) on CIFAR-100. Furthermore, the slope of a metric over a small window provides a stable sampling criterion: at a small-window size (e.g. 3 epochs), accuracy-trend degrades to 34.79% (std. = 2.32) while log-TLR achieves 41.07% (std. = 0.57). This gap is structural since reliable small-window trend estimation requires a continuous signal. On TinyImageNet, log-TLR trend sampling reduces forgetting by 7.7 percentage points over the ER baseline.

[LG-24] Forget to Improve: On-Device LLM -Agent Continual Learning via Budget-Curated Memory

链接: https://arxiv.org/abs/2606.25115
作者: Beining Wu,Zihao Ding,Jun Huang,Yanxiao Zhao
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:On-device language-model agents improve by accumulating experience in retrieved memory rather than by updating weights. This memory is hard-bounded and exposed: it consumes RAM and energy, reaches peers through a thin uplink, and becomes an attack surface because it is writable by what the agent reads. Existing systems each cover one part of this problem: agentic memories grow without a budget, on-device methods keep entries by success alone, and poisoning is studied mainly as an attack rather than as a memory-governance problem. We propose \sys, a single net-value-per-byte score that governs an agent’s experience-memory lifecycle. The main idea is to let the budget act as the curator: each entry is scored as value minus harm, per byte, so one ruler decides what to keep, share, and trust. \sys makes three decisions: (1) \textbfKEEP evicts low-value bytes under the RAM and energy budget; (2) \textbfSHARE sends an insight only when its value exceeds its uplink cost; and (3) \textbfTRUST gates a peer entry by provenance. On language-model-agent task-drift benchmarks and a real heterogeneous Jetson testbed with two robot-arm nodes and a hub, \sys reduces memory by 2.7\times and uplink by 2.4\times , drives injection success from 0.75 to zero, and raises accuracy on cases corrupted by poison or stale memory. Curating by net value reduces footprint, energy, uplink, and injection success together without reducing accuracy. In this setting, forgetting by net value improves the agent rather than weakening it.

[LG-25] A Framework for Directed Hypergraph Signal Processing via tensor t-SVD

链接: https://arxiv.org/abs/2606.25112
作者: Carlos Mundo-Levano,Nicolás Bello,Daniel L. Lau,Gonzalo R. Arce
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 4 pages, 6 figures. Presented as an oral presentation at the 9th Graph Signal Processing Workshop (GSP 2026), June 8-10, 2026, Madrid, Spain

点击查看摘要

Abstract:We introduce Directed Hypergraph Signal Processing (DHGSP), a unified framework that extends graph signal processing to accommodate both higher-order (polyadic) and asymmetric (directional) relationships simultaneously. Using the tensor singular value decomposition (t-SVD) within the t-product algebra, we define a novel adjacency tensor for directed hypergraphs, a topologically faithful shift operator, and a lossless Directed Hypergraph Fourier Transform (t-DHGFT). Experiments on real traffic networks demonstrate that DHGSP outperforms matrix-based (graph and digraph) and undirected tensor-based (hypergraph) baselines in denoising tasks.

[LG-26] GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series

链接: https://arxiv.org/abs/2606.23880
作者: Mohammad Fesanghary,Abhinav Havaldar
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:From climate teleconnections to gene regulation, modern time-series datasets encompass tens or hundreds of interacting variables, making causal discovery increasingly challenging. Constraint-based methods offer statistical rigor but their nonlinear CI tests are infeasible at scale, while score-based alternatives avoid CI testing but require arbitrary thresholds to binarize continuous edge scores. We propose GRACE ( \textbfG ated \textbfR efinement for \textbfA ccurate \textbfC ausal \textbfE dge discovery), which refines constraint-based discovery using Hard Concrete gates with L_0 regularization: each candidate edge has an independent gate whose values concentrate near 0 or 1, yielding a clean bimodal separation that makes the binary decision robust, unlike the narrow, overlapping score distributions produced by L_1 and attention-based methods. A fast linear CI skeleton provides high-recall candidates; a single gated model then prunes false positives by learning which edges genuinely improve prediction, with automatic regularization adapted to problem dimensions and skeleton density. Systematic experiments on synthetic benchmarks, spanning diverse graph topologies (scale-free, Erdős-R’enyi, small-world) and dimensionalities up to d=100 , show that GRACE substantially improves F1 over its base CI method while maintaining high precision, and outperforms attention-based and score-based alternatives. GRACE matches or exceeds expensive nonlinear CI tests at a fraction of the cost ( 75\times faster). On a real-world river flow dataset, where rainfall confounders, variable propagation lags, and distributional shifts violate standard assumptions, a temporal bootstrap variant of GRACE recovers 9 of 11 causal edges along the Elbe River with only 1 false positive ( F_1 = 0.86 , AUROC = 0.99 ), reducing the skeleton’s 106 false positives by 99%.

[LG-27] Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data

链接: https://arxiv.org/abs/2606.23871
作者: Natalia Moreno-Blasco,Anusha Ihalapathirana,Pekka Siirtola,Miguel Fernandez-de-Retana
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:Survival analysis is central to clinical decision-making, yet reliable time-to-event models require large, diverse cohorts that are rarely available at a single institution, while privacy regulations restrict the centralization of patient data. Federated learning (FL) offers a privacy-preserving alternative by training shared models without exchanging raw data, but its effectiveness for survival modeling under realistic, heterogeneous conditions remains insufficiently understood. This paper presents a systematic, multi-model evaluation of federated survival analysis on a cross-institutional breast cancer cohort with naturally heterogeneous distributed clients. Three representative survival models, the Cox Proportional Hazards model, DeepSurv, and Random Survival Forest (RSF), are compared across centralized, local, and federated training, and three federated optimization strategies (FedAvg, FedProx, and FedAdam) are assessed for the gradient-based models. Results show that FL consistently outperforms local training and approaches, and occasionally exceeds, centralized performance, while RSF offers the best overall balance of discrimination, calibration, and robustness across heterogeneous clients. We further find that performance depends on the diversity of client distributions, and that FedAvg and FedProx are stronger and more stable than FedAdam. Based on these findings, we derive practical, decision-oriented guidelines mapping data, privacy, interpretability, and resource constraints to recommended model and training-paradigm choices for federated survival modeling in healthcare.

[LG-28] Exact Schur-Sylvester Dimensionality Reductions for Non-Smooth Stochastic Complexity and Manifold Sampling

链接: https://arxiv.org/abs/2606.23867
作者: Trenton Lau,Gary P. T. Choi
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:The exact computation of the Normalized Maximum Likelihood (NML) codelength for regular non-smooth estimators (e.g., Lasso) has been historically limited by the cubic scaling walls of manifold-constrained projection and volume integration. At each step of the geometric Propose-and-Project Metropolis–Hastings (PPMH) sampler, evaluating the projection operator requires inverting an (N+k) \times (N+k) generalized KKT matrix, while calculating the volume factor requires the determinant of an (N-k) \times (N-k) Gram matrix. This paper presents an exact, mathematically equivalent formulation that bypasses both bottlenecks by utilizing the block Schur complement and Sylvester’s determinant identity. We prove that the computational complexity of both operations collapses from \mathcalO(N^3) to \mathcalO(k^3 + N^2 k) per step. We generalize this reduction to Sparse Support Vector Machines (SVMs), Elastic Net, and Group Lasso. Finally, we provide a rigorous numerical stability analysis and evaluate the sampler’s efficiency using the Effective Sample Size (ESS) per second. Our empirical benchmarks on high-dimensional datasets confirm a constant speedup exceeding 14,100\times while maintaining double-precision numerical equivalence, rendering exact non-smooth NML estimation highly tractable for large-scale statistical inference.

[LG-29] Sesame: Structure-Aware Molecular Generation via Spatial Density-Map Conditioning

链接: https://arxiv.org/abs/2606.23856
作者: Konstantin Yatsenko,Arvind Thiagarajan
类目: Machine Learning (cs.LG)
*备注: 24 pages, 4 figures, preprint

点击查看摘要

Abstract:Generative molecular models for drug design are a promising direction with much active research. In the next phase of computational drug design, such models will need to understand small molecule structure and protein-ligand interactions, and they will need to possess the machinery to generate molecules de novo. Incorporating each feature poses a critical challenge. Equally important, yet often treated as secondary, is the ability to grow a molecule from a partial starting point – a scaffold or fragment supplied by a chemist – which is the central operation of lead optimization. We present Sesame (Spatial Evoformer for a Structure-Aware Molecular Engine), a diffusion-based molecular generation model that leverages a novel spatial pairformer module to condition on partial molecular structure and the surrounding protein pocket, both expressed as continuous spatial density maps. This single conditioning mechanism supports both de novo generation and fragment-conditioned lead optimization, letting a medicinal chemist prune a hit to a scaffold and have Sesame grow it in productive ways. In addition to this module, we also introduce a diffusion framework for joint denoising of atom types, bond types, and positions, along with a trajectory finetuning scheme that trains on the model’s own sampling rollouts to improve generation quality. Sesame is trained on a large corpus of ligand-only and protein-ligand datasets.

[LG-30] he Degeneracy Distillery

链接: https://arxiv.org/abs/2606.23838
作者: T. Lucas Makinen,Deaglan J. Bartlett,Niall Jeffrey,Benjamin D. Wandelt
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 30 pages, 10 figures. Supporting code found at this https URL

点击查看摘要

Abstract:When two or more parameters or labels produce similar data, they are degenerate, or hard to distinguish. Degeneracies render both label prediction and inverse problems difficult, since both machine learning algorithms and probabilistic samplers rely on the distinguishability of data and its gradients with respect to parameters. However, identifying degeneracies in physical models or real-world datasets can be elucidating about the choice of model or the underlying process that produces the data. We present the degeneracy distillery, a method that (1) detects and (2) resolves degenerate parameter combinations (a) automatically and (b) symbolically, from parameter-data (or parameter-simulation) pairs alone, through estimation and flattening of the Fisher information matrix. By exploring the information geometry of the likelihood, we characterize degeneracies as an intrinsic property of the physical model, requiring no realised data observation. We demonstrate our approach on a range of synthetic and real-world problems, discovering symbolic coordinate transformations that identify the combinations of parameters of a model which yield independent effects on the data. The resulting coordinates flatten the Fisher information in expectation globally, in contrast to posterior-based methods that flatten only at a single point, and substantially reduce the simulation budget required for downstream neural posterior estimation. In test cases we require up to 10\times fewer simulations for posterior estimation at matched validation calibration whilst simultaneously gaining physical insight on the system.

[LG-31] Reconstructing GRACE Terrestrial Water Storag e with Spatio-Temporal Graph Neural Networks: An Application to South America

链接: https://arxiv.org/abs/2606.23833
作者: Lukas Arzoumanidis,Lara Johannsen,Klara Middendorf,Annette Eicker,Youness Dehbi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Terrestrial water storage (TWS) integrates snow, soil moisture, surface water, and groundwater and is a key indicator of how climate variability and human activity reshape the global water cycle. The GRACE and GRACE-FO satellite missions provide the only direct, globally consistent observations of TWS change, but their record only begins in 2002 which is too short for many climate-scale analyses. We present a deep learning application that reconstructs monthly GRACE-like TWS anomalies (TWSA) back to 1940 by learning the relationship between daily ERA5 meteorological forcing (precipitation, evapotranspiration, runoff) and monthly GRACE observations. In contrast to prior reconstruction approaches based on grid-cell-wise regression, CNNs, or LSTMs, we adapt a multi-variate time series graph neural network (MTGNN) architecture, which was originally developed for mobility and traffic forecasting on urban sensor networks to this satellite-geodesy task. Spatial dependencies are encoded in a static, interpretable hybrid adjacency matrix that combines geodesic proximity with lagged correlations of climatic time series, capturing both local hydrological coupling and large-scale teleconnections. The reconstruction achieves a grid-cell Pearson correlation of 0.69, a basin-mean correlation of 0.94, and a near-zero bias, and it reproduces the spatial fingerprints of the 2015/16 El Niño and 2020/21 La Niña events. A systematic comparison with established reconstruction approaches (GTWS-MLrec, RM-REC, GRAiCE) shows that the graph-based model is statistically competitive at basin scale, reaching a correlation within 0.025 of the best baseline while using only roughly half to a tenth of the predictors the other models require and revealing characteristic weaknesses in arid regions in all models. The complete implementation is publicly available at this http URL

[LG-32] One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen with a Parameter-Free Compression Baseline

链接: https://arxiv.org/abs/2606.23767
作者: Wietse Stienstra
类目: Machine Learning (cs.LG)
*备注: 15 pages, 1 table. Code, pre-registrations and per-pair outputs: this https URL

点击查看摘要

Abstract:Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors’ own protocol – different pair subsets, weightings, model-selection, and decision rates. We argue this is the wrong comparison and run the right one: a same-hands re-evaluation in which every method is run by us on the identical 102 pairs, with one strict rule – no tuning and a decision forced on every pair. As a clean reference point we introduce a deliberately minimal baseline: sorted-conditional compression, which feeds quantized, sorted, first-differenced data to an off-the-shelf compressor (bz2) and has zero fitted parameters. Under the common ruler the ranking differs sharply from the literature. Our baseline reaches 74.7% weighted accuracy (p = 3.7e-7); on the same 100 pairs that SLOPE is evaluated on it scores 76.0%, a 1.2-point gap below the authors’ own forced-decision SLOPE (77.2%) that is well inside noise (McNemar p = 0.39). A faithful re-run of RECI lands at 70.7% – inside the original authors’ reported error bar, not the 77.5% often quoted (which we trace to a mis-copied cell). SLOPE’s published 82.4% is a decided-subset figure: scoring the authors’ own stored output only on the pairs its significance test chose to answer reproduces 81.7%. Under the common ruler the methods cluster in the low-to-mid 70s and the zero-parameter compressor ties the strongest of them. We document the mechanisms that inflate published figures (test-set model selection, significance-gated abstention) and contribute two further results: compression score magnitude is a model-free confounding flag (p = 2.8e-68), and a pre-registered falsification test fails in an instructive way that bounds the method’s theoretical interpretation. Code, pre-registrations, and per-pair outputs are released.

[LG-33] Verifiable Foundation Models for Robot Safety

链接: https://arxiv.org/abs/2606.23754
作者: Davide Corsi,Kyungmin Kim,Roy Fox
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable for existing verification tools. In this paper, we present FEARL (Foundation-Enabled Assured Robot Learning), a framework that addresses this tension through a modular architectural decomposition. FEARL separates the policy into a large Controller © responsible for high-dimensional perception and task reasoning, and a small Safety module (S) that receives low-dimensional observations from dedicated safety sensors together with a bounded context embedding from C and produces the final action. Since many robot safety requirements, such as collision avoidance and workspace boundary constraints, can be expressed over these safety sensor observations, formal verification can be applied to S rather than to the full foundation-model backbone. This makes formal analysis tractable with existing tools while preserving the Controller’s expressive power for task reasoning. To show that the decomposed policy remains capable of solving diverse tasks, we evaluate FEARL on three simulated robotic domains using multiple Controller backbones and training procedures, including pretrained off-the-shelf vision-language-action models. We further transfer the learned policy from one of our simulated tasks to a physical robot, suggesting that the low-dimensional safety interface supports practical sim-to-real transfer.

[LG-34] Low-Cost High-Order Singular Value Decomposition for Tensor-Based Reconstruction from Sparse Sensor Measurements: Urban Flow and Air-Quality Applications

链接: https://arxiv.org/abs/2606.24989
作者: Arindam Sengupta,Paul Jeanney,Ricardo Vinuesa,Jose Miguel Perez,Soledad Le Clainche
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Urban flow and air-quality simulations generate high-dimensional datasets describing velocity and pollutant transport across multiple spatial, temporal, and physical-variable dimensions. Reconstructing these fields from sparse sensor measurements is a fundamental challenge in environmental monitoring, digital twins, forecasting, and data assimilation. Existing low-cost reconstruction approaches are commonly based on matrix decompositions, which require multidimensional datasets to be flattened into two-dimensional snapshot matrices, thereby discarding important structural information. This work introduces the low-cost High-Order Singular Value Decomposition (lcHOSVD), a novel tensor-based sparse-sensing reconstruction framework for high-dimensional environmental fields. To the authors’ knowledge, this is the first methodology that combines sparse sensing and HOSVD for field reconstruction. Unlike matrix-based approaches, lcHOSVD preserves the natural tensor structure of the data, enabling the exploitation of correlations across spatial, temporal, and physical-variable dimensions while substantially reducing the computational requirements of conventional HOSVD. The methodology is applied to urban flow and air-quality datasets, where three-dimensional velocity and pollutant concentration fields are reconstructed using only 1-4% of the available spatial locations. While lcSVD provides larger computational speed-ups, lcHOSVD consistently achieves lower reconstruction errors in configurations characterized by strong multidimensional coupling and heterogeneous dynamics across dimensions. Additional sensor-anisotropy analyses demonstrate that the tensor formulation is significantly more robust to uneven sensor distributions, a common situation in practical environmental monitoring networks.

[LG-35] Latent Block-Diffusion Temporal Point Processes: A Semi-Autoregressive Framework for Asynchronous Event Sequence Generation

链接: https://arxiv.org/abs/2606.24982
作者: Shuai Zhang,Yancheng Chen,Chuan Zhou,Yang Liu,Xixun Lin,Xiangyu Zhao,Jun Zhu,Zhi-Ming Ma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Modeling and sampling from the underlying distribution of asynchronous event sequences are crucial in various real-world applications, including social networks, medical diagnosis, and financial transactions. Existing autoregressive methods suffer from error accumulation during multi-step generation, while non-autoregressive diffusion methods are typically limited to fixed-length output sequences. In this paper, we propose Latent Block-Diffusion Temporal Point Processes (LBDTPP), a novel semi-autoregressive TPP framework that introduces a latent block diffusion mechanism for high-quality and variable-length event sequence generation. The core idea is to define an autoregressive probability distribution over event blocks in latent space and perform Gaussian diffusion within each block. By sequentially generating blocks while simultaneously sampling events in each block, LBDTPP preserves the length flexibility of autoregressive TPPs and inherits the parallel high-quality generation capability of diffusion models. Theoretically, we derive Wasserstein error bounds showing that, under suitable local approximation and prefix-stability assumptions, block-wise generation can reduce error accumulation compared with event-wise autoregressive generation. Extensive experiments on six real-world benchmark datasets demonstrate that LBDTPP outperforms state-of-the-art TPP baselines in both unconditional and conditional generation tasks. Further empirical analyses verify the benefits of latent-space diffusion and block-wise generation, and reveal the trade-off between generation quality and block size. Our code is available at this https URL.

[LG-36] A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak–Ruppert Averag ing

链接: https://arxiv.org/abs/2606.24981
作者: Wei-Cheng Lee,Francesco Orabona
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study linear TD(0) under Markovian sampling, where data are generated along a single trajectory. We provide high-probability guarantees for a plain unprojected TD(0) algorithm with Polyak-Ruppert (PR) averaging, using a single stepsize schedule \eta_t \propto \frac1\tau_\mathrmmix\log(t)\sqrtt that depends on the mixing time but requires no prior knowledge of the curvature parameter \omega . Our first result shows that such a choice of the stepsize guarantees that the TD(0) iterates are automatically and uniformly bounded with high probability, without projections and without any stability argument based on \omega . Building on this result, we establish a simultaneous high-probability convergence guarantee for the PR average: the same stepsize yields both a robust curvature-free \widetilde\mathcalO!\left(\frac\tau_\mathrmmix\sqrtT\right) rate and a fast curvature-dependent \widetilde\mathcalO!\left(\frac\tau_\mathrmmix^2\omega T\right) rate, with the bound taking the minimum of the two. The core technical ingredient is a Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes Markov noise into a martingale term plus a controlled remainder and enables a new self-bounding inductive argument for pathwise stability.

[LG-37] Closed-Loop Graph Algorithm Execution with Small Language Models: Step Accuracy and Rollout Reliability

链接: https://arxiv.org/abs/2606.24980
作者: Michal Podstawski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Small language models offer an efficient alternative to large-scale systems, but their ability to execute structured algorithms over multiple dependent decisions remains poorly understood. We study graph algorithm execution as a closed-loop prediction problem in which a model repeatedly selects the next action from the current graph and algorithmic state. Our evaluation framework covers several classical graph procedures, multiple synthetic graph families, and disjoint training, validation, and test partitions. It assesses both local decision quality and global execution behaviour using step accuracy, exact rollout accuracy, constraint validity, partial solution quality, prefix survival, and intervention-based diagnostics. The results show that adaptation can produce reliable policies for structural procedures such as traversal and coloring, while weighted algorithms remain substantially more sensitive to error accumulation. More broadly, the findings demonstrate that strong next-step prediction does not necessarily translate into reliable autonomous execution and motivate evaluating algorithmic language models through complete closed-loop rollouts rather than isolated decisions.

[LG-38] CKM-Driven Communication-Aware UAV Intelligent Trajectory Optimization for Urban Inspection

链接: https://arxiv.org/abs/2606.24979
作者: Yang Xiaomeng,Jia Ziye,Zhu Qiuming,Wu Qihui
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) are increasingly employed in urban inspection tasks, where reliable communication is critical but challenging due to the severe spatial channel heterogeneity. To address the issue, in this paper, we focus on the communication-aware path planning for multi-UAV tasks, and propose a channel knowledge map (CKM)-driven trajectory planning framework which integrates the channel modeling and trajectory decision-making. Specifically, we apply the diffusion model to construct a time-accumulated CKM and achieve the accurate perception with low flight overhead, which leverages the sparse observation data to reconstruct the high-fidelity global channel quality distribution. Based on the CKM, we propose a global-to-local graph attention network soft actor-critic algorithm. The graph attention network optimizes the complex combinatorial node ordering problem, generating an optimal and communication-aware sequence for the inspection targets. Subsequently, the soft actor-critic algorithm performs continuous action control to ensure the smoothness of the flight path and dynamically avoid communication attenuation areas. Simulation results demonstrate that the proposed method effectively guides UAVs through high-quality channel regions without dependence on real-time channel feedback, significantly improving both the trajectory efficiency and communication reliability.

[LG-39] Auto-Configured Explainable Graph Neural Networks for Multi-Site Pollution Prediction

链接: https://arxiv.org/abs/2606.24978
作者: Abdelkader Dairi,Fouzi Harrou,Ying Sun
类目: Machine Learning (cs.LG)
*备注: 22 pages, 12 figures, 6 tables

点击查看摘要

Abstract:Accurate particulate matter (PM) prediction is crucial for mitigating air pollution. Graph Neural Networks (GNNs) effectively model spatiotemporal dependencies, but predefined graphs limit adaptability, and some datasets complicate learning. This study introduces a graph construction method based on a confusion matrix from a supervised learning process to dynamically capture inter-class relationships. Additionally, a hybrid loss function that combines energy distance and Huber loss is applied to address the vanishing gradient problem and improve learning stability. The approach is evaluated using air pollution data from the University of Utah AirU Pollution Monitoring Network in Salt Lake City, UT, with five GNN models: Graph Convolutional Networks (GCNs), Simple Graph Convolutional Networks (SGConv), Graph Isomorphism Networks (GINs), Graph Attention Networks (GATs), and GraphSage. The experimental results of single- and multistep predictions confirm that GraphSage achieves the highest accuracy in predicting the concentrations of PM 1 , PM 10 , and PM _2.5 over different time horizons. Furthermore, \colorblack GNNExplainer (Graph Neural Network Explainer) and PGExplainer (Probabilistic Graph Explainer) are applied to interpret feature importance and graph structure, ensuring model transparency. Results show improved prediction accuracy, with GNN models outperforming traditional machine learning \textcolorblackand deep learning models (i.e., Prophet, Long short-term memory, Gated recurrent units in air pollution forecasting.

[LG-40] Dont Go Breaking My LLM : The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration

链接: https://arxiv.org/abs/2606.24970
作者: Pietro Tropeano,Maria Maistro,Tuukka Ruotsalo,Christina Lioma
类目: Machine Learning (cs.LG)
*备注: Accepted at TMLR

点击查看摘要

Abstract:Pruning Large Language Models (LLMs) reduces memory and inference costs by removing parts of the network, producing smaller models that retain most of their accuracy. As attention layers are the most resource-intensive parts of LLMs, pruning them is a promising compression strategy. Prior work shows that up to 33% of attention layers can be pruned with minimal accuracy loss. Nevertheless, the impact of attention pruning on model interpretability, specifically faithfulness and confidence calibration, remains unstudied. To address this gap, we study how pruning attention layers affects explanation faithfulness and confidence calibration across five LLMs and eight datasets. While the pruned models often maintain high accuracy, we find that their faithfulness and calibration often degrade. Notably, faithfulness and calibration can fluctuate significantly, even when accuracy remains stable, highlighting a misalignment between model confidence, interpretability, and accuracy. Our findings suggest that layer pruning can affect LLMs’ interpretability and reliability in ways not captured by accuracy and efficiency measures alone. We recommend including explainability and calibration metrics when evaluating pruned models.

[LG-41] Frequency Domain Reservoir Computing

链接: https://arxiv.org/abs/2606.24969
作者: Klaus Schertler,Xiomara Runge,Andrea Ceni,David Kappel,Claudio Gallicchio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While the quadratic sequence-length bottleneck of transformers has fueled a resurgence in recurrent models, effectively capturing complex dynamics requires architectures that balance efficient training with highly expressive latent states. Echo State Networks (ESNs) offer a compelling approach by utilizing fixed recurrent weights to circumvent backpropagation through time, enabling a closed-form training solution. However, achieving the expressivity needed for complex tasks demands large reservoirs, exposing an \mathcalO(N^2) state-update bottleneck that prevents ESNs from matching the scale of contemporary recurrent models. To address this limitation, we introduce Frequency Domain Reservoir Computing (FRESCO), an ESN architecture operating entirely in the frequency domain while avoiding domain-shift overheads to achieve \mathcalO(N) complexity for dense, non-linear recurrent updates. By employing a novel dimensional zero-padding input embedding, a packed \FDh readout, and a natively applied frequency-domain non-linearity, FRESCO drastically reduces computational costs and energy consumption of training and inference. Furthermore, FRESCO matches the state-of-the-art predictive performance on memory benchmarks, sequential classification, and multivariate long-horizon forecasting, offering a scalable path forward for dense recurrent architectures.

[LG-42] raining Dynamics of Neural Software Defect Predictors under Coupled Data-Quality Issues

链接: https://arxiv.org/abs/2606.24968
作者: Emmanuel Charleson Dapaah,Philip Makedonski,Jens Grabowski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Context: Software defect prediction supports maintenance decisions such as testing prioritization, release-risk assessment, and quality monitoring. However, metric-based SDP datasets often contain coupled data-quality issues, especially class imbalance and class overlap. Prior work has mainly measured their impact through endpoint performance, while recent evidence suggests that such issues may also appear in neural training dynamics (gradients, weights, biases, error trajectories). However, these studies examine issues in isolation, leaving open how internal neural network training patterns manifest when data quality issues are coupled. Objective: We investigate how training-dynamics patterns from class imbalance, overlap, and their coupling can be characterized under interaction-aware conditions in deep learning-based SDP. Method: We conduct a controlled intervention study on class-level UBD datasets, training a fixed MLP under imbalance-only, overlap-only, and joint conditions across five seeds. Training dynamics are logged per epoch; fidelity is monitored via coupling ratios. Patterns are characterized using effect sizes, trajectories, sensitivity analyses, and rule-based classification. Expected contribution: The study will produce an interaction-aware empirical protocol and a candidate taxonomy of training-dynamics patterns for coupled data-quality issues in metric-based SDP. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.24968 [cs.LG] (or arXiv:2606.24968v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.24968 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Emmanuel Charleson Dapaah [view email] [v1] Tue, 23 Jun 2026 10:08:55 UTC (124 KB)

[LG-43] Learning Dynamical Systems from Multiple Sparse Datasets: A Hierarchical Bayesian Modeling Approach

链接: https://arxiv.org/abs/2606.24966
作者: Cristian Brugnara,Lea Multerer,Marco Forgione,Laura Azzimonti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating parameters of dynamical systems from sparse, noisy, and irregularly sampled data is often severely ill-conditioned. When multiple related datasets are available, they provide additional information if the shared structure and variability are properly modeled. We propose a hierarchical Bayesian framework for probabilistic meta-learning in dynamical systems, modeling dataset-specific parameters as draws from a shared population distribution. A numerical ODE solver is embedded within gradient-based MCMC to enable efficient posterior inference of the shared population and dataset-specific parameter distribution. Experiments show improved predictive performance over unpooled methods, highlighting the potential for data-efficient system identification in settings with sparse data.

[LG-44] Evidence for feature-specific error correction in LLM s

链接: https://arxiv.org/abs/2606.24964
作者: Francisco Ferreira da Silva,Stefan Heimersheim
类目: Machine Learning (cs.LG)
*备注: 13 pages, 11 figures

点击查看摘要

Abstract:Understanding the features of large language models (LLMs) is a central goal of interpretability. LLMs are commonly assumed to use superposition to represent more features than they have dimensions. They may not only represent features in superposition but also perform computation in superposition. Theory predicts that computing in superposition requires error correction that privileges feature directions over generic ones, but this prediction has not been tested empirically. We propose an empirical test of error correction in LLMs based on activation perturbations. Perturbing residual-stream activations, we find that they are robust to small perturbations–forming activation plateaus consistent with error correction–but less robust along candidate feature directions (“pure” directions, constructed from contrastive prompt pairs) than along mixtures of two such directions, indicating that the pure directions are privileged. We quantify this privilegedness by modeling the perturbation effect as a function of the L^p -norm of its decomposition into feature components. For p=2 the response is a quadratic form with at most as many nonzero eigenvalues as the residual-stream dimension, which cannot privilege the many feature directions superposition requires. p2 lifts this constraint and is consistent with feature-specific error correction. We find p2 for contrastive, MELBO, and SAE-decoder directions, and p\approx2 for random and PCA directions (controls). These results replicate across Gemma-2-9B, Qwen3-1.7B, Llama-3.1-8B, Mistral-7B-v0.3, Aya-Expanse-8B, and Yi-1.5-9B. We further validate our method on a toy model of error correction with known ground-truth features, recovering p2 for true feature directions, degrading toward 2 as we rotate away from them.

[LG-45] owards Scalable Multi-Task Reinforcement Learning with Large Decision Models

链接: https://arxiv.org/abs/2606.24962
作者: Thibaut Kulak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent progress in large-scale sequence modeling has shown that a single model can learn useful representations across highly diverse data distributions. Inspired by these advances, we investigate whether a unified transformer policy can be trained across large collections of heterogeneous reinforcement learning environments. We introduce LDM-v0, a Large Decision Model trained offline on trajectories collected from thousands of environments spanning multiple domains and modalities. LDM-v0 is a multi-task, multi-modal transformer policy conditioned on histories of observations, actions, rewards, and termination signals, and trained through supervised next-action prediction over offline trajectories. We describe the environment infrastructure, automated data generation pipeline, model architecture, and training methodology used to build LDM-v0, and evaluate its performance across diverse environments. We show that a single pretrained model matches the performance of independently trained task-specific reference policies on approximately 1,000 environments including robotics, autonomous driving, inventory management, cybersecurity, trading, and video games. These results demonstrate the feasibility of large-scale offline pretraining across heterogeneous reinforcement learning environments using a single transformer policy. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.24962 [cs.LG] (or arXiv:2606.24962v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.24962 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-46] Swarm-Inspired Generation of Collective Behaviors in Graph Dynamical Systems

链接: https://arxiv.org/abs/2606.24958
作者: Ji Chen,Song Chen,Chengzhang Gong,Li Fan,Chao Xu
类目: Machine Learning (cs.LG); Robotics (cs.RO); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Collective behavior arises when locally interacting units produce coordinated global organization, from synchronization in dynamical systems to task-relevant information flow on graphs. The central challenge is not only to explain how collective behavior emerges, but to design local interaction rules that can produce desired global organization and generalize across graphs, dynamics and this http URL address this challenge, we introduce the Swarm-Inspired Emergent Synchronizer (SIES), a graph-dynamical framework that learns generalizable local-interaction laws for controllable collective organization. Each node is an agent-like dynamical unit with a state and task cue, and signed source-target-conditioned attention acts as an adaptive coupling term inside an explicit evolution model. Therefore, SIES combines an explicit dynamical engine with local agent intelligence, similar to biological swarms. For synchronization control, SIES learns a generalizable coupling operator that produces prescribed synchronization patterns for CDSs across untrained network scales, target phase relations, and intrinsic node dynamics without retraining. The learned operator also reaches gait-related modes faster than three oscillator baselines and generalizes synchronization-driven locomotion to simulated multi-legged robots of different scales and a physical hexapod after leg disablement. For graph representation learning, SIES applies the same signed interaction principle to message passing and achieves the highest performance among the compared methods on heterophilous node-classification benchmarks. Together, these results position SIES as a generalizable and learnable graph-dynamical interaction framework with promise for synchronization control, adaptive robot coordination, and heterophilous graph representation learning.

[LG-47] owards Continuous Power Forecasting: Practical Continual Learning for Real-World Energy Systems in Nonstationary Time Series

链接: https://arxiv.org/abs/2606.24955
作者: Yujiang He,Frederic Uhrweiller,Bernhard Sick
类目: Machine Learning (cs.LG)
*备注: The submission is under review

点击查看摘要

Abstract:Power forecasting models deployed in real-world energy markets must operate under nonstationary conditions, where data distributions continually evolve due to weather variability, infrastructure upgrades, and changing consumption behaviors. In practice, these models face strict operational constraints: historical data may be limited or unavailable for repeated retraining, and uninterrupted long-term service is often required. This paper addresses these challenges by proposing the paradigm of Continuous Power Forecasting, which views power forecasting as a continual learning problem rather than a static offline task. Based on an adaptive continual learning framework for regression, we systematically investigate the practical effectiveness of six representative continual learning approaches from three methodological categories. These approaches are evaluated under different realistic assumptions regarding data accessibility and update policies. Experimental validation on real-world power datasets demonstrates that continual learning enables forecasting models to self-adapt to distributional drift, accumulate knowledge over time, and mitigate catastrophic forgetting without relying on large-scale historical data storage. Beyond performance gains, our study provides practical insights into the stability and adaptation behaviors of different continual learning approaches under realistic operational constraints. Overall, this work illustrates how continual learning can be pragmatically integrated into industrial power forecasting pipelines, offering a scalable and sustainable solution for long-term deployment in dynamic environments.

[LG-48] How Complexity Contributes to Learning Opacity in Machine Learning

链接: https://arxiv.org/abs/2606.24953
作者: Joachim Stein,Eric Raidl
类目: Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
*备注:

点击查看摘要

Abstract:Machine learning (ML) algorithms are known to be opaque. We do not know the reasons for their predictions. The learning process leading to the prediction function is also opaque. We do not fully understand the time evolution of the weight values of neural nets (NN) and related dynamical phenomena. While prediction opacity is widely studied, learning opacity remains largely underexplored. This article studies learning opacity trough the lens of complex dynamical systems. We argue that NN learning is essentially a complex system and that learning opacity is due to dynamical complexity and the epistemological challenges that arise from it. We identify three key properties of training complexity – sensitivity to weight initialization, feedback in gradient based optimization, and sensitivity to the training data – and show how each contributes to learning opacity. As these properties are fundamental to the learning process damping or eliminating them would fundamentally alter how ML systems learn. Some sources of opacity in ML may hence be irreducible.

[LG-49] Supervised Reinforcement Learning for the Coordination of Distributed Energy Resources

链接: https://arxiv.org/abs/2606.24947
作者: Haoyuan Deng,Yihong Zhou,Thomas Morstyn,Yi Wang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Presented at PSCC2026

点击查看摘要

Abstract:The increasing integration of distributed energy resources (DERs) is crucial for power system decarbonization, yet unlocking DERs’ flexibility is challenged by their inherent uncertainties and modelling complexity. As traditional optimization methods struggle with such uncertainty and complexity of DERs, reinforcement learning (RL) has emerged as a promising alternative for DER management. However, standard RL methods suffer from sample inefficiency and sub-optimality when trained from scratch. Inspired by the training paradigms in large language models, this paper proposes a Supervised Reinforcement Learning (SRL) framework for learning DER coordination policies. This framework first pre-trains a policy on demonstration data in a supervised-learning fashion, which is then further fine-tuned using RL. Furthermore, we propose a two-step fine-tuning process: offline fine-tuning for enhancing policy performance and online fine-tuning for adapting it to the real-world dynamics. Experiments demonstrate that RL implementations based on the proposed framework significantly outperform all benchmarks, achieving high cost efficiency even under low-quality demonstration data.

[LG-50] Conformal Orbit-Valid Trust Horizons for Equivariant World Models

链接: https://arxiv.org/abs/2606.24946
作者: Hongbo Wang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Learned world models are useful only over horizons on which their rollout error remains controlled. We study trust-horizon certification for latent world models with known group symmetries. Given a one-step latent residual and a finite-time expansion estimate, we form a raw horizon curve and calibrate it with a split-conformal multiplicative factor. On the reproducible audit set, the conformal factor is \gamma_\alpha=1.0 : the raw certificate is already conservative under the audit protocol. Across 50 stable audits, we observe zero anti-conservative violations, corresponding to an exact-binomial 95% upper bound of 5.8% on the violation rate. Our main structural result is that exact equivariance transports a calibrated trust-horizon curve over the group orbit: when the environment dynamics, encoder, predictor, action transform, and latent metric satisfy the stated equivariance/invariance conditions, rollout errors and trust horizons are orbit-constant. Empirically, the implemented models exhibit small orbit-transport residuals, with median 1.1% and maximum 4.1% over 14 orbit audits. The certificate is also non-vacuous (median certified-to-measured horizon ratio 0.67). A certificate-level calibration-cost study shows two complementary regimes. On a symmetric 2D substrate, equivariant, plain, and augmented models are all orbit-valid from a single calibration sector – no separation, because the substrate already makes non-equivariant baselines approximately orbit-robust. A 3D yaw audit shows the other regime: the equivariant model obtains a one-sector safe and non-vacuous orbit-valid certificate, while healthy non-equivariant baselines pay violation, slack, sharpness, or additional-sector cost. The certificate is a conservative, distributional audit rather than a global reachability guarantee, and certificate-guided subgoal spacing is not confirmed in the current 3D CEM-MPC behavior layer.

[LG-51] When Do Conservation Laws Survive Learned Representations? Certified Horizons for Latent World Models

链接: https://arxiv.org/abs/2606.24945
作者: Hongbo Wang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 15 pages, including appendices. Code: this https URL

点击查看摘要

Abstract:We ask a representation-learning question about physical world models: when does a conservation law remain certifiable after a model learns a latent representation? A certified horizon bounds – in advance, from measurable model defects – how many steps a rollout provably stays on a physical invariant’s level set. The key design choice is what is certified: not a learned latent Hamiltonian or a learned scalar witness (a model can conserve either while drifting in true energy), but the decoded physical invariant obtained by decoding the latent state and evaluating the known invariant. Around this object we derive shell-horizon certificates whose budget decomposes into representation, readout, and latent-dynamics defects, with a monotone alignment bridge through which a soft learned witness yields a certified horizon for the decoded invariant, and test them across state, learned-lift, and pixel observations on conservative systems. Conservation certificates can survive learned representation, but not all geometric priors survive equally: hard canonical symplectic structure yields the longest horizons in known phase coordinates yet does not cross a learned chart, whereas a controlled-Lipschitz-aligned soft invariant survives in the learned-representation settings we test; pixel certification is recovered on a readout-stable sub-tube; and the Kepler problem exposes a geometric boundary. The central object is therefore not a latent Hamiltonian, but a decoded physical invariant whose robustness to representation learning can be measured, certified, and falsified.

[LG-52] A Spectral Phase Diagram for Binary Few-Shot Classification: Intrinsic Dimensionality Geometric Saturation and Representational Diagnosis

链接: https://arxiv.org/abs/2606.24903
作者: Arnav Gupta
类目: Machine Learning (cs.LG)
*备注: 85 pages, 5 figures, 32 tables

点击查看摘要

Abstract:Deciding when to stop collecting labeled examples is a fundamental but undertheorized problem in applied machine learning. The saturation index S(K) = \operatornameerank(\widehat\Sigma_W^(K)) / K measures the ratio of the effective rank of the pooled within-class sample covariance to the shot count; we prove it falls below a threshold precisely when the covariance estimator is well-concentrated around the population covariance and the linear discriminant has stabilized. The index is computable in O(d^3) time from support features alone, requiring no test labels or trained classifier. Evaluated across N = 246 doubling-pair observations from seventeen binary tasks and six datasets, sixteen of seventeen tasks have a positive within-task Spearman correlation between S(K) and marginal accuracy gain (median \rho = 0.811 ). The pooled Spearman correlation is \rho = 0.548 ( p = 1.1 \times 10^-20 , N = 246 ). A three-phase diagram (exploration, transition, saturation) with mean marginal gains of 3.48% , 2.40% , and 0.82% is supported by all pairwise significance tests ( p \leq 0.008 ). As a binary stopping rule, the index achieves AUC = 0.752 , providing meaningful probabilistic guidance for annotation decisions. Asymptotic effective rank and peak accuracy show no significant monotone relationship across tasks (Spearman r_s = 0.380 , p = 0.133 , N = 17 ). A small saturation index paired with low accuracy diagnoses representational inadequacy. All results are for binary classification with a fixed linear classifier; extensions to N -way settings and pretrained backbone representations are discussed as future work. Comments: 85 pages, 5 figures, 32 tables Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.24903 [cs.LG] (or arXiv:2606.24903v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.24903 Focus to learn more arXiv-issued DOI via DataCite

[LG-53] When Does Synthetic Data Augmentation Improve Score-Based Imbalanced Classification?

链接: https://arxiv.org/abs/2606.26053
作者: Zhengchi Ma,Pengfei Lyu,Anru R. Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic data augmentation is widely used to mitigate class imbalance, but its theoretical effects on score-based classification remain poorly understood. This paper develops a framework for characterizing when synthetic minority augmentation can improve threshold-integrated and threshold-optimized metrics, including AUROC, AUPRC, best-threshold balanced accuracy, and best-threshold (\F_1) score. We separate the effect of augmentation into two components: a change in effective class weighting and a discrepancy between the synthetic and true minority distributions. Under well-specified score models, the raw estimator already targets the likelihood-ratio ordering, which is population-optimal for the metrics considered. Consequently, augmentation cannot provide a fundamental population-level improvement beyond possible finite-sample variance reduction, and may introduce additional bias through synthetic distributional error. We further establish minimax lower bounds showing that the raw estimator already achieves the optimal metric-regret rate in the well-specified regime. Under misspecification, however, augmentation can play a qualitatively different role: by changing the effective class balance, it can alter the restricted-class projection and correct ranking errors induced by the raw imbalanced objective. We provide explicit improvement bounds quantifying the roles of approximation error, finite-sample estimation error, and synthetic distributional error. Simulation studies corroborate the theory, demonstrating limited gains under well-specification and nontrivial but nonmonotone improvements under misspecification.

[LG-54] Knowledge Cascade: Reverse Knowledge Distillation on Nonparametric Multivariate Functional Estimation

链接: https://arxiv.org/abs/2606.25927
作者: Luyang Fang,Haoran Lu,Yongkai Chen,Wenxuan Zhong,Ping Ma
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As machine learning models and datasets continue to grow, developing complex models has become increasingly computationally demanding. Knowledge distillation reduces deployment cost by compressing a large, well-trained teacher model into a compact student model, but it does not address settings where constructing the teacher itself is the bottleneck. Motivated by this challenge, we introduce Knowledge Cascade (KCas), a reverse knowledge distillation framework that uses information from a small, inexpensive student model to guide the development of a more complex teacher model. Although this direction is counterintuitive because the teacher typically has greater representational capacity, we show that student-to-teacher transfer can be principled when supported by statistical scaling relationships. We first develop KCas for nonparametric multivariate functional estimation in reproducing kernel Hilbert spaces via smoothing splines, where selecting multiple smoothing parameters is a major computational bottleneck. KCas transfers student-selected smoothing parameters to the full-sample regime through asymptotic scaling laws, substantially reducing computational cost for high-dimensional and large-scale datasets while retaining theoretical guarantees. Beyond smoothing splines, we illustrate the same principle through kernel density estimation and deep learning hyperparameter transfer. Simulations and real-data experiments show that KCas achieves substantial computational savings while maintaining strong statistical performance, and can sometimes outperform the corresponding full-sample procedure.

[LG-55] Hierarchical Graph Learning for Calendar Spread Strategies in Commodity Futures Markets

链接: https://arxiv.org/abs/2606.25811
作者: Yoonsik Hong,Diego Klabjan
类目: Trading and Market Microstructure (q-fin.TR); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Portfolio Management (q-fin.PM); Pricing of Securities (q-fin.PR)
*备注:

点击查看摘要

Abstract:Commodity futures can be represented hierarchically, with underlying assets at the upper level and individual futures contracts at the lower level. Entities at each level can be connected by edges reflecting inherent correlations, with cross-level edges capturing contract-to-underlying asset connections. Building on our observations of these structures, we propose a hierarchical graph learning approach for calendar spread (CS) strategies in commodity futures markets, addressing two significant gaps in the machine-learning literature: (i) the absence of learning-based methods for CS strategies in futures markets, and (ii) the lack of consideration of maturity-dependent interrelationships across commodity futures. We first establish the efficacy of CS strategies by analytically showing that CS strategies can possess higher risk-adjusted returns, measured by the information ratio, and lower risk, measured by variance and delta, than long-only strategies. We then introduce a method to convert learning-based predictions into CS positions. Next, we develop a hierarchical graph learning method that predicts futures price movements by utilizing the maturity-dependent interrelationships, thereby yielding a CS trading algorithm. Empirical results on commodity futures markets traded on the Chicago Mercantile Exchange Group demonstrate that our method outperforms benchmark models in both prediction and trading performance. We find that maturity-dependent interrelationships across commodity futures are instrumental in prediction and that CS trading based on hierarchical graph learning is effective for statistical arbitrage.

[LG-56] Generating Input Distributions for Explaining Portfolio Optimization Pipelines

链接: https://arxiv.org/abs/2606.25808
作者: Batuhan Ataş,Nurşen Aydın,E. Mehmet Kıral,Ş. İlker Birbil
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a predict-optimize-explain framework that uses gradient-based sample generation to interpret various portfolio models by identifying macroeconomic conditions that induce specified portfolio outcomes. Unlike traditional feature-importance methods, this approach directly probes decision pipelines (predictive models coupled with portfolio optimization) by constructing economically meaningful what-if questions. We focus on four such questions: under what macroeconomic conditions a predict-then-optimize pipeline closes or reverses its return gap with a predict-and-optimize pipeline; what conditions lead a pipeline to diversify rather than concentrate its allocation; when a pipeline trained on calm markets overtakes one trained through crises; and what conditions would let a pipeline match a benchmark return. These examples illustrate how our framework uncovers key behavioral differences between various decision pipelines. Beyond these cases, the proposed framework is flexible and can support a wide range of probing questions tailored to specific portfolio objectives. Our findings highlight the value of integrating prediction, optimization, and explanation to produce more robust and transparent portfolio strategies.

[LG-57] Gaussian Mean Field Variational Inference can Overestimate Predictive Variance

链接: https://arxiv.org/abs/2606.25745
作者: James Odgers,Ben Riegler,Siddharth Swaroop,Vincent Fortuin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mean Field Variational Inference (MFVI) is widely understood to underestimate posterior variance. By analysing conjugate Bayesian Linear Regression (BLR), we show that this characterization is incomplete: while MFVI underestimates the variance in parameter space, it can overestimate the predictive variance compared to the exact posterior. We show that if the MFVI posterior underestimates predictive variances in some directions, it necessarily overestimates them in others. Crucially, this overestimation occurs in directions where the training data concentrates. This leads to the surprising result that, for a test point drawn from the training distribution, MFVI’s expected predictive variance exceeds that of the exact posterior. We demonstrate a pathological case of this effect, where the MFVI posterior fails to reduce predictive variance compared to the prior on in distribution data. We connect these results to the Cold Posterior Effect, arguing that varying the temperature can correct this overestimation, yielding predictions closer to those of the exact posterior. We validate our theory on synthetic and real-world regression tasks.

[LG-58] Statistically Valid Hyperparameter Selection: From Tuning to Guarantees

链接: https://arxiv.org/abs/2606.25601
作者: Amirmohammad Farzaneh,Osvaldo Simeone
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Hyperparameter selection is a critical step in the deployment of modern artificial intelligence systems, given the need to tune degrees of freedom such as inference-time parameters, implementation-level settings, and thresholds driving decision rules. Despite its practical importance, hyperparameter selection is typically performed using best-effort empirical methods such as grid search or Bayesian optimization, which provide no formal statistical guarantees on reliability or safety. This monograph presents a unified statistical framework for reliable hyperparameter selection, centered on the learn-then-test (LTT) paradigm, which formulates the problem as multiple hypothesis testing over a candidate set of hyperparameters. The framework enables the selection of hyperparameters that provably satisfy application-specific reliability requirements – such as bounds on average risk, quantile risk, or information-theoretic constraints – with explicit, finite-sample control of error probabilities. The supporting statistical machinery, namely p-values, e-values, and concentration inequalities, is developed from first principles in a dedicated appendix. Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2606.25601 [stat.ML] (or arXiv:2606.25601v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2606.25601 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-59] wo-dimensional Hyperbolic RNN Neural Quantum State

链接: https://arxiv.org/abs/2606.25600
作者: H. L. Dao
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:In the first part of this work, we construct the first type of two-dimensional (2D) hyperbolic neural quantum state (NQS) in the form of the Lorentz 2DRNN (Recurrent Neural Network) and benchmark its performance against the Euclidean 2DRNN in the paradigmatic N\times N 2D Transverse Field Ising Model (2DTFIM) setting with different lattice sizes up to N=12 and at different transverse magnetic field strengths. We find that hyperbolic Lorentz 2DRNN NQS definitively outperform Euclidean 2DRNN NQS when the system is at the phase transition point when the physics can be described by a conformal field theory (CFT), which is known to be dual to an Anti-de-Sitter (AdS) space whose spatial geometry is hyperbolic. In the second part of this work, we benchmark the performances of the recently introduced one-dimensional (1D) hyperbolic NQS including Poincaré RNN/GRU and Lorentz RNN/GRU against their Euclidean NQS versions in N\times N 2DTFIM, which has to be converted to a one-dimensional setting to allow for the use of 1D NQS. The findings in this case extend our previous results that 1D hyperbolic NQS definitively outperform 1D Euclidean NQS, thanks to the combined effects of the hierarchical structure comprising the first and N^th neighbor interactions present in the 1D system arising from the 2D lattice and the CFT physics at the critical point. While more studies with larger system sizes are required, our work serves as a proof-of-concept for the utility, effectiveness as well as the superior performances of one- and two-dimensional hyperbolic NQS ansatzes compared to the existing Euclidean NQS in many-body quantum physics systems, especially when these systems exhibit structural hierarchy or when they are at criticality, or a combination of both.

[LG-60] Blasto-Net: An Explainable Multi-Task Learning for Blastocyst Segmentation Grading and Implantation Prediction

链接: https://arxiv.org/abs/2606.25463
作者: Zahra Asghari Varzaneh,Reza Khoshkangini,Magnus Johnsson,Thomas Ebner,Lars Johansson
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study introduces Blasto-Net, a multi-task deep learning model for comprehensive blastocyst analysis. The proposed model performs three tasks simultaneously in a single forward pass: segmentation of the ZP, TE, and ICM compartments, morphological grading, and implantation outcome prediction. Accurate blastocyst analysis in in vitro fertilization (IVF) is challenging. The compartments often have similar textures but very different structures. To address these challenges, Blasto-Net employs an EfficientNet-B3 encoder with a UNet-style decoder enhanced by the Convolutional Block Attention Module (CBAM) and a novel Edge-Aware Attention Module (EAAM) to effectively capture both semantic and boundary information. To handle distinct compartment topologies, the network employs specialized segmentation heads and a composite region- and boundary-based loss. Additionally, Grad-CAM++ visualizations are used to verify the anatomical consistency of the model’s predictions. Evaluated on a public HMC blastocyst dataset, Blasto-Net achieves Dice scores of 94.93%, 91.60%, and 88.82% for ICM, ZP, and TE, respectively, alongside an implantation F1-score of 80.0%. These results demonstrate that Blasto-Net offers an accurate, interpretable, and efficient solution for automated blastocyst assessment, with strong potential to support clinical decision-making in IVF.

附件下载

点击下载今日全部论文列表