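Today's arXiv Papers | 2026-05-22

This post lists the latest papers retrieved from arXiv.org on 2026-05-22. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:30 each day.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.

Natural Language Processing: 103 papers (Computation and Language (cs.CL))
Artificial Intelligence: 236 papers (Artificial Intelligence (cs.AI))
Computer Vision: 164 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 264 papers (Machine Learning (cs.LG))
Multiagent Systems: 16 papers (Multiagent Systems (cs.MA))
Information Retrieval: 12 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 27 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems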

[MA-0] LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

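Quick read: This paper addresses sensitive-information leakage caused by latent communication through key-value (KV) caches in multi-agent large language model (LLM) systems. Sharing KV caches improves coordination efficiency and preserves task-relevant information, but the caches also encode contextual inputs, intermediate reasoning states, and agent-specific information, so sensitive content can propagate across agents without any explicit textual disclosure. The key to the solution is LCGuard (Latent Communication Guard), a framework that treats shared KV caches as latent working memory and learns representation-level transformations that are applied before cache artifacts are transmitted. It is trained adversarially: an adversary tries to reconstruct sensitive inputs from the cache, while LCGuard optimizes its transformations to preserve task-relevant semantics and reduce reconstructable information. Experiments across multiple model families and multi-agent benchmarks show that LCGuard consistently reduces reconstruction-based leakage and attack success rates while maintaining task performance competitive with standard KV-sharing baselines.

Link: https://arxiv.org/abs/2605.22786
Authors: Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy
Affiliations: Rensselaer Polytechnic Institute; IBM Research
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: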

Abstract:Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can improve efficiency and preserve richer task-relevant information. However, KV caches also encode contextual inputs, intermediate reasoning states, and agent-specific information, creating an opaque channel through which sensitive content may propagate across agents without explicit textual disclosure. To address this, we introduce LCGuard (Latent Communication Guard), a framework for safe KV-based latent communication in multi-agent LLM systems. LCGuard treats shared KV caches as latent working memory and learns representation-level transformations before cache artifacts are transmitted across agents. We formalize representation-level sensitive information leakage operationally through reconstruction: a shared cache artifact is unsafe if an adversarial decoder can recover agent-specific sensitive inputs from it. This leads to an adversarial training formulation in which the adversary learns to reconstruct sensitive inputs, while LCGuard learns transformations that preserve task-relevant semantics and reduce reconstructable information. Empirical evaluations across multiple model families and multi-agent benchmarks show that LCGuard consistently reduces reconstruction-based leakage and attack success rates while maintaining competitive task performance compared to standard KV-sharing baselines.

[MA-1] Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

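Quick read: This paper addresses the brittleness of autonomous systems in shared, dynamic real-world spaces despite their strong performance in isolation or simulation. It attributes the failure to the dominant single-agent paradigm, which ignores other actors or treats them as environmental noise and so prevents effective coordination. The key to the solution is multi-agent reinforcement learning (MARL) with league-based self-play, through which agents learn anticipatory behaviors such as proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. In high-speed quadrotor racing, the agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s while reducing collision rates by 50% relative to state-of-the-art single-agent baselines. Training with diverse artificial opponents also enables zero-shot generalization to safer interaction with humans, suggesting that the path to robust robotic co-existence lies in the rigorous demands of multi-agent interaction rather than in isolated safety constraints.

Link: https://arxiv.org/abs/2605.22748
Authors: Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza
Affiliations: University of Zurich; Google DeepMind; Nomagic
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 12 pages (+4 supplementary). Website: this https URL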

Abstract:Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: this https URL

[MA-2] Self-Evolving Multi-Agent Systems via Decentralized Memory

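Quick read: This paper addresses the drawbacks of centralized memory in LLM-based self-evolving multi-agent systems (MAS): communication and coordination overhead, privacy concerns, and collapsed agent diversity. The key to the solution is DecentMem, a decentralized memory framework in which each agent maintains its own dual-pool memory: an exploitation pool of consolidated past trajectories and an exploration pool of LLM-generated candidates for unseen contexts. The two pools are reweighted online using stage-wise feedback from an LLM-as-a-judge. The authors prove that this design guarantees global reachability of the solution space and achieves O(log T) cumulative regret, matching the stochastic bandit lower bound up to constants. Across three MAS frameworks (AutoGen, DyLAN, AgentNet), Qwen3 and Gemma4 backbones of several sizes, and five benchmarks spanning math, code, QA, and embodied tasks, DecentMem improves average accuracy by up to 23.8% over the strongest centralized memory baseline and by up to 52.5% over the no-memory baseline, while reducing token usage by up to 49%.

Link: https://arxiv.org/abs/2605.22721
Authors: Guangya Hao, Yunbo Long, Zhuokai Zhao
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: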

Abstract:Self-evolving multi-agent systems (MAS) have emerged as a promising route to LLM agents that continually improve from experience, with persistent memory at their foundation. However, existing designs almost exclusively adopt a centralized repository shared across agents, incurring communication and coordination overhead, raising privacy concerns, and collapsing agent diversity. We propose DecentMem, a decentralized memory framework in which each agent maintains its own dual-pool memory – an exploitation pool of consolidated past trajectories and an exploration pool of LLM-generated candidates for unseen contexts. The two pools are reweighted online based on stage-wise feedback from an LLM-as-a-judge. Theoretically, we prove that this design guarantees global reachability of the solution space and achieves O(\log T) cumulative regret, matching the stochastic bandit lower bound up to constants. In practice, across three MAS frameworks (AutoGen, DyLAN, AgentNet), three Qwen3 backbones (4B/8B/14B), two Gemma4 backbones (E2B/E4B) and five benchmarks spanning math, code, QA, and embodied tasks, DecentMem improves average accuracy by up to 23.8% over the strongest centralized memory baseline and by up to 52.5% over the no-memory baseline, while reducing token usage by up to 49%.
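
To make the idea of reweighting two memory pools online from judge feedback more concrete, here is a minimal sketch. It is not DecentMem's algorithm, which the abstract does not specify: it swaps in the standard UCB1 bandit rule, a well-known method with logarithmic regret, and treats the two pools as two arms. The pool names, the `toy_judge` function, and its fixed scores are all hypothetical.

```python
import math

def ucb1_select(counts, means, t):
    """Pick the pool with the highest UCB1 index; try each pool once first."""
    for pool, n in counts.items():
        if n == 0:
            return pool
    return max(
        counts,
        key=lambda p: means[p] + math.sqrt(2.0 * math.log(t) / counts[p]),
    )

def run_agent(judge, rounds):
    """Run one agent that chooses a memory pool each round and updates its estimates."""
    counts = {"exploitation": 0, "exploration": 0}
    means = {"exploitation": 0.0, "exploration": 0.0}
    for t in range(1, rounds + 1):
        pool = ucb1_select(counts, means, t)
        score = judge(pool)  # hypothetical judge feedback in [0, 1]
        counts[pool] += 1
        # Incremental update of the running mean score for this pool.
        means[pool] += (score - means[pool]) / counts[pool]
    return counts, means

def toy_judge(pool):
    """Hypothetical stand-in for an LLM-as-a-judge: fixed score per pool."""
    return 0.8 if pool == "exploitation" else 0.3

counts, means = run_agent(toy_judge, rounds=500)
print(counts)
```

In this toy setting the higher-scoring pool ends up selected far more often, while the other pool is still sampled occasionally, which is the exploration and exploitation balance that a regret bound quantifies.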

[MA-3] A Generalized Nash Equilibrium-Seeking Scheme for Trauma Resuscitation

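Quick read: This paper addresses how to quantify and optimize the decisions of healthcare workers (HCWs) during trauma resuscitation in order to improve patient outcomes. Current resuscitation practice is driven by clinical experience and lacks quantifiable metrics that accurately capture HCW decisions. The key to the solution is to formulate trauma resuscitation as a distributed generalized Nash equilibrium (GNE)-seeking game with coupled inequality constraints, optimized over a time-varying communication graph. The model draws on insights from clinical experience to describe HCW behavior and targets the best possible resuscitation outcome given HCWs' workloads, schedules, competencies, and limited resources.

Link: https://arxiv.org/abs/2605.22661
Authors: Promise Ekpo, Angelique Taylor, Lekan Molu
Affiliations: Cornell University
Categories: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
Comments: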

Abstract:Trauma resuscitation is a clinical process for treating life-threatening physiological disorders in safety-critical environments, driven by the experience of healthcare workers (HCWs). Designing and optimizing quantifiable metrics that accurately capture HCW decisions may augment current resuscitation procedures with the potential to improve patient outcomes. This motivates our socio-technical formulation of trauma resuscitation as a distributed generalized Nash equilibrium (GNE)-seeking game with coupled inequality constraints. This method is optimized over a time-varying communication graph. We introduce novel insights from clinical experience to model HCW behavior. This work facilitates the best possible resuscitation outcome given HCWs' workloads, schedules, competencies, and limited resources.

[MA-4] Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses Not Paper Generators

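Quick read: This paper addresses the lack of research judgment in current autonomous research (AutoResearch) systems: they can execute the scientific workflow (proposing ideas, running code, inspecting results, drafting papers) but do not learn from failed trials or change their later behavior. The key to the solution is Sibyl-AutoResearch, a self-evolving framework built around Scientific Trial-and-Error Harnesses and formalized through two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. The framework is implemented in SIBYL, a file-backed system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, and a recovered-failure registry shows how five naturally occurring failure classes (including duplicate results, stale numbers, and unsupported statistics) were blocked, downgraded, or routed into later repair. The authors note that these traces do not establish a comparative performance claim.

Link: https://arxiv.org/abs/2605.22343
Authors: Chengcheng Wang, Qinhua Xie, Wei He, Jianyuan Guo, Shiqi Wang, Chang Xu
Affiliations: University of Sydney; East China Normal University; TokenRhythm AI; City University of Hong Kong
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: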

Abstract:Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But executable workflows do not by themselves produce research judgment. We analyze where current systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains textual, and recurring process failures do not change later behavior. We introduce Sibyl-AutoResearch, a self-evolving AutoResearch framework built around Scientific Trial-and-Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route lessons into later planning, validation, claim scope, scheduling, critique, writing, and harness repair. We formalize this through two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file-backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, with a median latency of one iteration and a maximum latency of three iterations. A recovered-failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units are recoverable from realistic autonomous-research workspaces. The SIBYL framework and system are available at this https URL.

[MA-5] ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps

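Quick read: This paper addresses Conflict Mitigation (ConMit) in Open Radio Access Networks (O-RAN), that is, resolving conflicting control decisions detected in the Near-Real Time RAN Intelligent Controller before they degrade network performance. The key to the solution is ACCoRD, a Conflict Resolution (CR) Agent built on an artificial neural network (ANN) trained with the PPO-Clip reinforcement learning algorithm. The ANN analyzes data about the network and the conflicting control decisions to infer optimal CR actions; after each resolved conflict, the CR Agent gathers network feedback to assess the resolution's efficiency and adjusts the ANN's weights during batch training. In simulation-based evaluation, the ANN-based method improves on rule-based approaches by significantly reducing negative network events caused by conflicting control decisions in medium and high traffic scenarios.

Link: https://arxiv.org/abs/2605.22306
Authors: Cezary Adamczyk, Adrian Kliks
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: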

Abstract:Conflict Mitigation (ConMit) is a crucial part of intelligent network control in Open Radio Access Networks (O-RAN). In this paper, we propose a method named ACCoRD to resolve detected control conflicts in Near-Real Time RAN Intelligent Controller using a Conflict Resolution (CR) Agent with an Artificial Neural Network (ANN) trained with a reinforcement learning algorithm PPO-Clip. The implemented ANN analyzes data about the network and conflicting control decisions to infer optimal CR actions. The CR Agent gathers feedback from the network after each resolved conflict to assess its efficiency and adjust the ANN’s weights during batch training. The evaluation of the proposed approach is based on simulation data. A new methodology for evaluating CR solutions is proposed. Results show that the proposed ANN-based method improves on the efficiency of rule-based approaches by significantly reducing negative network events caused by conflicting control decisions in medium and high traffic scenarios.
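
For readers unfamiliar with PPO-Clip, the sketch below shows the standard clipped surrogate objective that gives the algorithm its name. It is a generic illustration only: the abstract does not describe ACCoRD's network, state features, or reward, and the function names and the default `eps=0.2` are assumptions made here.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample.

    ratio     : new_policy_prob / old_policy_prob for the taken action
    advantage : estimated advantage of that action
    eps       : clipping range (0.2 is a common choice, assumed here)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes the incentive to move the ratio
    # outside [1 - eps, 1 + eps] in the direction that raises the objective.
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch average of the clipped surrogate (to be maximized)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# A good action whose probability already grew by 50% is credited only up to 1 + eps.
print(ppo_clip_term(1.5, 1.0))
# A bad action whose probability already halved is evaluated at the clipped ratio 1 - eps.
print(ppo_clip_term(0.5, -1.0))
```

The clipping keeps each batch update close to the policy that collected the data, which matters when, as here, the agent learns from feedback gathered after each resolved conflict.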

[MA-6] Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

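Quick read: This paper asks when coordinated AI agents actually add value over simpler scientific workflows in settings where evidence spans instruments, databases, and disciplines. The key to the solution is a cross-domain benchmark covering four scientific tasks (mapping molecular structure into musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates), each evaluated with a frozen evaluation panel, predefined scoring protocols, explicit baselines, and ablations or null controls. The results show that coordination is credited only when an explicit comparator supports the corresponding claim: a performance gain (climate-vector emergence reaches AUROC 0.944), better interpretation and traceability (paradigm-shift detection), or a representational gain (molecular sonification).

Link: https://arxiv.org/abs/2605.22300
Authors: Fiona Y. Wong, Markus J. Buehler
Affiliations: Massachusetts Institute of Technology
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: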

Abstract:Scientific evidence often spans instruments, databases, and disciplines, so no single source records the full phenomenon. This makes it difficult to determine when coordinated AI agents add value over simpler scientific workflows. We evaluate this question with a cross-domain benchmark spanning four scientific tasks: mapping molecular structure into musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates. Each case uses a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations or null controls, and stated limitations. The results define three operating regimes. When different disciplines each capture only part of the phenomenon, cross-channel composites improve over single-channel baselines: climate-vector emergence reaches AUROC 0.944 and exoplanet vetting reaches AUROC 0.955. However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance. When one signal dominates, as in paradigm-shift detection, coordination mainly improves interpretation and traceability. For molecular sonification, the gain is representational rather than predictive. ScienceClaw x Infinite provides the auditable artifact and provenance layer for this evaluation. The benchmark therefore assigns value to coordination only when the corresponding performance, provenance, or representation claim is supported by explicit comparators.
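
The headline numbers above are AUROC values. As a reminder of what that metric measures, the following sketch computes AUROC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels are made up for illustration and are unrelated to the benchmark's data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy data: one positive (0.4) is ranked below one negative (0.6).
print(auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

Under this reading, an AUROC of 0.944 means a positive case outranks a negative one in roughly 94% of such pairs.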

[MA-7] Emergence of agriculture in an artificial society of reinforcement learning agents

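Quick read: This paper asks how the origin of agriculture, a major evolutionary transition, can emerge as complex collective behavior from simple interactions. The key to the solution is an artificial society of reinforcement learning agents embedded in a dynamic ecological environment, in which agricultural practices emerge through the coupled dynamics of learning and environmental modification. The transition is governed by four ingredients: individual planning through the valuation of delayed rewards, social vulnerability to cheaters, stabilization via social learning, and a lock-in effect that makes agriculture effectively irreversible once established. Social learning acts as a “firewall” that suppresses cheater invasion and enables successful strategies to spread, leading to sustained population growth and nonlinear amplification of domesticated resources. The study links individual decision-making, social interactions, and ecological feedbacks, and highlights artificial societies as experimental platforms for studying cultural innovations and major evolutionary transitions.

Link: https://arxiv.org/abs/2605.22256
Authors: Gautier Hamon, Martí Sánchez-Fibla, Clément Moulin-Frier, Ricard Solé
Affiliations: Flowers AI and CogSci lab, Inria, Université de Bordeaux, France; Department of Information and Communications Technologies, Universitat Pompeu Fabra, 08018, Barcelona, Spain; Artificial Intelligence Research Institute, IIIA (CSIC), Campus de la UAB, Bellaterra, Barcelona, 08193, Spain; BioTiC team (Inria, INSA Lyon, CITI, UR3720), France; Complex Systems Lab, Universitat Pompeu Fabra, Dr Aiguader 88, 08003 Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats, Lluís Companys 23, 08010 Barcelona, Spain; Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501, USA
Categories: Multiagent Systems (cs.MA)
Comments: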

Abstract:The origin of agriculture represents a major evolutionary transition and a paradigmatic example of how complex collective behaviors emerge from simple interactions. Here we introduce an artificial society of reinforcement learning agents embedded in a dynamic ecological environment to identify general principles underlying this transition. Within this system, agricultural practices emerge spontaneously - without explicit instruction - through the coupled dynamics of learning and environmental modification. We show that this transition is governed by four key ingredients: individual planning through the valuation of delayed rewards, social vulnerability to cheaters, stabilization via social learning, and an emergent lock-in effect that renders agriculture effectively irreversible once established. In particular, we demonstrate that social learning acts as a “firewall” that suppresses cheater invasion and enables the propagation of successful strategies, leading to sustained population growth and nonlinear amplification of domesticated resources. Together, these results reveal universal mechanisms linking individual decision-making, social interactions, and ecological feedbacks. More broadly, they highlight the potential of artificial societies as experimental platforms to study the emergence of cultural innovations and major evolutionary transitions.

[MA-8] The Log is the Agent: Event-Sourced Reactive Graphs for Auditable Forkable Agentic Systems

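Quick read: This paper addresses the opaque state management, poor reproducibility, and lack of fine-grained causal tracing in agent frameworks built around a language-model conversation loop, where state is persisted as retrievable “memory” and logging is bolted on afterwards. The key to the solution is ActiveGraph, a runtime in which an append-only event log is the source of truth, the working graph is a deterministic projection of that log, and behaviors (ordinary functions, classes, LLM-backed routines, or logic attached to typed edges) react to changes in the graph and emit new events. No component instructs another; coordination happens entirely through the shared graph. This design yields three properties: deterministic replay of any run from its log, cheap forking at any event without re-executing the shared prefix, and end-to-end lineage from a high-level goal down to the individual model call that produced each artifact.

Link: https://arxiv.org/abs/2605.21997
Authors: Yohei Nakajima
Affiliations: Untapped Capital; activegraph.ai
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 11 pages, 1 figure. Open-source Apache-2.0 implementation with reproducible quickstart demo, deterministic replay, fork-and-diff, and lineage tracing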

Abstract:Most agent frameworks are built around the language model: a conversation loop comes first, then tools, then rules, and finally a logging layer bolted on for observability, with state persisted as retrievable “memory.” We describe ActiveGraph, a runtime that inverts this arrangement. The append-only event log is the source of truth; the working graph is a deterministic projection of that log; and behaviors (ordinary functions, classes, LLM-backed routines, or logic attached to typed edges) react to changes in the graph and emit new events. No component instructs another; coordination happens entirely through the shared graph. This single design decision yields three properties that retrieval-and-summarization memory systems do not provide: deterministic replay of any run from its log, cheap forking that branches a run at any event without re-executing the shared prefix, and end-to-end lineage from a high-level goal down to the individual model call that produced each artifact. We present the architecture, a determinism contract that makes replay sound, and a worked diligence example whose full causal structure is reconstructable from the log alone. We discuss, without claiming to demonstrate, why this substrate is unusually well suited to self-improving agents, and how it extends the BabyAGI lineage and prior graph-memory research.
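
The event-sourcing pattern described above can be illustrated with a small, self-contained sketch. This is not ActiveGraph's API: the `Event` schema, the `project` and `fork` helpers, and the two event kinds are hypothetical names chosen here to show how a graph can be a deterministic projection of an append-only log and how a run can be branched at an event index.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str       # "add_node" or "add_edge" in this toy schema
    payload: tuple  # (node,) for add_node, (src, dst) for add_edge

def project(log):
    """Deterministically fold the log into a working graph (nodes, edges)."""
    nodes, edges = set(), set()
    for ev in log:
        if ev.kind == "add_node":
            nodes.add(ev.payload[0])
        elif ev.kind == "add_edge":
            edges.add(ev.payload)
    return frozenset(nodes), frozenset(edges)

def fork(log, at):
    """Branch a run at event index `at` by copying the shared prefix.

    Nothing is re-executed: the recorded events are reused as data.
    """
    return list(log[:at])

log = [
    Event("add_node", ("goal",)),
    Event("add_node", ("task",)),
    Event("add_edge", ("goal", "task")),
]

# Replay: projecting the same log always yields the same graph.
assert project(log) == project(list(log))

# Fork after the second event and let the branch diverge.
branch = fork(log, 2)
branch.append(Event("add_node", ("alt_task",)))
print(project(branch))
```

Because the graph is derived purely from the log, replay reduces to re-running `project`, and the original run is untouched when a branch appends its own events.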

[MA-9] AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

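Quick read: This chapter examines how AI can support real-time instructional adaptation in serious games, where static scenario design, authoring bottlenecks, limited learner modeling, and the difficulty of meaningful adaptation remain persistent challenges. The key contribution is a distinction between instructional intelligence (a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses) and adaptivity (the ability to modify instructional actions during interaction). Building on a historical synthesis that runs from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, and learning analytics, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity.

Link: https://arxiv.org/abs/2605.21962
Authors: Priyamvada Tripathi, Bill Kapralos
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Book chapter, 1 figure. To appear in “Advances in Global Applied Artificial Intelligence,” G. A. Tsihrintzis, M. Virvou, N. G. Bourbakis, and L. C. Jain (Eds.), Springer, Learning and Analytics in Intelligent Systems book series, 2026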

Abstract:Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system’s capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures. Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games. It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.

[MA-10] Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

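Quick read: This paper addresses the difficulty hardware LLM agents face on Complex Verilog Design Problems (CVDP): localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. The key to the solution is Trace2Skill, a test-time scaling framework that improves the agent without RTL-specialized model fine-tuning by treating the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. It also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. On hard CVDP tasks, Trace2Skill with dense verifier feedback substantially improves pass rates and produces passes on previously unsolved tasks.

Link: https://arxiv.org/abs/2605.21810
Authors: Zijian Du, Nathaniel Pinckney
Affiliations: NVIDIA
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: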

Abstract:Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent’s natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.

[MA-11] Secure Coordination for Vertiport Sequencing in Advanced Air Mobility

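Quick read: This paper addresses trustworthy coordination of dense traffic near vertiports in advanced air mobility (AAM) when sequencing relies on sensing that is uncertain and on self-reported information, such as estimated time of arrival, that may be falsified. Self-interested vehicles may misreport arrival times to gain landing priority, and malicious actors may spoof information to disrupt sequencing or induce unnecessary congestion. The key to the approach is a coordinator that combines self-reported Remote-ID information with externally obtained surveillance measurements and formulates sequencing as a robust design problem over the sensing uncertainty region. Self-interested misreporting is modeled as a strategic deviation that improves the reporting vehicle's own outcome, and malicious spoofing as an adversarial disturbance that degrades the system-level objective. The final paper is to develop robust sequencing rules over surveillance-consistent uncertainty sets so that assigned arrival schedules remain separation-feasible.

Link: https://arxiv.org/abs/2605.21771
Authors: Jaehan Im, Filippos Fotiadis, Ufuk Topcu, David Fridovich-Keil
Affiliations: University of Texas at Austin
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments: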

Abstract:Advanced air mobility operations will require reliable coordination mechanisms for managing dense traffic near vertiports. However, sequencing decisions may become vulnerable when they rely on potentially falsified self-reported information such as estimated time of arrival. Self-interested vehicles may misreport their arrival times to obtain favorable landing priority, while malicious actors may spoof information to disrupt sequencing decisions or induce unnecessary congestion. This paper studies secure coordination for vertiport sequencing under sensing uncertainty. We consider a coordinator that combines self-reported Remote-ID information with externally obtained surveillance measurements to check reports and assign separation-feasible arrival schedules. Since surveillance-based estimates are uncertain, falsified reports may remain consistent with the sensing uncertainty region and cannot always be rejected outright. We therefore formulate sequencing as a robust design problem over this uncertainty region. Self-interested misreporting is modeled as a strategic deviation that improves the reporting vehicle’s own sequencing outcome, whereas malicious spoofing is modeled as an adversarial disturbance that degrades the system-level objective. The final paper will develop robust sequencing rules over surveillance-consistent uncertainty sets and evaluate their performance in representative vertiport sequencing scenarios.
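
The following toy sketch illustrates why a falsified report cannot always be rejected. It assumes, purely for illustration, that the sensing uncertainty region is a symmetric interval around the surveillance estimate and that slots are assigned first-come-first-served with a fixed minimum separation. Neither assumption comes from the paper, and the robust sequencing rules it announces are not reproduced here.

```python
def is_consistent(reported_eta, surveilled_eta, radius):
    """Accept a report if it lies inside the (assumed interval) uncertainty region."""
    return abs(reported_eta - surveilled_eta) <= radius

def fcfs_schedule(etas, separation):
    """Assign arrival slots in ETA order, keeping at least `separation` seconds apart."""
    schedule = {}
    last_slot = None
    for vehicle, eta in sorted(etas.items(), key=lambda kv: kv[1]):
        slot = eta if last_slot is None else max(eta, last_slot + separation)
        schedule[vehicle] = slot
        last_slot = slot
    return schedule

# Honest reports: A arrives first and B is delayed to keep separation.
honest = fcfs_schedule({"A": 100, "B": 105}, separation=30)

# B under-reports its ETA as 98 s. Surveillance estimates 105 s with a 10 s radius,
# so the report is consistent with the sensing uncertainty and is not rejected.
assert is_consistent(98, 105, radius=10)
gamed = fcfs_schedule({"A": 100, "B": 98}, separation=30)

print(honest)
print(gamed)
```

In this example the misreport moves B ahead of A and pushes A's slot back, which is the kind of strategic deviation a robust formulation has to account for.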

[MA-12] Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

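Quick read: This paper addresses credit assignment when memory-augmented LLM agents are trained with reinforcement learning (RL) in multi-session environments. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, so trajectory-level comparisons in group-relative methods such as GRPO become unfair and yield noisy or biased credit signals for long-horizon memory operations. The key to the solution is Memory-R2, whose core algorithm, LoGo-GRPO, combines local and global group-relative optimization: local rerollouts compare different memory-operation outcomes from the same intermediate memory state, giving fairer group comparisons, while the global objective preserves end-to-end learning from long-horizon trajectory-level rewards. Memory-R2 also uses a shared-parameter co-learning design in which a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts, and a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions to stabilize multi-step RL.

Link: https://arxiv.org/abs/2605.21768
Authors: Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma
Affiliations: Ludwig Maximilian University of Munich; Munich Center for Machine Learning; Huawei Heisenberg Research Center (Munich); Technical University of Munich
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: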

Abstract:Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent’s past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
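
Group-relative methods score each rollout against the other rollouts in its group. The sketch below shows that normalization applied to two kinds of groups: a global group of full-trajectory rewards and a local group of rerollouts that all start from the same intermediate memory state. It is only an illustration of the general idea; how LoGo-GRPO actually combines the two signals is not given in the abstract, and the use of the population standard deviation and the example rewards are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, variants exist
    return [(r - mu) / (sigma + eps) for r in rewards]

# Global group: end-of-trajectory rewards of four full rollouts.
# Their intermediate memory states may differ, which is the fairness concern.
global_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Local group: three rerollouts branching from the SAME memory state,
# each trying a different memory operation (hypothetical rewards).
local_adv = group_relative_advantages([0.2, 0.6, 1.0])

print(global_adv)
print(local_adv)
```

Within the local group, differences in reward can be attributed to the memory operation itself, since every rerollout shares the same starting state.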

[MA-13] Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems

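Quick read: This paper addresses collaboration among heterogeneous multi-robot teams through dynamic robot allocation, where robots are transferable resources with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The key to the solution is a multi-team resource allocation framework that uses Hamilton's rule from ecology as an altruistic decision-making mechanism. Because the resulting allocation problem is combinatorial and shown to be NP-hard, the authors train a graph neural network (GNN) policy under centralized training and decentralized execution to approximate the altruistic allocations; the model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. Simulations and experiments in a firefighting scenario show that the learned policy achieves near-optimal performance while scaling to larger systems.

Link: https://arxiv.org/abs/2605.21723
Authors: Riwa Karam, Ruoyu Lin, Brooks A. Butler, Magnus Egerstedt
Affiliations: Georgia Institute of Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: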

Abstract:This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, where robots are treated as transferable resources. Leveraging Hamilton’s rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy under centralized training and decentralized execution that approximates the altruistic allocations based on Hamilton’s rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The proposed approach is validated in a firefighting scenario through simulations and experiments, demonstrating that the learned policy achieves near-optimal performance while scaling to larger systems.
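
Hamilton's rule states that an altruistic act is favored when r * B > C, where r is relatedness, B the benefit to the recipient, and C the cost to the donor. The sketch below applies that inequality to a single robot transfer and picks the best passing candidate. How the paper defines relatedness, benefit, and cost for robot teams is not stated in the abstract, so the field names and numbers are hypothetical, and this single-transfer choice is not the paper's combinatorial allocation or its GNN policy.

```python
def should_transfer(relatedness, benefit, cost):
    """Hamilton's rule: donate when r * B exceeds C."""
    return relatedness * benefit > cost

def best_transfer(candidates):
    """Among candidates passing Hamilton's rule, return the one with the largest r*B - C.

    Returns None when no candidate passes.
    """
    passing = [
        c for c in candidates
        if should_transfer(c["r"], c["benefit"], c["cost"])
    ]
    if not passing:
        return None
    return max(passing, key=lambda c: c["r"] * c["benefit"] - c["cost"])

# Hypothetical requests from two neighboring teams for one of our robots.
candidates = [
    {"team": "B", "r": 0.5, "benefit": 10.0, "cost": 4.0},  # 5.0 > 4.0, passes
    {"team": "C", "r": 0.9, "benefit": 3.0, "cost": 4.0},   # 2.7 < 4.0, fails
]
choice = best_transfer(candidates)
print(choice["team"] if choice else None)
```

Evaluating one transfer at a time is cheap; the difficulty the abstract points to arises when many robots and teams must be allocated jointly.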

[MA-14] Planning Scheduling and Behavior in EV Charging Systems: A Critical Survey and Trilemma Framework

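Quick read: This survey addresses the fragmentation of electric vehicle (EV) charging infrastructure research across three interdependent decision layers: Planning, Scheduling, and Behavior. The key contribution is a unified Planning-Scheduling-Behavior (PSB) framework that organizes the literature by decision horizon, actor objective, and coupling structure, together with a fidelity-tractability tradeoff termed the PSB trilemma: each layer is computationally difficult in isolation, and realistic integration across layers generally requires reducing the fidelity of at least one layer. Reviewing the three pairwise-coupling literatures, the survey shows that the omitted third layer is typically fixed exogenously or represented by a static aggregate surrogate, which aids tractability but can obscure long-term investment feedback, temporal grid and emissions dynamics, or heterogeneous user response and equity outcomes. It then identifies open challenges in emerging charging technologies, behavioral incentives, equity metrics, and city-scale learning-based methods.

Link: https://arxiv.org/abs/2605.21665
Authors: Peiyan Xiao, Yuheng Li, Ayan Mukhopadhyay, Sai Krishna Ghanta, Sabur Baidya, Yanhai Xiong
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Review article; 56 pages excluding references; 1 figure and 3 tables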

点击查看摘要

Abstract:The rapid growth of electric vehicles is shifting the main constraint on transport electrification from vehicle adoption to the deployment and operation of charging infrastructure. Charging-network design requires decisions across three interdependent layers: Planning, which determines where and how much infrastructure to build; Scheduling, which governs charging dispatch, pricing, and grid interaction; and Behavior, which captures how users choose stations, charging times, and charging durations. Existing studies have advanced each layer substantially, but the literature remains fragmented, and cross-layer interactions are often treated through simplifying assumptions. This survey develops a three-layer Planning-Scheduling-Behavior (PSB) framework to organize EV charging research according to decision horizon, actor objective, and coupling structure. We further identify a fidelity-tractability tradeoff, termed the PSB trilemma: each layer is computationally difficult in isolation, and realistic integration across layers generally requires reducing the fidelity of at least one layer. Reviewing the three pairwise-coupling literatures - Planning-Scheduling, Scheduling-Behavior, and Planning-Behavior - we show that the omitted third layer is typically fixed exogenously or represented by a static aggregate surrogate. These simplifications enable tractability but impose distinct costs: they can obscure long-term investment feedback, temporal grid and emissions dynamics, or heterogeneous user response and equity outcomes. Building on this diagnosis, we identify open challenges in emerging charging technologies, behavioral incentives, equity metrics, and city-scale learning-based methods that balance fidelity, interpretability, and policy relevance.

[MA-15] Argo: Efficient Importance Labeling for Enterprise Email Systems

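Quick read: The paper targets enterprise email importance labeling, where traditional methods rely on manual feature engineering and neither scale nor generalize, while LLM-based approaches such as GPT-4.1 are too computationally expensive to deploy at enterprise volumes. The proposed framework, Argo, has two key parts: a profiler that efficiently searches the cost-quality trade-off space to find labeling schemes with near-GPT-level quality at much lower cost, and an on-demand provisioning scheme that scales with real-time load to limit cost increases during peak inference. On three open-source email datasets, Argo reduces inference cost by 148-167x with negligible quality degradation and lowers profiling cost by 20-640,000x, making large-scale, context-aware email labeling practical for enterprises.

Link: https://arxiv.org/abs/2605.21604
Authors: Siddhant Ray, Ganesh Ananthanarayanan, Kevin Chian, Yan Guo, Cristina St Hill, Jack W. Stokes, Victor Wang, Junchen Jiang
Affiliations: University of Chicago; Microsoft; TensorMesh
Subjects: Multiagent Systems (cs.MA)
Comments: 15 pages, 19 figures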

Abstract:Email importance labeling has long been a critical yet challenging problem for businesses and individuals. Traditional approaches, such as keyword matching, user-defined rules, and sender-based heuristics, demand extensive manual feature engineering and fail to scale effectively or generalize. Recent advances in large language models (LLMs) demonstrate strong potential and a natural fit for this task, offering deep contextual understanding and superior labeling quality. However, using LLMs like GPT-4.1 at enterprise email volumes incurs prohibitive computational costs and hinders real-world deployment. We explore the trade-off space of using alternative labeling schemes as opposed to GPT-4.1-scale LLMs, with the goal of achieving near-GPT-level labeling quality at significantly lower cost. We develop Argo, an enterprise email labeling framework, in which we construct a profiler to efficiently search the cost-quality trade-off space of labeling and identify cost-efficient alternatives for labeling emails. Additionally, we design an on-demand provisioning scheme to intelligently scale Argo with real-time load, to minimize cost increases during peak-load inference. Over 3 open-source email datasets, Argo achieves 148-167x inference cost reduction with negligible quality degradation and 20-640,000x lower profiling costs, making large-scale, context-aware email labeling practical for enterprises.
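To make the cost-quality trade-off concrete, here is a minimal sketch of one selection rule a profiler could apply: pick the cheapest candidate whose quality stays within a tolerance of a reference labeler. This is not Argo's actual search procedure, which the abstract does not describe; the candidate names, costs, and quality scores are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    cost: float     # relative inference cost per email (made-up units)
    quality: float  # agreement with reference labels, in [0, 1] (made-up)


def cheapest_within(cands: List[Candidate], reference_quality: float,
                    max_drop: float) -> Optional[Candidate]:
    """Return the cheapest candidate within max_drop of the reference quality."""
    ok = [c for c in cands if c.quality >= reference_quality - max_drop]
    return min(ok, key=lambda c: c.cost) if ok else None


# Hypothetical candidates; none of these figures come from the paper.
candidates = [
    Candidate("reference-llm", 100.0, 0.95),
    Candidate("small-llm", 5.0, 0.94),
    Candidate("keyword-rules", 0.1, 0.70),
]
choice = cheapest_within(candidates, reference_quality=0.95, max_drop=0.02)
print(choice.name if choice else "no candidate meets the quality bar")
```

Tightening `max_drop` to zero forces the choice back to the reference model, which shows how the tolerance controls the savings.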

Natural Language Processing

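[NLP-0] Tokenisation via Convex Relaxations

Quick read: The paper addresses the local-optimum problem of existing tokenisation algorithms such as BPE and Unigram, which are greedy: they make locally optimal decisions when building a vocabulary without considering the resulting vocabulary as a whole. The key idea is to formulate tokeniser construction as a linear program and solve it with convex optimisation tools, yielding a new algorithm called ConvexTok. The method consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models, and it lets the user certify, via a lower bound, how far a tokeniser is from optimal with respect to a given objective. Empirically, ConvexTok is within 1% of optimal at common vocabulary sizes.

Link: https://arxiv.org/abs/2605.22821
Authors: Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel
Affiliations: ETH Zurich; Kensho Technologies
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)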

Abstract:Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms – they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.

[NLP-1] Vector Policy Optimization: Training for Diversity Improves Test-Time Search

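Quick read: The paper addresses the lack of diversity that current large language models (LLMs) show inside inference-time search procedures (such as evolutionary search in the style of AlphaEvolve), which stems from low-entropy response distributions. Standard post-training reinforcement learning (RL) optimizes a single scalar reward, so outputs lack diversity and adapt poorly to varied downstream reward functions. The proposed Vector Policy Optimization (VPO) is an RL algorithm that is essentially a drop-in replacement for the GRPO advantage estimator. It explicitly trains the policy to anticipate diverse downstream reward functions by exploiting vector-valued rewards (for example, per-test-case correctness in code generation, or rewards from different user personas), and to output a set of solutions that specialize in different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows, and in evolutionary search it unlocks problems that GRPO models cannot solve at all. As test-time search becomes standard, optimizing for diversity may need to become the default post-training objective.

Link: https://arxiv.org/abs/2605.22817
Authors: Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal
Affiliations: MIT; Improbable AI Lab; MIT-IBM Computing Research Lab; Sakana AI
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 24 pages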

Abstract:Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
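For readers unfamiliar with the pass@k metric mentioned in the abstract, the sketch below uses the widely cited unbiased estimator from the code-generation literature: with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the VPO paper computes pass@k exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generations containing c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy case: 4 generations, 2 of them correct, budget of 2 samples.
score = pass_at_k(n=4, c=2, k=2)
print(round(score, 4))
```

Because the metric rewards having at least one success among k attempts, a policy that spreads its samples across different solution strategies can score higher at large k than one that repeats a single high-probability answer.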

[NLP-2] Evaluating Commercial AI Chatbots as News Intermediaries

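Quick read: The paper addresses the fact that the accuracy of AI chatbots on emerging facts across languages and regions has not been systematically measured, particularly for systems with proprietary search integrations and retrieval-synthesis pipelines. Over 14 days (February 9-22, 2026), the authors evaluate six chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems exceed 90% multiple-choice accuracy but lose 11-13% under free-response evaluation, and the loss is 16-17% across the cohort. Three failure patterns emerge: (1) regional inequity, with every model scoring lowest on Hindi (79%), consistent with an Anglophone retrieval bias; (2) retrieval failures, not reasoning failures, account for over 70% of errors; and (3) models are fragile to subtle false premises, with accuracy dropping to 19-70%, alongside a detection-accuracy paradox in which the best false-premise detector is not the best in adversarial accuracy, indicating that premise detection and answer recovery are partially independent capabilities. The results suggest that high accuracy can mask systematic regional inequity, heavy dependence on retrieval infrastructure, and vulnerability to the imperfect queries real users pose.

Link: https://arxiv.org/abs/2605.22785
Authors: Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard, Daniel E. Ho, Dan Jurafsky, James Zou
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: this https URL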

Abstract:AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere) and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval failures, not reasoning failures, drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these results suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to the imperfect queries that real users pose.

[NLP-3] Reducing Political Manipulation with Consistency Training

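Quick read: The paper addresses the systematic covert political bias that large language models (LLMs) exhibit in sensitive political contexts: they handle counterpart topics from opposing political sides asymmetrically, in ways that are hard to notice but potentially harmful. It proposes two metrics for quantifying covert bias. Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts, and Helpfulness Consistency measures symmetry in depth and engagement. It then introduces Political Consistency Training (PCT), a reinforcement learning (RL) method with two complementary paradigms, Sentiment Consistency Training and Helpfulness Consistency Training, which substantially reduces covert political bias while preserving overall helpfulness and generalizing to held-out benchmarks.

Link: https://arxiv.org/abs/2605.22771
Authors: Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)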

Abstract:Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts; Helpfulness Consistency measures symmetric depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), an RL training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at this https URL

[NLP-4] Understanding Data Temporality Impact on Large Language Models Pre-training

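Quick read: The paper addresses the fact that large language models (LLMs) trained on shuffled corpora have knowledge frozen at training time and poorly understood temporal grounding, and it studies how data ordering affects the acquisition of time-sensitive factual knowledge. First, it builds a benchmark of over 7,000 temporally grounded questions with an evaluation protocol that tests whether models associate facts with the correct time periods. Second, it pre-trains 6B-parameter models on temporally ordered Common Crawl snapshots and compares them with standard shuffled pre-training. Sequentially trained models match the shuffled baselines on general language understanding while showing more up-to-date and temporally precise knowledge, suggesting that temporally ordered pre-training is a useful foundation for research on continual learning in LLMs.

Link: https://arxiv.org/abs/2605.22769
Authors: Pilchen Hippolyte, Fabre Romain, Signe Talla Franck, Perez Patrick, Grave Edouard
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)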

Abstract:Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at this https URL , checkpoints, and datasets at this https URL provide a foundation for future research on continual learning for LLMs.

[NLP-5] ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

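Quick read: The paper addresses the lack of a temporal dimension in existing biomedical knowledge graphs (KGs), which treat disease associations as static facts even though clinical reasoning depends on when a finding appears (a symptom at age 3 may suggest one disease, and at age 13 another). The core contribution is ChronoMedKG, a temporal biomedical knowledge graph with 460,497 evidence-linked triples covering 13,431 diseases, in which each association is tied to temporal components such as an onset window or progression stage and is backed by PMID-traceable evidence and a multi-signal credibility score. The graph is built by a disease-autonomous multi-agent pipeline in which several frontier LLMs independently extract knowledge from PubMed and PMC literature; only relations supported by multi-model consensus that also pass credibility filtering and ontology alignment are kept. ChronoMedKG reaches 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. The authors also introduce the ChronoTQA benchmark, on which ChronoMedKG retrieval rescues 47-65% of frontier LLMs' long-tail failures on temporal questions, compared with 17-29% for HPOA-RAG.

Link: https://arxiv.org/abs/2605.22734
Authors: Md Shamim Ahmed, Farzaneh Firoozbakht, Lukas Galke Poech, Jan Baumbach, Richard Röttger
Affiliations: University of Southern Denmark; University of Hamburg
Subjects: Computation and Language (cs.CL)
Comments: 9 pages main text plus appendices, 8 figures. Dataset and benchmark paper. ChronoMedKG released under CC BY 4.0 and ChronoTQA/code under MIT (Zenodo: https://doi.org/10.5281/zenodo.19697542 ). Under review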

Abstract:Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence-linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components like onset window or progression stage, which are backed by PMID-traceable evidence and a multi-signal credibility score. The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only those relations are kept that are supported by multi-model consensus, survive credibility filtering, as well as ontology alignment. ChronoMedKG scored 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12-question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47-65% of their long-tail failures, against 17-29% for HPOA-RAG. As such, ChronoMedKG provides a crucial temporal axis for retrieval-augmented clinical systems that was previously absent.
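The idea of tying each association to an onset window can be pictured with a small data structure. The sketch below is only an assumed shape for such a triple, not ChronoMedKG's published schema; the field names, placeholder diseases, and placeholder PMIDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TemporalTriple:
    disease: str
    finding: str
    onset_min_years: float   # assumed lower bound of the onset window
    onset_max_years: float   # assumed upper bound of the onset window
    pmids: Tuple[str, ...]   # evidence identifiers (placeholders here)


def diseases_for(triples: List[TemporalTriple], finding: str,
                 age_years: float) -> List[str]:
    """Diseases whose onset window for this finding contains the given age."""
    return sorted({
        t.disease for t in triples
        if t.finding == finding
        and t.onset_min_years <= age_years <= t.onset_max_years
    })


# Placeholder entries, mirroring the abstract's age 3 versus age 13 example.
triples = [
    TemporalTriple("disease_A", "finding_X", 0.0, 5.0, ("PMID_PLACEHOLDER_1",)),
    TemporalTriple("disease_B", "finding_X", 10.0, 18.0, ("PMID_PLACEHOLDER_2",)),
]
print(diseases_for(triples, "finding_X", 3))
print(diseases_for(triples, "finding_X", 13))
```

A static graph would return both diseases for the finding regardless of age; the window filter is what narrows the candidates.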

[NLP-6] AMEL: Accumulated Message Effects on LLM Judgments

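Quick read: The paper asks whether the polarity of prior conversation history biases the subsequent judgments of large language models (LLMs) used as automated evaluators, an effect the author calls the accumulated message effect on LLM judgments (AMEL). Empirically, judgments shift systematically toward the prevailing polarity of the history (d = -0.17, p < 10^-46), and the shift concentrates on items where the model is uncertain at baseline (d = -0.34 for high-entropy items versus d = -0.15 when the baseline is deterministic). Negative histories induce 1.62x more bias than positive ones, and the bias does not grow with context length. Three follow-ups narrow the mechanism: the token probability distribution shifts continuously rather than at a threshold, the negativity asymmetry has both token-level and semantic components, and the position of the biased turns in the history does not matter. The simplest mitigation for evaluation pipelines is a fresh context per item, or, when batching is unavoidable, a history balanced between positive and negative examples.

Link: https://arxiv.org/abs/2605.22714
Authors: Sid-ali Temkit
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 14 figures, 6 tables. Single author. Code, data (75,898 deduplicated API responses), and analysis pipeline at this https URL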

Abstract:Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation’s prevailing polarity (d = -0.17, p < 10^-46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.
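The effect sizes above are reported as d values. As a reminder of what such a number measures, the sketch below computes Cohen's d with a pooled standard deviation for two groups of scores. The abstract does not say which variant of d or which sign convention the paper uses, so both are assumptions, and the scores are toy values.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d with pooled standard deviation: (mean(a) - mean(b)) / s_pooled."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero")
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Toy scores: judgments after a biased history vs. judgments in isolation.
after_history = [1.0, 2.0, 3.0]
in_isolation = [2.0, 3.0, 4.0]
d = cohens_d(after_history, in_isolation)
print(d)
```

Under this convention a negative d means the first group scored lower on average, measured in units of the pooled standard deviation.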

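[NLP-7] Tokenization with Split Trees

Quick read: The paper addresses the fact that conventional subword tokenizers (BPE, WordPiece, and UnigramLM) do not directly optimize compression when building a vocabulary, which leads to redundant tokens and a shorter effective context, especially at large vocabulary sizes. The proposed method, Tokenization with Split Trees (ToaST), introduces a recursive inference procedure: using precomputed byte n-gram counts, each pretoken is greedily split into a full binary tree, and an Integer Program (IP) then selects the vocabulary that minimizes the total token count over all split trees. The Linear Programming (LP) relaxation is near-integral in practice, so it yields provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared with the baselines at vocabulary sizes of 40,960 and above, improves Renyi efficiency, and extends the effective context length. In experiments training 1.5B-parameter language models, it achieves the highest CORE score, outperforming the baselines by 2.6%-7.6%, with significance for two of the three.

Link: https://arxiv.org/abs/2605.22705
Authors: Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner
Affiliations: Kensho Technologies; Ben-Gurion University; MIT
Subjects: Computation and Language (cs.CL)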

Abstract:We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%–7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.
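The inference procedure described in the abstract (descend the split tree and emit the first in-vocabulary node on each path) can be sketched in a few lines. The tree-building rule below is an assumption: the abstract only says the split is greedy and uses byte n-gram counts, so the scoring function here (sum of the two halves' counts) is a stand-in. The sketch also assumes a single-byte leaf is always emitted, even if it is absent from the vocabulary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_split_tree(text: bytes, counts: Dict[bytes, int]) -> Node:
    """Greedily split into a full binary tree, independent of any vocabulary."""
    node = Node(text)
    if len(text) > 1:
        # Assumed scoring rule: prefer the cut whose two halves are most frequent.
        cut = max(range(1, len(text)),
                  key=lambda i: counts.get(text[:i], 0) + counts.get(text[i:], 0))
        node.left = build_split_tree(text[:cut], counts)
        node.right = build_split_tree(text[cut:], counts)
    return node


def tokenize(node: Node, vocab: Set[bytes]) -> List[bytes]:
    """Emit the first in-vocabulary node on each root-to-leaf path."""
    if node.text in vocab or node.left is None:
        return [node.text]  # in vocabulary, or a single-byte leaf (fallback)
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Toy n-gram counts, invented for illustration.
counts = {b"un": 50, b"happy": 40, b"hap": 5, b"py": 5}
tree = build_split_tree(b"unhappy", counts)
tokens = tokenize(tree, {b"un", b"hap", b"py"})
print(tokens)
```

Because the tree is fixed before the vocabulary is chosen, the token count for any candidate vocabulary is a simple function of which nodes are included, which is what makes the vocabulary selection expressible as an integer program.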

[NLP-8] Self-Policy Distillation via Capability-Selective Subspace Projection

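Quick read: The paper addresses two core problems of self-distillation for large language models (LLMs). Existing methods either rely on external signals to curate self-generated outputs (correctness filtering, execution feedback, or reward search), which are costly and unavailable for the best-performing frontier models, or they train on all raw generations, an approach that is often domain-specific and hard to generalize. Both share a deeper weakness: self-generated outputs entangle task-relevant capability with stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the target capability. The proposed Self-Policy Distillation (SPD) extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and then fine-tunes on the resulting raw outputs with the standard next-token prediction loss. Without any external signal, SPD improves by up to 13% over state-of-the-art self-distillation methods that also use no external signal and by up to 16% over pre-trained baselines across code generation, mathematical reasoning, and multiple-choice QA, and it performs 15% better under out-of-domain generalization settings.

Link: https://arxiv.org/abs/2605.22675
Authors: Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, Hanxue Liang
Affiliations: University of Cambridge; HKUST; University of Chicago
Subjects: Computation and Language (cs.CL)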

Abstract:Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the best-performing frontier models, or skip curation entirely and train on all raw outputs, an approach that is often domain-specific and hard to generalize. Both also share a deeper weakness: self-generated outputs entangle task-relevant capability with other factors, such as stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the specific capability one aims to improve. In this paper, we propose Self-Policy Distillation (SPD), which achieves generalizable, capability-selective distillation without any external signal. Specifically, SPD extracts a low-rank capability subspace from the model’s own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and fine-tunes on the resulting raw outputs with standard next-token prediction loss. Through extensive experiments across code generation, mathematical reasoning, and multiple-choice QA, we show that SPD achieves up to 13% improvement over state-of-the-art self-distillation methods without external signals and up to 16% improvement over pre-trained baselines. Notably, SPD demonstrates superior generalizability, achieving 15% better performance under out-of-domain generalization settings.

[NLP-9] Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

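Quick read: The paper addresses translation distortion in cross-lingual moral values classification, given that morally annotated corpora exist almost exclusively in English. It translates the data directly with a large language model (LLM) and validates the result with a four-method pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. Despite shortcomings in handling slang, vulgarity, and culturally loaded expressions, direct translation preserves subtle moral cues well enough for cross-lingual machine learning models to capture them: mean cosine similarity is 0.86, and AUC gaps are only 0.01-0.02, closing further under fine-tuning. This suggests that machine translation is a practical, low-cost path to moral values research in under-resourced languages, demonstrated here for Polish as a representative Slavic language, with expected generalisation to related languages.

Link: https://arxiv.org/abs/2605.22660
Authors: Maciej Skorski
Affiliations: University of Luxembourg
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)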

Abstract:Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using ~50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning – with mean cosine similarity of 0.86 and AUC gaps of 0.01–0.02 across all foundations closing further under fine-tuning of language models. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
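The first validation step, embedding similarity between a post and its translation, reduces to averaging cosine similarities over aligned pairs. The sketch below shows that computation on toy vectors. In the paper the vectors would come from LaBSE, which is not loaded here, and the exact aggregation the author uses is an assumption.

```python
from math import sqrt
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pair_similarity(src: List[Sequence[float]],
                         tgt: List[Sequence[float]]) -> float:
    """Average cosine similarity over aligned (original, translation) pairs."""
    if len(src) != len(tgt) or not src:
        raise ValueError("need two non-empty lists of equal length")
    return sum(cosine(u, v) for u, v in zip(src, tgt)) / len(src)


# Toy 2-d "embeddings": one pair identical in direction, one pair orthogonal.
originals = [[1.0, 0.0], [1.0, 0.0]]
translations = [[2.0, 0.0], [0.0, 3.0]]
print(mean_pair_similarity(originals, translations))
```

A mean near 1 indicates that translations land close to their originals in the shared embedding space, which is the sense in which the reported 0.86 supports meaning preservation.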

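[NLP-10] Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

Quick read: The paper addresses the weak performance of large language models (LLMs) as detectors of AI-generated modern Chinese poetry, a task for which no systematic study previously existed. It proposes an image-semantic guided poetry detection method that brings in images reflecting the content of the poem and combines meaning, imagery, and feeling from the image with the poem text. Using an example-driven approach, the image information and the poem text form a complementary judgment. Experiments show that LLM detectors using this method outperform plain-text baseline detectors and even surpass the best traditional detector, RoBERTa; the Gemini detector reaches a Macro-F1 score of 85.65%, which the authors report as state of the art.

Link: https://arxiv.org/abs/2605.22654
Authors: Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong
Affiliations: University of Macau; University of Rochester; Sichuan University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)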

Abstract:Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on data generated by multiple LLMs demonstrate the effectiveness of our method.

[NLP-11] Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

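Quick read: As artificial intelligence (AI) becomes part of daily life, the paper asks how different stakeholders (such as academia, individuals, and the private sector) understand and envisage AI's role in shaping social, political, and economic realities, and whether policy reflects these varied voices. The authors release a pipeline for cleaning a corpus of letters submitted to the public consultation for the Trump Administration's US AI Action Plan, then use topic modelling and frequency analysis to identify the topics each subgroup emphasizes and compare them with the topics in the Action Plan itself. Individuals voice strong concerns about AI's impact on life, while other stakeholders are more concerned with AI development; the Action Plan predominantly reflects private-sector concerns about security, policies, and development, with individuals' concerns less represented.

Link: https://arxiv.org/abs/2605.22650
Authors: Alina Karakanta, Alex Christiansen, Tomás Dodds, Bissie Anderson, Matteo Fuoli, Marcus Perlman, Aletta G. Dorst
Affiliations: unknown
Subjects: Computation and Language (cs.CL)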

Abstract:As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and economic realities. In this paper, we investigate public perceptions of AI based on a corpus of letters submitted during the public consultation for the Trump Administration’s US AI Action Plan. To this aim, we release a corpus cleaning pipeline and perform topic modelling and frequency analysis to explore predominant topics discussed by different subgroups (e.g., academia, individuals, private sector) and those appearing in the AI Action Plan. Our results show that individuals voice strong concerns related to the impact of AI on life, while other stakeholders are more concerned with AI development. Our comparison of topics suggests that the AI Action Plan reflects predominantly the concerns of the private sector on security, policies, and development, with individuals’ concerns less represented.

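[NLP-12] Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Quick read: The paper addresses the fact that text-based safety evaluation is no longer sufficient once large language models are deployed as agents, because the safety-relevant object shifts from what the model says to what it does within an environment. The authors introduce Boiling the Frog, a stateful multi-turn benchmark for tool-using AI models in corporate and office settings that tests susceptibility to incremental attacks. Each scenario starts with benign workspace edits and later introduces a risk-bearing request; chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe.

Link: https://arxiv.org/abs/2605.22643
Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi
Affiliations: unknown
Subjects: Computation and Language (cs.CL)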

Abstract:Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act’s Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.

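[NLP-13] More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

[Quick Read]: This paper tackles the difficulty of detecting implicit Schwartz values in political text, particularly when those values depend on surrounding arguments and on fine-grained distinctions between neighboring values. The key to the solution is a systematic evaluation of three factors: input context (sentence, window, full document), whether retrieval-augmented generation (RAG) with a curated moral knowledge base is used, and model architecture and scale (supervised DeBERTa-v3 encoders versus zero-shot large language models). More context is not uniformly better: full-document input improves supervised DeBERTa models by 3.8–4.8 macro-F1 points but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful, improving every tested model family and context condition under early fusion. Scaling up the model does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analysis shows that context and retrieval help most for socially situated or conceptually confusable values. The paper therefore argues that value-sensitive NLP should evaluate context, external knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.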

Link: https://arxiv.org/abs/2605.22641
Authors: Víctor Yeste,Paolo Rosso
Affiliations: PRHLT Research Center, Universitat Politècnica de València, Spain; School of Science, Engineering and Design, Universidad Europea de Valencia, Spain; Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Code: this https URL, best model: this https URL, 18 pages, 3 figures

Abstract:Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8–4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.
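
Since the comparisons above are reported in macro-F1 points, a minimal sketch of the metric may help. It assumes a multi-label setup in which each sentence carries a set of value labels; the label names and predictions are hypothetical.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over multi-label predictions."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # A label that never occurs and is never predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

labels = ["security", "benevolence"]
gold = [{"security"}, {"benevolence"}, {"security", "benevolence"}, set()]
pred = [{"security"}, {"security"}, {"benevolence"}, set()]
score = macro_f1(gold, pred, labels)
```

Because every label contributes equally, rare values weigh as much as frequent ones, which is why per-value analysis matters when reading a macro-F1 gain.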

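[NLP-14] The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution ICML2026

[Quick Read]: This paper addresses the problem that multi-task learning (MTL) for automatic radiology report generation (RRG) relies on coarse linear scalarization, which cannot balance the hard constraints of discriminative clinical supervision against the smoothness requirements of report generation. The core of the solution is a backbone-agnostic optimizer, Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, it ensures geometric validity while avoiding local optima, and an adaptive gradient fusion mechanism dynamically balances the theoretically optimal direction against task-specific inductive bias, which substantially improves clinical efficacy.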

Link: https://arxiv.org/abs/2605.22635
Authors: Erjian Zhang,Yatong Hao,Liejun Wang,Zhiqing Guo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ICML 2026

Abstract:While multi-task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a “Double Dilemma” of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids local optimal solutions. Then, the adaptive gradient fusion mechanism is used to establish a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at this https URL.
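
To illustrate what "conflicting" task gradients look like, here is a minimal sketch of a PCGrad-style projection. This is a different and simpler technique than CAME-Grad, whose update rule is not given in the abstract; the two-dimensional gradients are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_conflict(g_task, g_other):
    """Remove from g_task its component along g_other when the two conflict."""
    inner = dot(g_task, g_other)
    if inner >= 0:
        return list(g_task)  # no conflict: leave the gradient unchanged
    scale = inner / dot(g_other, g_other)
    return [a - scale * b for a, b in zip(g_task, g_other)]

# Hypothetical gradients for a classification loss and a generation loss.
g_cls = [1.0, 0.0]
g_gen = [-1.0, 1.0]
g_cls_fixed = project_conflict(g_cls, g_gen)
g_gen_fixed = project_conflict(g_gen, g_cls)
combined = [a + b for a, b in zip(g_cls_fixed, g_gen_fixed)]
```

With plain linear scalarization the summed gradient here would be `[0.0, 1.0]`, which makes no progress on the classification loss; after projection the combined direction has a non-negative inner product with both task gradients.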

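[NLP-15] Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

[Quick Read]: This paper addresses the problem that current reinforcement learning from internal feedback (RLIF) methods, which train large language models (LLMs) without supervision, rely on a single internal reward signal and are therefore prone to reward hacking, entropy collapse, and degraded reasoning structure, limiting stability and performance on long-horizon reasoning. The key to the solution is a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty, combined through GDPO-based normalization to reduce reward-scale imbalance. A KL-Cov regularization term additionally targets the low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Experiments on mathematical reasoning and code-generation benchmarks show improved stability and robustness over prior unsupervised RL approaches, with performance close to supervised RLVR methods.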

Link: https://arxiv.org/abs/2605.22620
Authors: Shourov Joarder,Diganta Sikdar,Ahsan Habib Akash,Binod Bhattarai,Prashnna Gyawali
Affiliations: Bangladesh University of Engineering and Technology; West Virginia University; University of Aberdeen; Fogsphere (Redev.AI Ltd, UK); University College London
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
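
The idea of combining two internal rewards on a common scale can be sketched as follows. Several things here are assumptions rather than the paper's method: exact-match majority voting stands in for cluster voting, the certainty scores are hypothetical, and per-reward z-scoring within a group of sampled completions is only one plausible reading of "GDPO-based normalization", which the abstract does not spell out.

```python
from collections import Counter
from statistics import mean, pstdev

def vote_rewards(answers):
    """Answer-level reward: 1.0 if a completion agrees with the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def normalize(values, eps=1e-8):
    """Z-score one reward stream within a group of sampled completions."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def combined_advantages(answers, certainty):
    """Normalize each reward separately, then sum, so neither scale dominates."""
    a = normalize(vote_rewards(answers))
    c = normalize(certainty)
    return [x + y for x, y in zip(a, c)]

# Hypothetical group of four completions for one prompt.
answers = ["42", "42", "17", "42"]
certainty = [12.0, 9.0, 15.0, 10.0]
advantages = combined_advantages(answers, certainty)
```

In this toy group the third completion is the most self-certain but disagrees with the majority, so the two signals pull in opposite directions, which is the kind of case a single reward cannot arbitrate.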

[NLP-16] Chinese sensorimotor and embodiment norms for 3000 lexicalized concepts

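[Quick Read]: This paper asks how conceptual knowledge is grounded in bodily experience and to what extent machine systems can acquire such knowledge without direct sensorimotor experience, questions central to both cognitive science and embodied AI research. Large-scale normative resources for non-Indo-European languages are sparse, which limits empirical work. The authors therefore build a new normative database of 3,000 lexicalized concepts in Mandarin Chinese, with 11-dimensional sensorimotor ratings and unidimensional embodiment ratings collected from 378 native Mandarin speakers. The key step is to test variables derived from a theoretically motivated metric, Perceptual Strength of Embodiment (PSE), on lexical decision tasks: PSE-Sensorimotor and Minkowski-3 turn out to be the strongest composite predictors, capturing the facilitatory effect of sensorimotor information on lexical processing. Simple regression models further show that sensorimotor ratings are substantially recoverable from purely linguistic representations (mean Spearman r = .62), and the relational geometry of the sensorimotor space is partially recoverable as well (r = .540), consistent with the view that distributional language use encodes aspects of embodied conceptual structure.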

Link: https://arxiv.org/abs/2605.22616
Authors: Jing Chen,Gábor Parti,Yin Zhong,Chu-Ren Huang,Marco Marelli
Affiliations: University of Milano-Bicocca; The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Understanding how conceptual knowledge is grounded in bodily experience, and to what extent machine systems can acquire such knowledge without direct sensorimotor experience, are central questions in both cognitive science and embodied artificial intelligence research. Large-scale normative resources are essential for investigating these questions empirically, yet such resources remain sparse for non-Indo-European languages. We present a novel normative database for 3,000 lexicalized concepts in Mandarin Chinese, comprising 11-dimensional sensorimotor ratings and unidimensional embodiment ratings collected from 378 native Mandarin speakers. The ratings demonstrate high reliability and strong cross-norm validity with existing Chinese resources, each of which covers fewer words and a subset of the 11 sensorimotor dimensions. In a validation study, we tested new variables derived from a theoretically motivated metric, Perceptual Strength of Embodiment (PSE) (Huang et al., 2025), together with seven common composite variables, on lexical decision tasks. The results suggest that PSE-Sensorimotor and Minkowski-3 are the strongest composite predictors of lexical decision performance, capturing the facilitatory effects of sensorimotor information on lexical processing. A further exploratory study showed that sensorimotor ratings are substantially recoverable from purely linguistic representations using simple regression models (mean Spearman r = .62 across dimensions), though recovery varied markedly: visual and auditory dimensions yielded higher correspondence than chemosensory ones. Representational similarity analysis further showed that the relational geometry of the sensorimotor space is also partially recoverable (r = .540), consistent with the view that distributional language use encodes aspects of embodied conceptual structure.
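
A composite such as Minkowski-3 collapses the 11 sensorimotor ratings of a word into one number. The sketch below assumes that the name refers to the Minkowski distance of order 3 between the rating vector and the origin, as in earlier sensorimotor norming work; the ratings are hypothetical.

```python
def minkowski_strength(ratings, order=3):
    """Minkowski distance of the given order between `ratings` and the origin."""
    return sum(abs(r) ** order for r in ratings) ** (1.0 / order)

# Hypothetical 11-dimensional ratings for one word (0 to 5 scale assumed).
ratings = [4.5, 1.0, 0.5, 0.0, 0.2, 3.8, 0.1, 0.0, 2.0, 0.3, 0.0]
strength = minkowski_strength(ratings)
```

With order 3 the strongest dimensions dominate while weaker ones still contribute, so the value always lies between the maximum rating (order approaching infinity) and the sum of ratings (order 1).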

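[NLP-17] Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents ACL

[Quick Read]: This paper addresses the limitations of current tools for overseeing and assessing the behavior of agentic systems: most offer observability with only basic evaluation capabilities, or impose static, hand-crafted error taxonomies that cannot adapt to new domains. The key to the solution is Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights at three levels of granularity (system, trace, and node), operates above the observability layer, and supports seamless integration with an intuitive UI. Experiments show that it produces high-quality, data-driven feedback that aligns strongly with human-annotated errors and can predict task success rate.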

Link: https://arxiv.org/abs/2605.22608
Authors: Asaf Yehudai,Lilach Eden,Michal Shmueli-Scheuer
Affiliations: IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: ACL

Abstract:Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

[NLP-18] A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models

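[Quick Read]: This tutorial addresses how to understand diffusion models in a unified way from the viewpoint of differential equations, in particular how to formalize the forward and reverse processes as ordinary differential equations (ODEs) and stochastic differential equations (SDEs), which exposes their mathematical structure and guides efficient sampling. The key steps are as follows. It starts from a conditional Gaussian forward process and shows that it admits both an ODE and an SDE representation. Averaging over the data distribution gives marginalized forward ODE and SDE formulations that transport the data distribution $p_0$ to the standard Gaussian prior $p_1$. It then derives the corresponding reverse-time dynamics, the reverse SDE and the reverse probability-flow ODE, both governed by the marginal score $\nabla \log p_t(x)$. This leads to a score-matching training objective and to the result that the noise-prediction objective is equivalent to score matching up to an additive constant independent of the model parameters. Finally, DDPM and DDIM are placed in this framework: they share the same training objective, while DDPM sampling corresponds to discrete reverse-SDE sampling and DDIM sampling to reverse-ODE sampling, giving a common theoretical basis for different diffusion models and for samplers such as DPM-Solver.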

Link: https://arxiv.org/abs/2605.22586
Authors: Jiayi Fu,Yuxia Wang
Affiliations: INSAIT, Sofia University “St. Kliment Ohridski”, Bulgaria
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: A detailed tutorial on Diffusion models and SDE

Abstract:This tutorial develops diffusion models from the viewpoint of differential equations. We begin with the conditional Gaussian forward process and show that this path admits both an ordinary differential equation (ODE) representation and a stochastic differential equation (SDE) representation. Averaging the conditional process over the data distribution then yields marginalized forward ODE and SDE formulations that transport the data distribution $p_0 = p_{\mathrm{data}}$ to a Gaussian prior $p_1 = \mathcal{N}(0, I)$. We next derive the corresponding reverse-time dynamics, namely the reverse SDE and the reverse probability-flow ODE, both of which are governed by the marginal score $\nabla \log p_t(x)$. This leads to a training objective for score estimation and shows that the standard noise-prediction objective is equivalent to score matching up to an additive constant independent of the model parameters. We then discuss sampling methods for the learned reverse dynamics, including DPM-Solver, as well as guided sampling through classifier guidance and classifier-free guidance. Finally, we compare DDPM and DDIM with the reverse SDE/ODE framework and show that they share the same training objective, while DDPM sampling corresponds to discrete reverse-SDE sampling and DDIM sampling corresponds to reverse-ODE sampling.
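
The link between noise prediction and the score, and the two reverse-time dynamics named above, can be written out in their standard form. The notation is an assumption and may differ from the tutorial's: the conditional path is taken as $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, and the forward SDE as $dx = f(x,t)\,dt + g(t)\,dw$.

```latex
% Conditional Gaussian path (assumed notation): x_t = \alpha_t x_0 + \sigma_t \varepsilon
p_t(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \alpha_t x_0,\ \sigma_t^2 I\right)
\quad\Longrightarrow\quad
\nabla_{x_t} \log p_t(x_t \mid x_0)
  = -\frac{x_t - \alpha_t x_0}{\sigma_t^2}
  = -\frac{\varepsilon}{\sigma_t}.

% Reverse-time SDE and probability-flow ODE for the forward SDE dx = f(x,t)\,dt + g(t)\,dw
dx = \left[ f(x,t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w},
\qquad
\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x).
```

Under this notation a noise predictor $\varepsilon_\theta(x_t, t)$ corresponds to the score estimate $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t) / \sigma_t$, which is why the two training objectives can coincide up to a constant.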

[NLP-19] Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion ICML2026

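[Quick Read]: This paper investigates why fine-tuning large language models (LLMs) to near-zero training loss on small datasets (the "hyperfitting" phenomenon) improves open-ended generation quality and reduces repetition in greedy decoding, a mechanism that has remained poorly understood. The key finding is that hyperfitting is not simple distribution sharpening such as temperature scaling; it relies on a dynamic, context-dependent reordering of token ranks. This effect is localized to a "Terminal Expansion" in the final transformer block, where a geometric expansion of the feature space (a dimension increase of about +80.8) promotes deep-tail tokens. The authors also propose Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers and yields robust generation with minimal parameter updates.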

Link: https://arxiv.org/abs/2605.22579
Authors: Meimingwei Li,Yuanhao Ding,Esteban Garces Arias,Christian Heumann
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: Accepted at ICML 2026

Abstract:Recent work has identified a counterintuitive phenomenon termed “Hyperfitting”, where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a “Terminal Expansion” in the final transformer block, where a substantial geometric expansion of the feature space ($\Delta\text{Dim} \approx +80.8$) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates.
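
An entropy-matched temperature control can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' protocol: the logits and the target entropy are hypothetical, and bisection is just one way to find the matching temperature.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def match_temperature(logits, target_entropy, lo=1e-3, hi=100.0, steps=100):
    """Bisection on T: entropy of softmax(logits / T) increases with T."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if entropy(softmax(logits, mid)) < target_entropy:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def ranking(probs):
    return sorted(range(len(probs)), key=lambda i: -probs[i])

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits
base = softmax(logits)
temperature = match_temperature(logits, target_entropy=0.3)
sharpened = softmax(logits, temperature)
```

Dividing logits by a positive temperature never changes their order, so `ranking(sharpened)` equals `ranking(base)`. A control of this kind can match the entropy of a hyperfitted model but cannot reproduce any reordering of token ranks.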

[NLP-20] LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance ACL2026

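[Quick Read]: This paper addresses the trade-off between language consistency and reasoning quality when reinforcement learning (RL) is used to strengthen multi-step reasoning in multilingual large language models (LLMs). Existing methods either prioritize input-language consistency, which hampers reasoning, or prioritize reasoning, which causes unintended language drift toward English. The key to the solution is the LANG framework, built on three mechanisms: (1) language-conditioned hints that guide exploration in non-English reasoning tasks; (2) a progressive decay schedule that gradually withdraws this scaffolding to prevent dependence on hints; and (3) a language-adaptive switch that tailors the learning horizon to the difficulty of each language. Experiments on multilingual mathematical benchmarks show that LANG substantially improves reasoning performance without compromising language consistency, and that the framework generalizes beyond mathematics, yielding more consistent language alignment across model layers.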

Link: https://arxiv.org/abs/2605.22567
Authors: Yuchun Fan,Bei Li,Peiguang Li,Yilin Wang,Yongyu Mu,Jian Yang,Xin Chen,Rongxiang Weng,Jingang Wang,Xunliang Cai,Jingbo Zhu,Tong Xiao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: Accepted to ACL 2026 (main conference)

Abstract:Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.

[NLP-21] SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

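[Quick Read]: This paper addresses the problem that the static datasets used to evaluate tool-calling agents are often limited by insufficient or unusable real production data (for example, sensitive or sparse data), so practitioners increasingly replace or augment real data with synthetic data, yet lack an effective way to quantify how well the synthetic data matches the real data.

The key to the solution is SynAE (Synthetic Agent Evaluation), a framework that assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. Experiments show that no single metric fully characterizes synthetic data quality, which motivates multi-axis evaluation; common synthetic data failure modes are tested through realistic and controlled generation schemes, giving a finer-grained assessment for agent testing.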

Link: https://arxiv.org/abs/2605.22564
Authors: Shuaiqi Wang,Aadyaa Maddi,Zinan Lin,Giulia Fanti
Affiliations: Carnegie Mellon University; Microsoft Research
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Notes:

Abstract:Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at this https URL, with code at this https URL.
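
One simple way to compare real and synthetic traces on the tool-call axis is the total variation distance between tool-usage distributions. This is an illustrative metric chosen here, not one confirmed to be part of SynAE; the tool names and traces are hypothetical.

```python
from collections import Counter

def tool_distribution(traces):
    """Relative frequency of each tool over all calls in a set of traces."""
    counts = Counter(tool for trace in traces for tool in trace)
    total = sum(counts.values())  # assumes at least one tool call overall
    return {tool: n / total for tool, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)

# Hypothetical traces: each inner list is the tool-call sequence of one task.
real = [["search", "book"], ["search", "cancel"]]
synthetic = [["search", "book"], ["search", "book"]]
gap = total_variation(tool_distribution(real), tool_distribution(synthetic))
```

Here the synthetic set never calls `cancel`, which shows up as a distance of 0.25. A frequency-based score like this ignores call order and arguments, which is one reason a single metric cannot characterize synthetic data quality.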

[NLP-22] Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning

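[Quick Read]: This paper addresses the problem that most computational representations of lexical meaning leave implicit the situated dimensions of the contexts words take part in, even though these dimensions are real and systematic. The key to the solution is Scene Abstraction, a framework that uses few-shot prompting of a large language model to build structured representations of the scenes in which a word is used. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions). Using the COCA-Scenes dataset and two experiments, the authors show that scenes are reliably identifiable across human observers (82.4% accuracy, 11.8 percentage points above text-only embeddings) and that the scene profiles align more closely with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference).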

Link: https://arxiv.org/abs/2605.22542
Authors: Yejin Cho,Katrin Erk
Affiliations: The University of Texas at Austin; University of Massachusetts, Amherst
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Coffee and tea share many properties, yet they evoke strikingly different situations, atmospheres, and affective associations. These situated dimensions of word meaning are real and systematic, but they remain implicit in most computational representations of lexical meaning. We propose Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence from two experiments suggesting that scenes are reliably identifiable across human observers (82.4% accuracy, +11.8 pp over text-only embeddings) and that our scene profiles more closely align with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference across three semantic dimensions).

[NLP-23] SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

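[Quick Read]: This paper addresses the insufficient robustness of spatial intelligence in current multimodal large language models (MLLMs) under real-world conditions. Existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook degradations common in deployment (motion blur, low light, adverse weather, lens distortion, compression artifacts), so they cannot accurately assess performance in such conditions. The key to the solution is SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. Its core is a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The authors also build SpaceDG-Bench, a human-verified benchmark with 1,102 questions covering 11 spatial reasoning categories and 9 degradation types, for over 10K visual question answering (VQA) instances. Experiments show that visual degradation substantially impairs the spatial reasoning of existing MLLMs, while fine-tuning on SpaceDG markedly improves robustness under degradation and can even surpass human performance in degraded conditions without any drop on clean images.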

Link: https://arxiv.org/abs/2605.22536
Authors: Xiaolong Zhou,Yifei Liu,Ziyang Gong,Jiarui Li,Qiyue Zhao,Muyao Niu,Yuanyuan Gao,Le Ma,Xue Yang,Hongjie Zhang,Zhihang Zhong
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; University of Electronic Science and Technology of China; Chongqing University; The University of Tokyo; Beihang University; Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes:

Abstract:Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

[NLP-24] Polite on the Surface Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation

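[Quick Read]: This paper addresses the severe pragmatic gap that current multilingual large language models (MLLMs) show in low-resource languages such as Bangla, especially in handling structural variation, regional idioms, and honorific consistency. The key to the solution is BLADE, a culturally aligned instruction-tuning dataset of 4,196 carefully curated interaction pairs. Using this dataset, the authors systematically fine-tune and evaluate leading open-weight models, including DeepSeek-8B and LLaMA-3.2-3B, with parameter-efficient fine-tuning (LoRA adapters) in a 4-bit NormalFloat (NF4) quantization framework. Results show substantial improvements in structural fidelity and honorific alignment, providing a benchmark for bridging pragmatic disparities in low-resource multilingual text generation.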

Link: https://arxiv.org/abs/2605.22487
Authors: Md. Asaduzzaman Shuvo,Mahedi Hasan,Md. Tashin Parvez,Azizul Haque Noman,Md. Shafayet Hossain Ovi
Affiliations: United International University, Bangladesh
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross-lingual conversational capabilities, yet modeling culturally nuanced and context-dependent communication remains a critical bottleneck. Specifically, existing state-of-the-art models exhibit a severe pragmatic gap when handling structural variations, regional idioms, and honorific consistencies in low-resource contexts like Bangla. To address this limitation, we introduce BLADE (BangLa Application and DialoguE generation), a novel, culturally aligned instruction-tuning dataset and benchmarking framework comprising 4,196 meticulously curated interaction pairs. We leverage this resource to systematically fine-tune and evaluate leading open-weight architectures, including DeepSeek-8B and LLaMA-3.2-3B, utilizing parameter-efficient fine-tuning via LoRA adapters in a 4-bit NormalFloat (NF4) quantization framework. Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource multilingual text generation. Code and dataset: this https URL

[NLP-25] Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

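[Quick Read]: This paper addresses the computational cost of maintaining and updating state in entity tracking over long sequences. Existing task-specific attention operators can compress deep Transformer stacks into a few layers through multi-hop state propagation, but their dense evaluation remains expensive. The authors observe that learned attention in this setting is strongly structured: most attention mass concentrates in local block-diagonal neighborhoods, with only a light cross-block residue. Building on this, they derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system, giving subquadratic complexity in sequence length ($O(n^{4/3}d)$, or $O(n^{7/3})$ when $d \approx n$). On controlled tracking benchmarks the method matches the accuracy of the dense operator while reducing wall-clock time by 12–29% under a standardized measurement protocol, and it is up to 2.4 times faster than a compact dense Transformer at comparable exact-match accuracy. Ablations cover block size and model capacity, and one limitation is identified: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.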

Link: https://arxiv.org/abs/2605.22476
Authors: Hangyue Zhao,Paul Caillon,Erwan Fagnou,Alexandre Allauzen
Affiliations: ESPCI PSL, Paris, France; LAMSADE, Université Paris Dauphine - PSL, Paris, France
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 12 pages, 1 figure, 9 tables

Abstract:Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length, $O(n^{4/3} d)$ (and $O(n^{7/3})$ when $d \approx n$). On controlled tracking benchmarks, our method matches the dense operator’s accuracy while reducing wall-clock time by 12-29% under a standardized measurement protocol, and is up to 2.4$\times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.

[NLP-26] In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks

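[Quick Read]: This paper addresses speech understanding in multi-talker environments, where the core challenge is a cognitive bottleneck that the Ease of Language Understanding (ELU) model describes as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely physical acoustic criteria and ignore the cognitive penalty of informational masking. The key to the solution is an in silico simulation of the RAMPHO buffer using the frame-by-frame phonetic entropy of a self-supervised acoustic model (wav2vec 2.0). By contrasting a semantically intact distractor with a phase-decorrelated distractor (the Concentration Shield) across a signal-to-noise ratio (SNR) sweep, the author dissociates the cognitive penalty of informational distraction from the physical penalty of energetic decay. The results reveal a cognitive-acoustic Pareto optimization problem: destroying a distractor's semantic payload provides a release from informational masking at high SNRs, but degrades temporal glimpsing cues at low SNRs.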

Link: https://arxiv.org/abs/2605.22465
Authors: Stefan Bleeck
Affiliations: Institute of Sound and Vibration Research (ISVR), University of Southampton
Categories: Computation and Language (cs.CL)
Notes:

Abstract:The fundamental challenge of listening in multi-talker environments is a cognitive bottleneck, defined by the Ease of Language Understanding (ELU) model as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely for physical acoustics, failing to account for the cognitive penalty of informational masking. Here, we present an in silico simulation of the RAMPHO buffer using the frame-by-frame phonetic entropy of a self-supervised acoustic model (wav2vec 2.0). By contrasting a semantically intact distractor with a phase-decorrelated distractor (the Concentration Shield) across a signal-to-noise ratio (SNR) sweep, we successfully dissociate the cognitive penalty of informational distraction from the physical penalty of energetic decay. The simulation reveals a cognitive-acoustic Pareto optimization problem: destroying a distractor’s semantic payload provides a release from informational masking at high SNRs, but fundamentally degrades temporal glimpsing cues at low SNRs.
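
Frame-by-frame phonetic entropy can be sketched as the Shannon entropy of each frame's posterior over phonetic classes. The posteriors below are hypothetical four-class toy values; obtaining them from wav2vec 2.0 is outside this sketch, and the paper's exact definition may differ.

```python
import math

def frame_entropy(posterior):
    """Shannon entropy in bits of one frame's posterior over phonetic classes."""
    return -sum(p * math.log2(p) for p in posterior if p > 0.0)

def mean_phonetic_entropy(frames):
    """Average entropy over all frames of an utterance."""
    return sum(frame_entropy(f) for f in frames) / len(frames)

# Hypothetical posteriors: confident frames for clean speech, flatter
# frames when a competing talker is mixed in.
clean = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
masked = [[0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25]]
clean_entropy = mean_phonetic_entropy(clean)
masked_entropy = mean_phonetic_entropy(masked)
```

Higher mean entropy indicates that the model is less certain which phone is present, which is the quantity the study tracks across the SNR sweep.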

[NLP-27] From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

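[Quick Read]: This paper addresses how to carry out causal feature analysis in transformer language models, that is, how to identify and validate which features have a causal effect on a specific task such as Indirect Object Identification (IOI). The key to the solution is a five-stage methodology (probe design, feature extraction, causal validation, robustness testing, and deployment integration), demonstrated end-to-end on GPT-2 small performing IOI. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives a recovery of +1.02), and a sparse autoencoder (SAE) recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation shows that these features are selective but only partially causal: after ablating fifteen of them, the model remains accurate on 98% of prompts. Two NLA-inspired evaluations show that the fifteen features explain only 31% of activation variance (versus 99.7% for the SAE) and that selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts exposes a gap between detection robustness and causal robustness: the circuit transfers cleanly, but feature ablation effects degrade substantially. A cost-based deployment evaluation (assuming a cost of 50 per false negative, 0.42 per false positive, and a 2% error rate) finds an optimal monitor configuration that cuts cost per 1000 queries from 1000 to 8.96, a 99.1% saving, with the optimal composition strategy varying with cost ratio and base rate. Taken together, the stages produce findings that no single stage would.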

Link: https://arxiv.org/abs/2605.22462
Authors: Caleb Munigety
Affiliations: Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE’s 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed 50/FN, 0.42/FP, 2% error rate) finds an optimal monitor configuration yielding 8.96 per 1000 queries against a 1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.
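
The deployment figures can be checked with a few lines of arithmetic. The sketch assumes that the 1000 baseline is the cost of running without a monitor, where every error in 1000 queries is a missed false negative; that reading is an assumption, though it reproduces the stated baseline.

```python
queries = 1000
error_rate = 0.02      # assumed base rate of model errors
cost_per_fn = 50.0     # assumed cost of one missed error (false negative)

# Without a monitor, every error goes undetected and is charged as a false negative.
baseline = queries * error_rate * cost_per_fn

monitored = 8.96       # reported cost per 1000 queries with the best monitor
saving = 1.0 - monitored / baseline
saving_percent = round(saving * 100, 1)
```

Under these assumptions the baseline is 20 missed errors at 50 each, and the reported 8.96 corresponds to the stated 99.1% saving.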

[NLP-28] Cohesion-6K: An Arabic Dataset for Analyzing Social Cohesion and Conflict in Online Discourse

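[Quick read]: This paper addresses the neglect of subtle social cohesion dynamics in Arabic social media, namely how to quantify and understand the interaction between divisive and unifying narratives. Its key contribution is Cohesion-6K, an annotated dataset of six thousand Arabic Facebook posts about the Israeli-Palestinian conflict. Each post is assigned to one of five categories on a continuum from conflict to cohesion (Conflict, Resolution, Community Engagement, Supportive Interactions, and Shared Values), and annotation combines human judgment with ChatGPT-assisted pre-labeling, reaching high agreement (Cohen's kappa = 0.85). Quantitative analysis reveals a significant engagement gap: conflict-oriented posts receive two to four times more user interaction than resolution-oriented ones (p < 0.01), indicating that divisive discourse enjoys a visibility advantage on Arabic social platforms. The dataset offers a transparent, reproducible resource for computational social science, digital communication, and Arabic natural language processing.

Link: https://arxiv.org/abs/2605.22447
Authors: Aisha Ali Al-Athba, Wajdi Zaghouani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: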


Abstract: The study of online discourse has become central to understanding societal polarization. While much research has focused on detecting overt toxicity, the subtle dynamics of social cohesion, meaning the interaction between divisive and unifying narratives, remain computationally underexplored (Bail, 2021; Gonzalez-Bailon and Lelkes, 2023). This paper presents Cohesion-6K, a manually and ChatGPT-assisted annotated dataset of six thousand Arabic public Facebook posts related to the Israeli Occupation of Palestine. Each post is assigned to one of five discourse categories that represent a continuum from conflict to cohesion: Conflict, Resolution, Community Engagement, Supportive Interactions, and Shared Values. The annotation process combines expert human judgment with model-assisted pre-labeling verified by trained annotators, achieving substantial inter-annotator agreement (Cohen's kappa = 0.85). Quantitative analysis reveals a consistent engagement gap, where conflict-oriented posts receive between two and four times more user interaction than resolution-oriented ones (p < 0.01). This pattern illustrates how divisive discourse tends to attract disproportionate visibility in Arabic social media spaces. Cohesion-6K provides a transparent and reproducible resource for the study of online cohesion and polarization. The dataset, annotation guidelines, and preprocessing code will be released for research use under an open license, supporting future work in computational social science, digital communication, and Arabic natural language processing.
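For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two annotators. The toy labels are invented for illustration and are not drawn from Cohesion-6K; the abstract does not describe how the authors computed their figure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # category if each labels independently with their own marginals.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical, for illustration only).
annotator_1 = ["Conflict", "Conflict", "Resolution", "Shared Values", "Conflict", "Resolution"]
annotator_2 = ["Conflict", "Conflict", "Resolution", "Shared Values", "Resolution", "Resolution"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa discounts the raw agreement rate by the agreement expected from the annotators' label frequencies alone, which is why it is usually preferred over plain percent agreement for skewed category distributions.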

[NLP-29] Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

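[Quick read]: This paper addresses the fact that hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization, while prior work has treated the two separately and lacks a systematic approach to generating effective counterspeech (CS) where both appear together. The key to its solution is three knowledge-driven generation strategies: prompting a large language model (LLM) with fact-checkers' guidelines and fact-checking articles; prompting it with NGOs' guidelines and reports; and a mixed strategy that combines knowledge from both sources. Experiments show that although the LLM output is adequate in 40% of cases, expert revision substantially improves naturalness, exhaustiveness, and adherence to guidelines. Based on the post-edited CS, the mixed strategy performs best in crowdsourced evaluation, pairing strong factual correction with stereotype mitigation and empathetic engagement.

Link: https://arxiv.org/abs/2605.22435
Authors: Genoveffa Martone, Helena Bonaldi, Marco Guerini
Affiliations: Fondazione Bruno Kessler, Italy; Università Cattolica del Sacro Cuore, Italy
Categories: Computation and Language (cs.CL)
Comments: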


Abstract:Hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization. Given their scale, using Large Language Models (LLMs) to assist expert counterspeech (CS) writing has gained interest, yet prior work has addressed these phenomena separately. We bridge this gap by studying CS generation in contexts where both hate and misinformation co-occur. We test three knowledge-driven generation strategies: first we prompt an LLM with fact-checkers’ guidelines and fact-checking articles; secondly, with NGOs’ guidelines and reports; thirdly, we create a mixed strategy that combines guidelines and documents from both. 23 experts revise the generated CS, which are assessed via human and automatic metrics. While LLMs produce adequate CS in 40% of cases, expert edits substantially improve naturalness, exhaustiveness, and adherence to guidelines. Based on the post-edited CS, the mixed strategy proves to be the most effective in crowdsourcing evaluation, pairing strong factual correction with stereotype mitigation and empathetic engagement. We release a dataset of hateful and misinformed claims with expert-verified CS and supporting knowledge.

[NLP-30] DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

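[Quick read]: This paper addresses a challenge that large language model (LLM) agents face in long-term memory question answering: answer-supporting evidence is often scattered across long conversational histories and buried in irrelevant content, so conventional memory systems struggle to extract high-quality evidence relevant to the current query. The key to its solution is DeferMem, a long-term memory framework that decouples the problem into two stages: high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem organizes raw history with a lightweight segment-link structure and retrieves broad candidates at query time; a memory distiller trained with the DistillPO reinforcement learning algorithm then distills the high-recall but noisy candidates into faithful, self-contained, query-conditioned evidence. DistillPO formulates distillation as a structured action comprising message selection and evidence rewriting, and optimizes it with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment: reward components are gated from validity checks to quality checks, task-level correctness feedback is exposed early, and each reward is assigned to the output span responsible for it. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.

Link: https://arxiv.org/abs/2605.22411
Authors: Jianing Yin, Tan Tang
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages, 3 figures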


Abstract:Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.

[NLP-31] Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

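[Quick read]: This paper addresses how to build high-quality ingredient embeddings in a multilingual setting that capture semantic relations between ingredients and their links to flavor compounds. The key to its solution is Epicure, a family of three sibling skip-gram ingredient embeddings trained from scratch on a corpus of 4.14M recipes spanning seven languages (English, Chinese, Russian, and others), with raw ingredient strings normalized to 1,790 canonical entries via an LLM-augmented pipeline. The models are built on two graphs: a 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph (2,247 compound nodes across 15 categories). The three variants (Cooc, Chem, and Core) share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core injects ingredient-ingredient walks at a controlled mixing rate, giving a continuous transition between chemical features and recipe context that balances flavor science against culinary practice.

Link: https://arxiv.org/abs/2605.22391
Authors: Jakub Radzikowski, Josef Chen
Affiliations: KAIKAKU.AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: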


Abstract:We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
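The ingredient-ingredient graph above is weighted by normalized pointwise mutual information (NPMI). As a rough sketch of how such edge weights can be derived from recipes, the code below uses the standard definition, NPMI = PMI / (-log p(a, b)), which lies in [-1, 1]. The recipe data and the function name `npmi_edges` are hypothetical, and the paper's actual counting and filtering choices are not given in the abstract.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes):
    """Return {(a, b): npmi} for ingredient pairs that co-occur at least once."""
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # count each ingredient once per recipe
        single.update(items)
        pair.update(combinations(items, 2))
    edges = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab == 1.0:
            edges[(a, b)] = 1.0  # pair is in every recipe; use the limit value
            continue
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        edges[(a, b)] = pmi / -math.log(p_ab)
    return edges


# Toy corpus (hypothetical, for illustration only).
recipes = [
    ["tomato", "basil"],
    ["tomato", "basil", "garlic"],
    ["garlic", "onion"],
    ["tomato", "onion"],
]
edges = npmi_edges(recipes)
for pair_key, weight in sorted(edges.items()):
    print(pair_key, round(weight, 3))
```

Positive weights mark pairs that co-occur more often than their individual frequencies would predict; negative weights mark pairs that co-occur less often than chance.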

[NLP-32] Unified Data Selection for LLM Reasoning

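[Quick read]: This paper addresses an efficiency bottleneck in training large language models (LLMs) for complex long chain-of-thought (long-CoT) reasoning: the shortage of high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. The key to its solution is High-Entropy Sum (HES), a training-free metric that scores reasoning quality by summing only the entropy of the top (e.g., 0.5%) highest-entropy tokens in each reasoning sample. HES is validated across three mainstream training paradigms, supervised fine-tuning (SFT), rejection fine-tuning (RFT), and reinforcement learning (RL), with consistent effectiveness and significantly reduced computational overhead. In SFT, training on the top 20% HES-ranked data matches full-dataset performance; in RL, HES-selected successful trajectories significantly surpass the other compared methods. HES thus offers a unified, efficient, and robust route to developing advanced reasoning in LLMs.

Link: https://arxiv.org/abs/2605.22389
Authors: Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu
Affiliations: University of Science and Technology of China; Alibaba Group; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: Under Review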


Abstract:Effectively training Large Language Models (LLMs) for complex, long-CoT reasoning is often bottlenecked by the need for massive high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. To address this, we propose High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality by summing only the entropy of the top (e.g., 0.5%) highest-entropy tokens in each reasoning sample. We validate HES across three mainstream training paradigms: Supervised Fine-tuning (SFT), Rejection Fine-tuning (RFT), and Reinforcement Learning (RL), with extensive results demonstrating its consistent effectiveness and significantly reduced computational overhead. In SFT, training on the top 20% HES-ranked data matches full-dataset performance, while using the lowest-HES data degrades it. In RFT, our HES-based training approach significantly outperforms baseline methods. In RL, HES-selected successful trajectories enable the model to learn strong reasoning patterns, significantly surpassing other compared methods. Our findings establish HES as a robust, training-free metric that enables a unified, effective, and efficient method for developing advanced reasoning in LLMs.
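The HES definition in the abstract is compact enough to sketch directly. The version below assumes per-token entropies are already available (for example from a model's next-token distributions) and keeps at least one token for very short samples; that floor, and the names `token_entropy` and `high_entropy_sum`, are assumptions for illustration rather than details from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_sum(token_entropies, top_fraction=0.005):
    """Sum the entropies of the top `top_fraction` highest-entropy tokens."""
    if not token_entropies:
        return 0.0
    # Assumption: keep at least one token so short samples still get a score.
    k = max(1, math.ceil(top_fraction * len(token_entropies)))
    return sum(sorted(token_entropies, reverse=True)[:k])


# Toy sample (hypothetical): 400 tokens, most of them low-entropy.
entropies = [0.1] * 396 + [2.0, 1.5, 1.0, 0.5]
hes = high_entropy_sum(entropies)  # top 0.5% of 400 tokens is 2 tokens
print(hes)
```

Ranking a dataset by this score and keeping the top 20% would correspond to the SFT selection setting described in the abstract.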

[NLP-33] Multi-Stage Training for Abusive Comment Detection in Indic Languages

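[Quick read]: This paper addresses abusive content detection on social platforms, with a focus on detecting harmful speech while keeping the false-positive rate low so that legitimate expression is not unduly restricted. The key to its solution is a pipeline that combines language-based preprocessing with an ensemble of several models; through extensive experimentation, the pipeline is tuned to minimize cases where non-abusive content is marked as abusive.

Link: https://arxiv.org/abs/2605.22380
Authors: Pranshu Rastogi, Madhav Mathur, Ramaneswaran S, Kshitij Mohan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 4 pages, EAM2021 selected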


Abstract:In recent years social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive and it has become increasingly important to detect such content. In this paper, we use a language-based preprocessing and an ensemble of several models and analyze their performance of abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive as abusive) so that these systems can detect abusive comments without undermining the freedom of expression.

[NLP-34] Boundary-targeted Membership Inference Attacks on Safety Classifiers

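[Quick read]: This paper addresses privacy leakage from safety classifiers in generative AI systems, which are trained on sensitive content such as discussions of self-harm and mental health, and in particular the risk that membership inference attacks (MIAs) identify specific training examples. The key to its solution is a new boundary-targeted selection strategy that focuses on the examples on which the model is least confident. These examples reflect a localized failure of generalization, where the model relies on memorization rather than generalization to resolve ambiguity in the training set. Experiments show that with this strategy an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress at a 5% false-positive rate, 3.5 times more than state-of-the-art MIA methods alone. The study also finds that content-based filtering is ineffective as protection, whereas existing noise strategies can effectively mitigate the susceptibility of these boundary examples.

Link: https://arxiv.org/abs/2605.22373
Authors: Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah
Affiliations: University of Sheffield; Carnegie Mellon University; MBZUAI
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: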


Abstract: Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that the examples on which the classifier is least confident are informative for an adversary inferring membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples that amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is 3.5 times more than attacking using state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, while existing noise strategies can effectively mitigate the susceptibility of these examples.
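Two ingredients of the evaluation described above can be sketched generically: picking the examples nearest a binary classifier's decision boundary, and reporting the true-positive rate at a fixed false-positive rate. This is an illustrative reading only. The membership scores, the 0.5 boundary for a binary classifier, and both function names are assumptions; the paper's actual attack is not specified in the abstract.

```python
def least_confident(probs_by_id, k):
    """Return the k example ids whose positive-class probability is nearest 0.5."""
    ranked = sorted(probs_by_id, key=lambda i: abs(probs_by_id[i] - 0.5))
    return ranked[:k]


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """True-positive rate at a threshold that keeps FPR <= max_fpr."""
    allowed = int(max_fpr * len(nonmember_scores))  # non-members we may flag
    ranked = sorted(nonmember_scores, reverse=True)
    # An example is flagged as a member only if its score is strictly above
    # the threshold, so at most `allowed` non-members are flagged.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(s > threshold for s in member_scores) / len(member_scores)


# Toy values (hypothetical, for illustration only).
boundary_ids = least_confident({"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.45}, k=2)
nonmembers = [i / 20 for i in range(20)]
members = [0.99, 0.93, 0.5, 0.2]
tpr = tpr_at_fpr(members, nonmembers, max_fpr=0.05)
print(boundary_ids, tpr)
```

Reporting TPR at a low fixed FPR, rather than average accuracy, reflects how confidently an attacker can identify at least some training examples.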

[NLP-35] Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning

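[Quick read]: This paper asks how structured behavioral training tasks can make large language models (LLMs) exhibit specific human-like behavioral patterns (such as depression or paranoia), and whether this behavioral optimization produces systematic changes in generative distributions. The key to its solution is a behavioral induction framework: transformer-based language models are fine-tuned on synthetic datasets inspired by maladaptive behavioral patterns so that they consistently select specific classes of actions across diverse contexts. Quantitative and qualitative measures (Jensen-Shannon divergence, psychometric-style evaluations, and open-ended completions) then test whether this optimization induces detectable shifts in language generation. Models fine-tuned toward different behavioral targets show dissociable response tendencies, suggesting that the method produces differentiated policy-level biases rather than generic distributional skew. The findings link action selection to language generation and support viewing LLMs as policy-based systems that can serve as controlled testbeds for studying the relationship between behavior, interpretation, and generative language in computational models of cognition.

Link: https://arxiv.org/abs/2605.22356
Authors: Nicola Milano, Davide Marocco
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: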


Abstract:Large language models are increasingly used as computational tools for modeling human-like behavior. We introduce a behavioral induction framework that modifies model policies through fine-tuning on structured decision-making tasks: using synthetic datasets inspired by maladaptive behavioral patterns, including depression and paranoia, we train transformer-based language models to consistently select specific classes of actions across diverse contexts. We then test whether this behavioral optimization produces systematic changes in generative distributions. Across two architectures, fine-tuned models show stable, context-general shifts in next-token probability distributions, including increased probability assigned to negative and threat-related interpretations in open-ended language tasks. These effects generalize beyond training contexts and are detectable in qualitative completions, psychometric-style evaluations, and quantitative distributional metrics such as Jensen-Shannon divergence. Induced behavioral profiles also show partial specificity. Models optimized for different behavioral patterns exhibit dissociable response tendencies across evaluation probes, suggesting that structured behavioral training produces differentiated policy-level biases rather than generic distributional skew. We interpret these findings as evidence that consistent behavioral optimization in LLMs can generate stable behavioral and distributional patterns consistent with altered latent priors, linking action selection and language generation. More broadly, the results support a view of LLMs as policy-based systems in which behavioral constraints shape emergent representational structure, highlighting their potential as controlled testbeds for studying the relationship between behavior, interpretation, and generative language in computational models of cognition. 
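Jensen-Shannon divergence, one of the distributional metrics named above, compares two probability distributions over the same vocabulary. The following is a minimal sketch with made-up three-token distributions; it does not reproduce the paper's evaluation setup.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (nats) between two distributions."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy next-token distributions (hypothetical) over three candidate tokens,
# standing in for a base model and a fine-tuned model on the same prompt.
base_model = [0.7, 0.2, 0.1]
tuned_model = [0.3, 0.3, 0.4]
shift = js_divergence(base_model, tuned_model)
print(round(shift, 4))
```

The measure is symmetric and bounded above by ln 2 in nats, which makes shifts comparable across prompts.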

[NLP-36] TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

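[Quick read]: This paper addresses the heavy dependence of conventional public transit route planning on structured map infrastructure and complex routing engines, and the lack of any dataset for training models that bypass this dependency. The key to its solution is TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released with benchmark data for three complementary evaluation tasks. Experiments show that a large language model (LLM) trained on TransitLM produces structurally valid routes with high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without explicit map information, enabling end-to-end, map-free route generation learned entirely from data.

Link: https://arxiv.org/abs/2605.22355
Authors: Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu
Affiliations: AMAP, Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: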


Abstract:Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at this https URL, with evaluation code at this https URL.

[NLP-37] Pattern-and-root inflectional morphology: the Arabic broken plural

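[Quick read]: This paper addresses the complexity of describing the inflectional morphology of Arabic nouns, and in particular the limitations of the traditional root-and-pattern model for managing dictionaries and other language resources. The key to its solution is to reverse the traditional model into pattern-and-root, giving precedence to patterns over roots, and to keep the formal description of inflection separate from derivation and semantics. With this design the dictionary can be used directly for morphological analysis of text, without morphophonological rules. The taxonomy of singular patterns is simplified by specifying vowel quantity as v or vv, and root alternations and orthographical variations are encoded independently of patterns. The result is a clear, orderly, and extensible classification of noun inflection with 300 inflectional classes, suitable for processing Arabic nouns at scale.

Link: https://arxiv.org/abs/2605.22310
Authors: Alexis Amid Neme, Eric Laporte
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: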


Abstract:We present a substantially implemented model of description of the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3 200 entries of BP nouns.

[NLP-38] Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

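[Quick read]: This paper addresses a blind spot in toxicity evaluation for large language models (LLMs) in Chinese: existing detectors struggle with implicit toxicity, which often combines semantic indirectness with surface obfuscation. The key to its solution is Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework with three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, which preserve harmful intent while increasing implicitness and adding controlled surface variants. On CITA-generated samples, the seven tested detectors show substantial missed-detection risk, with an average ASR of 69.48%, and human evaluation confirms preserved harmfulness and increased implicitness. As a downstream application, a Chinese Implicit Toxicity Defense model (CITD) fine-tuned on CITA-generated red-team data shows that such data can improve robustness against implicit toxicity attacks.

Link: https://arxiv.org/abs/2605.22258
Authors: Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin
Affiliations: Dalian University of Technology; The University of Tokyo; Singapore University of Technology and Design
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures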


Abstract:Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.

[NLP-39] IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

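[Quick read]: This paper addresses the difficulty language models have with idioms, whose meaning cannot be inferred from surface form alone and therefore requires semantic abstraction beyond lexical overlap. The key to its solution is IdioLink, a retrieval benchmark that tests whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. The benchmark comprises 10,700 documents and 2,140 queries spanning 107 idioms with both literal and figurative uses, and each document and query is annotated with the spans that convey the core meaning. Experiments show that strong embedding models (e.g., BGE, E5, Contriever, and Qwen) struggle to retrieve equivalent meanings across divergent surface forms and rely instead on topical and shallow semantic cues, exposing key gaps in idiom-aware semantic retrieval and providing a challenging testbed for future work.

Link: https://arxiv.org/abs/2605.22247
Authors: Kai Golan Hashiloni, Daniel Fadlon, Lior Livyatan, Ofri Hefetz, Jiahuan Pei, Kfir Bar
Affiliations: Reichman University; Vrije Universiteit Amsterdam
Categories: Computation and Language (cs.CL)
Comments: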


Abstract:Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.

[NLP-40] GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis

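[Quick read]: This paper addresses the difficulty models have in aspect-based sentiment analysis (ABSA) of binding sentiment evidence to the correct aspect, which is essentially a fine-grained structural reasoning task. The key to its solution is GHI, an incidence-based structural reasoning layer built on a bipartite topology, which represents linguistic and semantic evidence as token-hyperedge incidence relations so that different structural signals enter through a unified interface. Across experiments on six standard ABSA benchmarks, GHI outperforms all baselines on the SemEval domains; with only 247M parameters it approaches the performance of 11B Flan-T5-based methods on the ISE benchmark, and it remains robust on the challenging ARTS datasets. These results indicate that compact structural reasoning remains a valuable alternative to scale-driven approaches for fine-grained tasks.

Link: https://arxiv.org/abs/2605.22228
Authors: Yu Du, Wenlong Zhu, Xingze Li, Chenglong Cao, Jing Wang, Yukun Ma
Affiliations: Qiqihar University
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 8 figures, 7 tables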


Abstract:Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural reasoning layer built on a bipartite topology. GHI represents diverse linguistic and semantic evidence as token–hyperedge incidence relations, allowing different structural signals to be incorporated through a unified interface. Extensive experiments on six standard ABSA benchmarks show that GHI outperforms all baselines on the SemEval domains, and multi-seed evaluations show stable improvements over strong DeBERTa. Further experiments show that with only 247M parameters, GHI approaches the performance of 11B Flan-T5 based methods on the ISE benchmark. Moreover, it demonstrates strong robustness on the challenging ARTS datasets, maintaining highly competitive performance where traditional models degrade. These results demonstrate that compact structural reasoning remains a valuable alternative to scale-driven approaches for fine-grained tasks.
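To make the token-hyperedge incidence idea concrete, here is a minimal sketch of a binary incidence matrix in which each hyperedge is a set of token positions. The example sentence, the choice of hyperedges, and the function name are hypothetical; how GHI actually constructs or conditions its hyperedges is not described in the abstract.

```python
def incidence_matrix(num_tokens, hyperedges):
    """Binary token-by-hyperedge matrix: entry [t][e] is 1 if token t is in hyperedge e."""
    matrix = [[0] * len(hyperedges) for _ in range(num_tokens)]
    for e, members in enumerate(hyperedges):
        for t in members:
            if not 0 <= t < num_tokens:
                raise IndexError(f"token index {t} out of range")
            matrix[t][e] = 1
    return matrix


tokens = ["The", "battery", "is", "great", "but", "the", "screen", "is", "dim"]
# Hypothetical hyperedges: two aspect-opinion groups and two clause spans.
hyperedges = [{1, 3}, {6, 8}, {0, 1, 2, 3}, {5, 6, 7, 8}]
H = incidence_matrix(len(tokens), hyperedges)
for token, row in zip(tokens, H):
    print(f"{token:8s} {row}")
```

Each column can group any number of tokens, so heterogeneous signals (spans, syntactic groups, aspect-opinion links) share one representation, which is the property the abstract calls a unified interface.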

[NLP-41] Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

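[Quick read]: This paper addresses the widespread instability and collapse of self-play reinforcement learning for training language models. Existing work mostly treats this as a reward-design problem; the paper argues instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Controlled experiments, including a deterministic task that strips pretraining priors, output ambiguity, and executor noise, show that the two levers are asymmetric: a strict gate is sufficient for stability under every reward variant tested, including a self-consistency reward with no access to ground truth, whereas no reward variant is sufficient once the gate is removed. The asymmetry exposes a counter-intuitive coupling, the Grounded Proposer Paradox: a proposer with ground-truth access, when paired with a self-consistency solver, accelerates collapse because it concentrates training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. The conclusion is that data-level gating, not reward calibration, is the binding constraint on self-play stability.

Link: https://arxiv.org/abs/2605.22217
Authors: Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: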


Abstract: Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth, while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.

[NLP-42] Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus

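[Quick read]: This paper addresses the scarcity of Arabic text data on women's empowerment and social wellbeing, in order to support large-scale analysis of gender discourse, social reform, and emotional engagement. The key to its solution is the Arabic Women and Society Corpus, a collection spanning 2013 to 2024 of 252,487 public Arabic Facebook posts from 51,660 pages across 77 countries, with more than 267 million user interactions. An automated pipeline with language identification, normalization, and metadata cleaning ensures reliability and reproducibility, providing a data foundation for Arabic natural language processing (NLP), computational social science, and digital communication studies.

Link: https://arxiv.org/abs/2605.22204
Authors: Wajdi Zaghouani, Mabrouka Bessghaier, MD. Rafiul Biswas, Shimaa Amer Ibrahim
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: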


Abstract:This paper presents the Arabic Women and Society Corpus, a ten year collection of 252,487 public Arabic Facebook posts related to women’s empowerment and social wellbeing. The corpus was collected from 51,660 pages across 77 countries between 2013 and 2024, resulting in more than 267 million user interactions. Each post includes engagement metrics such as shares, comments, and emotional reactions, providing a unique view of audience sentiment and social attention. The data were processed using an automated pipeline with language identification, normalization, and metadata cleaning to ensure reliability and reproducibility. The corpus enables large scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects. It supports research in Arabic natural language processing, computational social science, and digital communication studies. The dataset and accompanying documentation will be released under request for research use.

[NLP-43] Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

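[Quick read]: This paper examines how the choice of text chunking method affects retrieval performance within a Retrieval-Augmented Generation (RAG) framework applied to agricultural documents in Khmer, a morphologically complex, low-resource language. It systematically compares four chunking strategies (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based), encoding chunks with the BGE-M3 multilingual embedding model and retrieving them with the FAISS library, and evaluates them under 5-fold cross-validation with four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union (IoU). Character-based Recursive chunking with a chunk size of 300 characters performs best and is significantly better than Sentence-Based chunking in L2 distance (p = 0.0121), underscoring the importance of segmentation granularity and structural preservation for dense retrieval in low-resource languages.

Link: https://arxiv.org/abs/2605.22203
Authors: Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn
Affiliations: Chungbuk National University; BigDataLabs Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 1 figure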


Abstract:In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
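A character-based recursive chunker of the kind compared above can be sketched in a few lines. The version below is a generic illustration, not the authors' implementation: the separator list, the default of 300 characters (taken from the best-performing setting in the abstract), and the function name are assumptions.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and recurses to finer ones only for
    pieces that are still too long; falls back to a hard character cut.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Toy input (hypothetical): 919 characters of space-separated words.
sample = ("alpha beta gamma delta " * 40).strip()
chunks = recursive_chunks(sample, chunk_size=300)
print(len(chunks), [len(c) for c in chunks])
```

The separator that divides two chunks is dropped at the boundary, so the chunks preserve all non-separator characters but do not concatenate back to the exact original string.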

[NLP-44] Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

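[Quick Read]: This paper asks how high-performing embedding models organize their embedding spaces and how that organization relates to model performance. The key to its approach is a systematic evaluation of 25 contemporary embedding models on five MTEB tasks, which finds that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances correlate strongly with task performance (correlations of up to 0.97). It also shows that embedding tasks differ in their degree of linearity and in how much they rely on retaining local information, which informs possible future embedding training objectives and the optimization of conditional embeddings.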

Link: https://arxiv.org/abs/2605.22202
Authors: Amanda Myntti,Jenna Kanerva,Veronika Laippala,Filip Ginter
Affiliations: University of Turku; ELLIS Institute Finland
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we show that high-performing embedding models organize their embedding spaces in a consistent way. We evaluate 25 contemporary embedding models on five MTEB tasks spanning four diverse task categories (retrieval, bitext mining, pair classification, and summarization) in both English and multilingual settings, and reveal that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances strongly correlate (even up to 0.97) with performance on the given task. Ultimately, we show that embedding tasks display varying degrees of linearity and reliance on retention of local information. Our results further the understanding of embeddings, their relation to model performance, and shed light on possible future training objectives and optimizing conditional embeddings.
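One way to make "nearest-neighbor overlap between paired text instances" concrete is sketched below. The abstract does not give the exact definition, so this is an assumed formulation: for each pair (for example, a sentence and its translation), take the k nearest neighbors of each side within its own set and measure how many neighbor indices they share. The function names and the cosine-similarity choice are illustrative assumptions, and vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence, Set

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn(vectors: List[Sequence[float]], i: int, k: int) -> Set[int]:
    """Indices of the k vectors most similar to vectors[i], excluding i."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
    return set(others[:k])

def neighbor_overlap(side_a: List[Sequence[float]],
                     side_b: List[Sequence[float]],
                     k: int) -> float:
    """Mean fraction of shared k-NN indices between paired instances."""
    n = len(side_a)
    shared = sum(len(knn(side_a, i, k) & knn(side_b, i, k)) / k for i in range(n))
    return shared / n


# Row i of each side embeds the two members of pair i.
side_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
side_b_same = [list(v) for v in side_a]
side_b_scrambled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(neighbor_overlap(side_a, side_b_same, k=1))       # identical structure
print(neighbor_overlap(side_a, side_b_scrambled, k=1))  # disrupted structure
```

A score near 1 means both sides of each pair sit in the same local neighborhood, which is the kind of structure retention the abstract links to benchmark performance.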

[NLP-45] Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

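[Quick Read]: This paper addresses a limitation of current autonomous agent frameworks: they typically rely on a single monolithic large language model (LLM) and fixed logic to invoke modular skills, so they cannot exploit the complementary strengths of different models across domains, which limits downstream task performance. The key to its solution is Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that models heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Maestro trains a lightweight policy to dynamically compose frozen expert models with a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate; the policy is optimized with outcome-based RL and needs no step-level supervision. With only a 4B orchestrator, Maestro reaches an average accuracy of 70.1% on ten representative multimodal benchmarks, ahead of GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%), and its coordination policy generalizes to unseen models and skills without retraining while remaining computationally efficient.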

Link: https://arxiv.org/abs/2605.22177
Authors: Jinyang Wu,Guocheng Zhai,Ruihan Jin,Yuhao Shen,Zhengxi Lu,Fan Zhang,Haoran Luo,Zheng Lian,Zhengqi Wen,Jianhua Tao
Affiliations: Tsinghua University; Zhejiang University; The Chinese University of Hong Kong; Nanyang Technological University; Tongji University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at this https URL.

[NLP-46] Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

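[Quick Read]: This paper asks whether the internal mechanisms that multimodal Speech Language Models (SLMs) use to encode, store, and retrieve factual knowledge differ between the speech and text modalities. The key to its approach is Causal Mediation Analysis, a technique previously applied to text-based models, which is used here to probe the mechanisms behind the storage and recall of factual associations in SLMs. Initial results on SpiritLM reveal discrepancies between text-to-text and speech-to-text results, suggesting that the factual recall mechanisms found in text are only partially carried over to the speech modality, which offers insights for improving speech-enabled AI systems.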

Link: https://arxiv.org/abs/2605.22170
Authors: Luca Modica,Filip Landin,Mehrdad Farahani,Livia Qian,Gabriel Skantze,Richard Johansson
Affiliations: Zenseact; Unbox AI; Chalmers University of Technology; University of Gothenburg; KTH Royal Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: In *SEM 2026, the 15th Joint Conference on Lexical and Computational Semantics

Abstract:In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. The question then emerges about how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual association in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

[NLP-47] Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

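[Quick Read]: This paper starts from the observation that self-evolving skill libraries let a frozen large language model (LLM) accumulate reusable knowledge without weight updates, yet recent evaluation shows LLM-authored skills deliver only +0.0 percentage points (pp) while human-curated skills deliver +16.2 pp, which indicates that the bottleneck is skill lifecycle management rather than skill authoring. The key to its solution is Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100, Ratchet lifts held-out pass@1 from a 0.258 baseline to a late-window rolling mean of 0.584 (peak 0.658), while the no-skill control drifts by only +0.002. The ablations show that retirement and the meta-skill authoring prior are the load-bearing components, whereas explicit deduplication (such as canonicalisation) is subsumed by the meta-skill itself and needs no separate mechanism.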

Link: https://arxiv.org/abs/2605.22148
Authors: Xing Zhang,Yanwei Cui,Guanghui Wang,Ziyuan Li,Wei Qiu,Bing Zhu,Peiyang He
Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 2 figures, 6 tables. Extends arXiv:2605.19576 with the SWE-bench Verified evaluation and a non-divergence analysis (Proposition 1)

Abstract:Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver +0.0 pp over no-skill baselines while human-curated ones deliver +16.2 pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a 0.258 ± 0.047 baseline to a late-window rolling mean of 0.584 (peak 0.658 ± 0.042) across 100 rounds and 3 seeds, a +0.328 ± 0.018 rolling-mean gain where the no-skill control drifts at +0.002 ± 0.005; the same recipe transfers to an agentic solver on SWE-bench Verified (+0.22 peak lift over 20 rounds). Eight ablations (A1–A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.
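Two of the hygiene mechanisms named in this abstract, outcome-driven retirement and a bounded active-cap, could look roughly like the sketch below. Everything here is hypothetical: the class names, the thresholds (`min_uses`, `retire_below`, `active_cap`), and the smoothed ranking score are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Skill:
    name: str
    text: str          # natural-language skill body
    uses: int = 0
    wins: int = 0
    retired: bool = False

    @property
    def win_rate(self) -> float:
        return self.wins / self.uses if self.uses else 0.0

    @property
    def smoothed_rate(self) -> float:
        # Laplace smoothing so untried skills are not ranked last.
        return (self.wins + 1) / (self.uses + 2)

class SkillLibrary:
    def __init__(self, active_cap: int = 3, min_uses: int = 4,
                 retire_below: float = 0.25) -> None:
        self.active_cap = active_cap      # bound on skills shown to the agent
        self.min_uses = min_uses          # evidence needed before retiring
        self.retire_below = retire_below  # win-rate retirement threshold
        self.skills: Dict[str, Skill] = {}

    def add(self, name: str, text: str) -> None:
        self.skills[name] = Skill(name, text)

    def record(self, name: str, success: bool) -> None:
        """Log one task outcome and retire the skill if it underperforms."""
        skill = self.skills[name]
        skill.uses += 1
        skill.wins += int(success)
        if skill.uses >= self.min_uses and skill.win_rate < self.retire_below:
            skill.retired = True

    def active(self) -> List[Skill]:
        """Return at most active_cap non-retired skills, best first."""
        live = [s for s in self.skills.values() if not s.retired]
        live.sort(key=lambda s: s.smoothed_rate, reverse=True)
        return live[: self.active_cap]


lib = SkillLibrary(active_cap=2)
for name in ("a", "b", "c"):
    lib.add(name, f"skill text for {name}")
for _ in range(4):
    lib.record("a", success=False)   # repeated failures trigger retirement
for _ in range(2):
    lib.record("b", success=True)
print([s.name for s in lib.active()])
```

Retirement removes skills that demonstrably hurt, and the cap keeps the prompt budget fixed; the abstract's non-divergence proposition concerns the combination of these two controls.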

[NLP-48] Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues

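[Quick Read]: This paper addresses the fact that existing data for psychological support tasks mostly consist of single-turn question answering or short dialogues, which makes it hard to capture how college students' psychological distress accumulates, interacts, and evolves over long periods within campus life events. The key to its solution is the Psy-Chronicle framework, which builds a semester-spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events, and then uses interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, to generate long-horizon counseling dialogues with continuity across sessions. The method yields the CPCD dataset and the CPCD-Bench evaluation benchmark. Experiments show that CPCD improves session-level response quality and long-horizon memory recall, while gains in temporal-causal reasoning remain limited, indicating that event-chain organization and causal explanation are still key challenges in long-horizon counseling modeling.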

Link: https://arxiv.org/abs/2605.22140
Authors: Chaogui Gou,Jiarui Liang
Affiliations: University of Science and Technology Beijing
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In recent years, large language models have shown substantial potential in psychological support tasks. However, existing psychological counseling data mostly rely on single-turn question answering or short multi-turn dialogues, making it difficult to characterize how college students’ psychological distress accumulates, interacts, and gradually evolves over long periods within campus life events. To address this issue, this paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. We generate a semester-spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events. Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy-Chronicle generates long-horizon dialogues with continuity across counseling sessions. Based on Psy-Chronicle, we construct and open-source CPCD, a Chinese long-horizon dialogue dataset for college psychological counseling, containing 100 student profiles, 90,000 counseling dialogues. We further build CPCD-Bench to evaluate models’ long-horizon campus counseling capabilities from three dimensions: session-level response, long-horizon memory recall, and temporal-causal reasoning. Experimental results show that CPCD effectively improves session-level response generation and long-horizon memory recall for models with the same base architecture. Meanwhile, improvements in temporal-causal reasoning remain limited, indicating that event-chain organization and causal explanation are key challenges in long-horizon psychological counseling modeling. The related code and data are available at: this https URL

[NLP-49] Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

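[Quick Read]: This paper asks how an agent can control when and how deeply it plans during decision-making. Agents built as reactive policies have no explicit control over planning, which leads to much longer reasoning, inefficient token use, and no reliable accuracy gains. The key to its solution is to decompose decision-making into three systems: simulative reasoning (System II), which grounds deliberation in future-state prediction via a world model; self-regulation (System III), in which a learned configurator decides when and how deeply to plan; and reactive execution (System I), which handles fine-grained actions. This structure provides unified planning across tasks and ensures the planner is invoked only when needed. The authors propose SR²AM (Self-Regulated Simulative Reasoning Agentic LLM), which realizes simulative reasoning and self-regulation as distinct stages within the chain-of-thought and uses the LLM itself as the world model. Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. Reinforcement learning increases the average planning horizon by 22.8% while planning frequency grows by only 2.0%, showing that the model learns to plan further ahead rather than more often.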

Link: https://arxiv.org/abs/2605.22138
Authors: Mingkai Deng,Jinyu Hou,Lara Sá Neves,Varad Pimpalkhute,Taylor W. Killian,Zhengzhong Liu,Eric P. Xing
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Code and model artifacts are available at this https URL

Abstract:How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR²AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM’s chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

[NLP-50] Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

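[Quick Read]: This paper addresses the asymmetry of cultural knowledge in Large Language Models (LLMs) across languages: English prompts typically yield the highest general performance but induce a Western-centric bias, so the model fails to reflect the diverse cultural knowledge found in non-English contexts. The key to its solution is a novel self-supervised framework that uses multilingual self-consistency to identify the most reliable cultural responses across languages and a self-critique mechanism to transfer that knowledge to the weaker language. On the BLEnD benchmark, the method improves cultural alignment, raising performance on English queries by an average of 5.03% while relying entirely on self-generated data. This shows that latent cultural knowledge inside the model can be surfaced and propagated across languages, enabling more culturally equitable and consistent multilingual LLMs.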

Link: https://arxiv.org/abs/2605.22137
Authors: Andrew Ivan Soegeng,Patrick Sutanto,Tan Sang Nguyen
Affiliations: SAP; School of Computing, National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments: Accepted to The 1st Workshop on Multilinguality in the Era of Large Language Models

Abstract:Although Large Language Models (LLMs) demonstrate strong capabilities across various tasks, they exhibit significant performance discrepancies across languages. While prompting LLMs in English typically yields the highest general performance, it often induces a Western-centric bias, hindering the model’s ability to accurately reflect diverse cultural knowledge. We hypothesize that LLMs already possess rich cultural knowledge embedded within local-language representations, but fail to retrieve it when prompted in English. To bridge this cross-lingual knowledge gap, we propose a novel self-supervised framework. Our method leverages multilingual self-consistency to identify the most reliable cultural responses across languages, combined with a self-critique mechanism to transfer this knowledge to the weaker language. Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment, boosting performance on English queries by an average of 5.03% while relying entirely on self-generated data. Ultimately, our work demonstrates that latent cultural knowledge can be successfully surfaced and propagated across languages, enabling more culturally equitable and consistent LLMs.

[NLP-51] A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

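[Quick Read]: This paper addresses the weak performance of Retrieval-Augmented Generation (RAG) systems for low-resource, non-Latin-script languages such as Khmer, focusing on question answering over telecom-domain documents. Its approach has two phases. First, it compares three dense retrieval models (BGE-M3, Jina-Embeddings-v3, and Qwen3-Embedding) and identifies BGE-M3 as the best retriever. Second, with BGE-M3 fixed as the retriever, it evaluates five large language models (LLMs) as generators and finds that they lead on different metrics (faithfulness, factual correctness, answer relevance, and others), with no single model dominating. The results indicate that retriever choice remains a major bottleneck for Khmer RAG, while the best generator depends on whether the priority is grounding, factual precision, or semantic similarity.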

Link: https://arxiv.org/abs/2605.22099
Authors: Sereiwathna Ros,Phannet Pov,Ratanaktepi Chhor,Kimleang Ly,Wan-Sup Cho,Saksonita Khoeurn
Affiliations: Chungbuk National University; BigDatalabs Co., Ltd; Ministry of Post and Telecommunications
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 1 figure

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
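The retrieval metrics quoted in this abstract (Hit Rate@3, MRR@3, Precision@3) can be computed along the lines of the sketch below. It uses the commonly used definitions of these metrics; the paper may define them slightly differently (for example at chunk versus file level), and the function name and toy document IDs are assumptions for illustration.

```python
from typing import Dict, List, Set

def retrieval_metrics_at_k(ranked: List[List[str]],
                           relevant: List[Set[str]],
                           k: int = 3) -> Dict[str, float]:
    """Average Hit Rate@k, MRR@k, and Precision@k over a set of queries.

    ranked[q]   : retrieved document IDs for query q, best first
    relevant[q] : set of gold document IDs for query q
    """
    hit_total = rr_total = precision_total = 0.0
    for retrieved, gold in zip(ranked, relevant):
        flags = [doc_id in gold for doc_id in retrieved[:k]]
        # Hit Rate@k: at least one relevant document in the top k.
        hit_total += 1.0 if any(flags) else 0.0
        # MRR@k: reciprocal rank of the first relevant document, else 0.
        rr_total += next((1.0 / (rank + 1)
                          for rank, is_rel in enumerate(flags) if is_rel), 0.0)
        # Precision@k: fraction of the top k that is relevant.
        precision_total += sum(flags) / k
    n = len(ranked)
    return {
        "hit_rate": hit_total / n,
        "mrr": rr_total / n,
        "precision": precision_total / n,
    }


# Toy example: query 1 finds its gold document at rank 2, query 2 misses.
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d9"}]
print(retrieval_metrics_at_k(ranked, relevant, k=3))
```

With a single gold document per query, Precision@3 is capped at 1/3, which helps explain why precision values in this kind of setup are much lower than the corresponding hit rates.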

[NLP-52] ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination LREC2026

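[Quick Read]: This paper addresses the lack of high-quality, long-term, ecologically valid data resources for studying racism and discrimination in Arabic social media text. Existing work relies mostly on Twitter data, which does not capture language use and user interaction on platforms such as Facebook. The key to its solution is ArabDiscrim, a large-scale, decade-long (2014–2024) corpus of 293K public Arabic Facebook posts. Its main contributions are: (1) platform-native engagement signals (reactions, shares, comments, and page metadata) that enable joint analysis of language and audience response; (2) 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13 or more inflections per lemma); (3) 20 discrimination axes that capture identity-based grounds for unequal treatment; and (4) explicit attribution patterns. The resource supports weak supervision, axis-aware sampling, and platform ecology research, and provides a foundation for fairness-oriented Arabic natural language processing (NLP) that combines lexical depth with ecological validity.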

Link: https://arxiv.org/abs/2605.22081
Authors: Wajdi Zaghouani,Shimaa Amer Ibrahim,Mabrouka Bessghaier,Houda Bouamor
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026 Main Conference

Abstract:We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014–2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.

[NLP-53] Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

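[Quick Read]: This paper addresses the shortage of public resources for evaluating whether Large Language Models (LLMs) can generate structured output that satisfies both an industry-standard XML format and domain vocabulary constraints, focusing on generating Information Delivery Specification (IDS) XML for Building Information Modeling (BIM). The key to its solution is the Ishigaki-IDS-Bench benchmark, which contains 166 expert-authored and verified BIM/IDS examples (derived from 83 practical scenarios in Japanese and English), the corresponding gold IDS files, and metadata covering input format, language, turn setting, IFC version, and construction domain. Evaluation combines IDSAuditTool-based Processability, Structure, and Content audits with content-agreement evaluation against the gold files. In zero-shot evaluation, the best model reaches 65.6% macro F1 for content agreement, but only 27.7% of outputs pass the Content audit, showing that current LLMs still struggle to stably generate XML that satisfies the IDS standard and IFC vocabulary constraints.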

Link: https://arxiv.org/abs/2605.22079
Authors: Ryo Kanazawa,Koyo Hidaka,Teppei Miyamoto,Takayuki Kato,Tomoki Ando,Chenguang Wang,Dayuan Jiang,Naofumi Fujita,Shuhei Saitoh,Atomu Kondo,Koki Arakawa,Daiho Nishioka
Affiliations: ONESTRUCTION Inc.; AWS GenAI Innovation Center
Subjects: Computation and Language (cs.CL)
Comments: 7 pages; benchmark data and evaluation scripts are available on GitHub and Hugging Face

Abstract:Large language models (LLMs) are widely used to generate structured outputs such as JSON, SQL, and code, yet public resources remain limited for evaluating generation that must simultaneously satisfy industry-standard XML and domain vocabulary constraints. This paper presents Ishigaki-IDS-Bench, a benchmark for evaluating the ability to generate Information Delivery Specification (IDS) XML from Building Information Modeling (BIM) information requirements. The benchmark contains 166 BIM/IDS expert-authored and verified examples created by expanding 83 practical scenarios into Japanese and English, corresponding gold IDS files, and metadata for input format, language, turn setting, IFC version, and construction domain. Its evaluation combines IDSAuditTool-based Processability, Structure, and Content audits with content-agreement evaluation against gold IDS files. In zero-shot evaluation over 10 LLMs, the best model reaches 65.6% macro F1 for content agreement, while only 27.7% of outputs pass the Content audit. These results show that current LLMs can express part of the information requirements as IDS, but still struggle to stably generate XML that satisfies the IDS standard and IFC vocabulary constraints. Ishigaki-IDS-Bench supports comparative evaluation, failure analysis, and the development of constrained structured generation methods that conform to domain standards. We release the evaluation scripts and benchmark data under the CC BY 4.0 license on GitHub and Hugging Face.

[NLP-54] From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

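[Quick Read]: This paper addresses the inefficiency of Reinforcement Learning from Verifiable Rewards (RLVR) on hard reasoning problems: outcome-based RLVR struggles because rollouts with a correct final answer are rare and sample-level credit assignment cannot use partial progress in failed attempts. The key to its solution is SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the original problem as the final subproblem, turning partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, giving finer-grained credit assignment without external rubrics or reward models. The analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench with Qwen3-4B-Base, it further improves pass@1 by +3.7 points and pass@64 by +4.6 points, indicating better exploration on hard reasoning problems.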

Link: https://arxiv.org/abs/2605.22074
Authors: Xitai Jiang,Zihan Tang,Wenze Lin,Yang Yue,Shenzhi Wang,Gao Huang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
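The subproblem-level normalization described in this abstract can be illustrated with the small sketch below. It assumes a GRPO-style mean/standard-deviation normalization applied separately to each subproblem position across a group of rollouts; the exact formula, the `eps` constant, and the function name are assumptions, since the abstract does not spell them out.

```python
import math
from typing import List

def subproblem_advantages(rewards: List[List[float]],
                          eps: float = 1e-6) -> List[List[float]]:
    """Normalize rewards independently at each subproblem position.

    rewards[i][j] is the verifiable reward of rollout i on subproblem j;
    the last column is the original problem. The returned advantages[i][j]
    would be assigned to the tokens of rollout i's j-th answer span.
    """
    n_rollouts = len(rewards)
    n_subproblems = len(rewards[0])
    advantages = [[0.0] * n_subproblems for _ in range(n_rollouts)]
    for j in range(n_subproblems):
        column = [rewards[i][j] for i in range(n_rollouts)]
        mean = sum(column) / n_rollouts
        std = math.sqrt(sum((r - mean) ** 2 for r in column) / n_rollouts)
        for i in range(n_rollouts):
            advantages[i][j] = (column[i] - mean) / (std + eps)
    return advantages


# Four rollouts, three subproblems; nobody solves the final (original) problem.
rewards = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
for row in subproblem_advantages(rewards):
    print([round(a, 3) for a in row])
```

In this toy group the final-answer column is all zeros, so outcome-only normalization would give every rollout a zero advantage; the earlier subproblem columns still produce non-zero advantages, which is the sense in which partial progress becomes a learning signal.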

[NLP-55] Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

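[Quick Read]: This paper addresses a faithfulness problem that arises when reinforcement learning is applied to multimodal large language models (MLLMs): perception and reasoning become disconnected, so a model may perceive the visual evidence correctly yet drop or contradict it during reasoning, which limits gains on multimodal benchmarks. The key to its solution is Faithful-MR1, a training framework that models faithful multimodal reasoning explicitly in two stages. The Anchoring stage turns perception into a pre-reasoning subtask and supervises the attention of a dedicated Focus token directly against image regions rather than through textual descriptions. The Reinforcing stage uses counterfactual image intervention to reward answer-correct trajectories that concentrate visual attention where vision causally matters, closing the gap between perception and reasoning.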

Link: https://arxiv.org/abs/2605.22072
Authors: Changyuan Tian,Zhicong Lu,Huaxing Liu,Xiang Wang,Shuai Li,Yu Chen,Wenqian Lv,Zichuan Lin,Juncheng Diao,Deheng Ye
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 7 figures, 3 tables. Preprint

Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated Focus token’s attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.

[NLP-56] Hy-MT2: A Family of Fast Efficient and Powerful Multilingual Translation Models in the Wild

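[Quick Read]: This paper addresses the performance, efficiency, and deployment-flexibility challenges that multilingual translation models face in complex real-world scenarios, in particular how to stay lightweight and fast while keeping translation quality high. The key to its solution is the Hy-MT2 model family, which comes in three sizes (1.8B, 7B, and 30B-A3B, the last a Mixture-of-Experts model) and supports translation among 33 languages. With AngelSlim 1.25-bit extreme quantization, the 1.8B model needs only 440 MB of storage and runs inference 1.5x faster, which suits on-device deployment. Multi-dimensional evaluations show strong results on general, real-world business, domain-specific, and instruction-following translation tasks: the 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, and the lightweight 1.8B model surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall.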

Link: https://arxiv.org/abs/2605.22064
Authors: Mao Zheng,Zheng Li,Tao Chen,Bo Lv,Mingrui Sun,Mingyang Song,Jinlong Song,Hong Huang,Decheng Wu,Hai Wang,Yifan Song,Yanfeng Chen,Guanwei Zhang,Guanghua Yu,Yi Su,Hong Liu,Jinxiang Ou,Keyao Wang,Weile Chen,Haozhao Kuang,Kai Wang,Nuo Chen,Zihao Zheng,Chenhao Wang,Bin Xing,Chengcheng Xu,Tinghao Yu,Binghong Wu,Long Xu,Jiacheng Shi,Yunhao Wang,Baifang Chen,Lei Zhang,Qi Yang,Zhao Wu,Jiacheng Li,Lan Jiang,Lanrui Wang,Kai Zhang,Shuaipeng Li,Zhongzhi Chen,Weixuan Sun,Jiaqi Zhu,An Wang,Wei Li,Jun Xia,Weidong Han,Wutian Yang,Litong Hui,Luoguo Jia,Jiajia Wu,Xinpeng Zhou,Tianxiang Fei
Affiliations: Tencent Hunyuan Team
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. For on-device deployment, with AngelSlim 1.25-bit extreme quantization, the 1.8B model requires only 440 MB of storage and improves inference speed by 1.5x. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall.

[NLP-57] FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing

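[Quick Read]: This paper addresses the mismatch between static capability descriptions and evolving capabilities in enterprise routing systems: expert agents change as their prompts, tools, and models are updated, but their deployed profiles are rarely kept current, which degrades query assignment. The key to its solution is FlyRoute, a self-evolving agent profiling framework that grows capability evidence from real traffic. It quality-gates successful query-agent pairs into each agent's success store, periodically distills that evidence into learned capability descriptions, and injects those descriptions together with BM25-retrieved successes into a large language model (LLM) router. To stay data-efficient, FlyRoute adds a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, so that under-profiled agents are explored only for plausible queries and redundant evidence collection is avoided. With only five seed queries per agent, FlyRoute raises a zero-shot LLM router from 72.57% to 78.04% accuracy; after streaming 7,211 labeled queries through the flywheel, accuracy reaches 89.83%, well above both the cold-start and zero-shot baselines, with consistent gains across four expert domains.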

Link: https://arxiv.org/abs/2605.22057
Authors: Rongjun Li,Ziyu Zhou,Yihang Wu
Affiliations: IT Innovation and Research Center, Huawei Technologies
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures, 5 tables

Abstract:Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars current. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent’s success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26pp over zero-shot; +11.79pp over cold start), with consistent gains across four expert domains under standard routing accuracy on single-gold test queries.

[NLP-58] HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering IJCAI2026

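Quick read: This paper targets catastrophic forgetting in continual visual question answering (VQA) caused by non-stationary data streams, and in particular the cross-level task interference that arises when existing methods update a largely shared parameter set, which hinders accurate adaptation to the current task and object. The key to the solution is the HyLoVQA framework. It maintains a drift-resilient memory bank that stores anchors for visual object content and textual tasks and updates them with current input features. Conditioned on the retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters for parameter-efficient adjustment. An alignment loss additionally ties semantic discrepancies in feature space to functional changes in parameter space, keeping the LoRA adapters focused on the current task and object and improving the stability and accuracy of continual learning.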

Link: https://arxiv.org/abs/2605.22035
Authors: Yiran Wang,Chenyi Xiong,Ziyue Qin,Miao Zhang,Kui Xiao,Zhifei Li
Affiliations: Hubei University; Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University); Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by IJCAI 2026

Abstract:Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.

[NLP-59] LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

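Quick read: This paper addresses the weak performance of multimodal large language models (MLLMs) when reasoning requires fine-grained cross-modal evidence. In particular, conventional text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, which weakens temporal grounding and biases intermediate reasoning toward language priors. The key to the solution is a unified latent space that preserves dense sensory information while remaining compatible with autoregressive generation. Concretely, the proposed LatentOmni framework aligns latent reasoning states with task-relevant sensory features through feature-level supervision, uses Omni-Sync Position Embedding (OSPE) to keep latent audio and visual states temporally consistent, and builds LatentOmni-Instruct-35K, a dataset of 35K audio-visual interleaved reasoning trajectories, for training and evaluation. Experiments show that LatentOmni outperforms the evaluated open-source models on several audio-visual reasoning benchmarks, supporting the effectiveness of joint reasoning in latent space.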

Link: https://arxiv.org/abs/2605.22012
Authors: Yifan Dai,Zhenhua Wu,Bohan Zeng,Daili Hua,Jialing Liu,Bozhou Li,Yuran Wang,Chengzhuo Tong,Hao Liang,Xiaochen Ma,Junbo Niu,Tianyu Guo,Yang Shi,Yue Ding,Yiyan Ji,Bingyin Mei,Yushuo Guan,Yuanxing Zhang,Pengfei Wan,Fangcheng Fu,Wentao Zhang
Affiliations: Shanghai Jiao Tong University; Kuaishou Technology; Peking University; HKUST; CASIA; Nanjing University; Renmin University of China; Tsinghua University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 15 figures

Abstract:Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.

[NLP-60] Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer

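Quick read: The prevailing view holds that hallucination stems from missing knowledge at generation time: a model answers incorrectly only when the correct answer is absent from its probability distribution. The authors re-examine this assumption with a semantic notion of answer availability, which aggregates the token-level variants that express the same answer and so determines whether the correct answer concept is already present at the moment of generation. Across Qwen and Llama models from 0.8B to 72B parameters, 16% to 47% of hallucinations by instruction-tuned (Instruct) models occur when the correct concept already carries substantial probability mass, and this share rises monotonically with model scale. Further analysis shows that what separates correct generations from hallucinations is not whether the correct concept is present but how its probability is distributed: correct generations concentrate mass on a single surface form, while hallucinations disperse it across alternatives. This sharpening asymmetry also holds across multi-token generation and is detectable in pre-generation hidden states. The central conclusion is that instruction tuning improves helpfulness by sharpening answer commitment, and that the same sharpening produces confident hallucination; the two are consequences of one underlying disposition.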

Link: https://arxiv.org/abs/2605.22007
Authors: Jewon Yeom,Jaewon Sok,Heejun Kim,Seonghyeon Park,Jeongjae Park,Taesup Kim
Affiliations: Seoul National University; Gwangju Institute of Science and Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
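
To make the distinction between availability and commitment concrete, here is a small sketch that aggregates next-token probability over surface variants of one answer and measures how concentrated that mass is. The function names, the variant list, and the toy distributions are hypothetical; the paper's exact metrics are not given in the abstract.

```python
def concept_mass(next_token_probs, variants):
    """Total probability on any surface form of one answer concept."""
    return sum(next_token_probs.get(v, 0.0) for v in variants)

def concentration(next_token_probs, variants):
    """Share of the concept's mass carried by its single most likely form."""
    mass = concept_mass(next_token_probs, variants)
    if mass == 0:
        return 0.0
    return max(next_token_probs.get(v, 0.0) for v in variants) / mass

# Hypothetical surface variants of the correct answer.
paris = ["Paris", " Paris", "paris", "PARIS"]

# Two toy next-token distributions with the same mass on the correct concept.
sharp = {"Paris": 0.55, " Paris": 0.03, "paris": 0.02, "Lyon": 0.40}
diffuse = {"Paris": 0.20, " Paris": 0.15, "paris": 0.15, "PARIS": 0.10, "Lyon": 0.40}

for name, dist in [("sharp", sharp), ("diffuse", diffuse)]:
    top_token = max(dist, key=dist.get)
    print(name, round(concept_mass(dist, paris), 2),
          round(concentration(dist, paris), 2), top_token)
```

Both toy distributions place the same total mass on the correct concept, yet in the diffuse case the single most likely token is the wrong answer, which is the pattern the abstract describes for hallucinations.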

[NLP-61] Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

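Quick read: This paper asks how to identify semantically interpretable subspaces from the weights of a large language model (LLM) without running inference, and what those subspaces reveal about training data composition, content bias, and potential ethical risks. The key to the solution is a singular value decomposition (SVD) of the weight matrix of the model's final linear layer (lm_head): analyzing how the left singular vectors map onto vocabulary items directly exposes interpretable semantic subspace structure. The method needs only five lines of PyTorch and no model inference, which enables a static audit of the model's internal representations. The study further introduces the Vocabulary Cluster Score (VCS) to quantify subspace coherence and the Weighted Projection Score (WPS) to detect anomalous tokens such as glitch tokens. It finds that ethically sensitive content originates mainly in pretraining rather than in post-training alignment, and therefore recommends lm_head SVD analysis as a standard safety audit step before model release.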

Link: https://arxiv.org/abs/2605.22005
Authors: Hisashi Miyashita
Affiliations: Mgnite Inc.
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model – requiring only five lines of PyTorch and no model inference – reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model’s training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design. 
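
The idea of reading vocabulary clusters off the left singular vectors can be sketched without any dependencies. The example below is not the authors' five lines of PyTorch, which the abstract does not reproduce; it uses power iteration on a tiny hypothetical lm_head matrix (rows are vocabulary tokens, columns are hidden dimensions) to find the top singular direction, then lists the tokens with the largest loadings. On a real model one would presumably call a library routine such as `torch.linalg.svd` on the actual weight matrix instead.

```python
import math

def top_singular_triple(W, iters=100):
    """Power iteration on W^T W; returns (sigma, u, v) for the largest singular value."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        # u = W v, then v = W^T u, then renormalize v.
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

# Hypothetical toy lm_head: 4 vocabulary tokens, hidden size 2.
vocab = ["cat", "dog", "the", "of"]
lm_head = [[3.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

sigma, u, v = top_singular_triple(lm_head)
# Tokens with the largest absolute loading on the top left singular vector.
ranked = sorted(zip(vocab, u), key=lambda pair: -abs(pair[1]))
top_tokens = [tok for tok, _ in ranked[:2]]
print(round(sigma, 4), top_tokens)
```

In this toy matrix the dominant direction groups the two content-word rows together, which mirrors how a cluster of related tokens would surface in one singular direction.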

[NLP-62] Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems EMNLP2026

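Quick read: This paper shows that the injection detectors currently used to protect large language model (LLM) agents fail badly against domain camouflaged injection: when an attack payload mimics the domain vocabulary and authority structures of the target document, standard detectors do not flag it and detection rates drop sharply. The key contribution is to define and quantify the Camouflage Detection Gap (CDG), which exposes the large difference in detection performance between static and domain-camouflaged payloads, and to verify through systematic experiments that this blind spot is widespread and statistically significant across tasks, model families, and safety classifiers. The study also finds that multi-agent debate architectures amplify static injection attacks (by up to 9.9x on smaller models) while stronger models show collective resistance, and that augmenting the detector alone helps only partially, suggesting that for weaker models the vulnerability is architectural.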

Link: https://arxiv.org/abs/2605.22001
Authors: Aaditya Pai
Affiliations: Columbia University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, 2 tables. Submitted to EMNLP 2026 ARR cycle

Abstract:Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, what we call domain camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant (χ² = 38.03, p < 0.001 for Llama; χ² = 17.05, p < 0.001 for Gemini), with zero reverse discordant pairs in either case. We additionally evaluate Llama Guard 3, a production safety classifier, which detects zero camouflage payloads (IDR_camouflage = 0.000), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting the vulnerability is architectural rather than incidental for weaker models. Our framework, task bank, and payload generator are released publicly.
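
Since the CDG is defined as a simple difference of detection rates, the gaps implied by the reported figures can be worked out directly. The function name below is illustrative; the rates are the ones quoted in the abstract.

```python
def camouflage_detection_gap(idr_static, idr_camouflage):
    """CDG: detection rate on static payloads minus rate on camouflaged payloads."""
    return idr_static - idr_camouflage

# Detection rates (percent) as reported in the abstract: (static, camouflaged).
reported = {
    "Llama 3.1 8B": (93.8, 9.7),
    "Gemini 2.0 Flash": (100.0, 55.6),
}

gaps = {model: round(camouflage_detection_gap(s, c), 1)
        for model, (s, c) in reported.items()}
print(gaps)  # {'Llama 3.1 8B': 84.1, 'Gemini 2.0 Flash': 44.4}
```

The gaps are in percentage points, so the Llama detector loses about 84 points of detection rate under camouflage and the Gemini detector about 44.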

[NLP-63] Echo: Learning from Experience Data via User-Driven Refinement

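Quick read: This paper starts from two limitations: static human data is expensive to scale and bounded by its creators' knowledge, while training directly on raw experience data (logs of agent-environment interaction) is inefficient because the logs are noisy and low in information density. The key to the solution is Echo, a general framework that turns raw experience into learnable knowledge by systematically capturing how users refine agent outputs, converting low-quality interaction records into high-quality training signals for continuous model improvement. In a production code completion setting, Echo raises the acceptance rate from 25.7% to 35.7%, breaking through the performance ceiling of static training.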

Link: https://arxiv.org/abs/2605.21984
Authors: Hande Dong,Xiaoyun Liang,Jiarui Yu,Jiayi Lin,Changqing Ai,Feng Liu,Wenjun Zhang,Rongbi Wei,Chaofan Zhu,Linjie Che,Feng Wu,Xin Shen,Dexu Kong,Xiaotian Wang,Qiuyuan Chen,Bingxu An,Yueting Lei,Qiang Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Static “human data” faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from “experience data” - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively “echoing” environmental feedback back into the training loop for model optimization. In today’s agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents’ crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.

[NLP-64] SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents

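Quick read: This paper addresses the substantial latency that large language models incur in multi-hop tool use, where the model repeatedly waits for external tool responses. The key to the solution is lossless speculation: faster but less reliable speculator tools accelerate inference without changing the trajectory the model would otherwise have taken. The core contribution is SpecHop, a continuous speculation framework that maintains multiple speculative threads, asynchronously verifies speculated observations against target tool outputs, commits correct branches, and rolls back incorrect ones, reducing wall-clock time while preserving accuracy. Experiments show that SpecHop approaches the theoretically optimal latency gain and reduces latency by up to 40% on retrieval-augmented multi-hop tasks.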

Link: https://arxiv.org/abs/2605.21965
Authors: Mehrdad Saberi,Keivan Rezaei,Soheil Feizi
Affiliations: University of Maryland, College Park
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: this https URL
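
A toy latency model may help show why committing correct speculations saves wall-clock time while rollbacks cost nothing beyond the sequential baseline. This is a deliberately simplified sketch under stated assumptions (unlimited threads, zero model compute time, fixed tool latencies) and is not the paper's theoretical framework; the function names and numbers are hypothetical.

```python
def sequential_latency(num_hops, target_time):
    """Baseline: every hop waits for the slow target tool before the next starts."""
    return num_hops * target_time

def speculative_latency(spec_correct, target_time, spec_time):
    """Latency when each hop also launches a fast speculator.

    spec_correct[i] is True when the speculator's predicted observation for
    hop i matches what the target tool later returns.
    """
    start = 0.0
    for ok in spec_correct[:-1]:
        # Correct guess: the next hop effectively began when the speculator
        # returned. Wrong guess: that branch is rolled back and the next hop
        # restarts once the target tool's true observation arrives.
        start += spec_time if ok else target_time
    # The last hop's observation must come from the target tool itself.
    return start + target_time

hops = 4
print(sequential_latency(hops, 1.0))                  # 4.0
print(speculative_latency([True] * hops, 1.0, 0.25))  # 1.75
print(speculative_latency([False] * hops, 1.0, 0.25)) # 4.0
```

Because every committed observation is the target tool's own output, the final trajectory in this model is identical to the unaccelerated one, which is the sense in which speculation is lossless.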

[NLP-65] Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines EMNLP2026

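Quick read: This paper studies how to repair multi-module large language model (LLM) agents when they fail, that is, how to locate and intervene on the module that is degrading performance. Common practice assumes that the module most responsible for a failure is the best place to patch, but the paper empirically demonstrates a "Diagnostic Paradox": although causal analysis consistently identifies the routing module (which selects the next tool to call) as the main bottleneck, injecting prompt-level correction examples into that module clearly degrades performance, whereas patching the upstream query-rewriting module reliably improves it. The proposed explanation is the Linguistic Contract hypothesis: downstream modules implicitly adapt to the characteristic error distribution of their upstream modules, so directly correcting the bottleneck breaks this implicit alignment, while upstream corrections have no such side effect. The authors derive a per-agent co-adaptation measure from the diagnosis alone and find that it tracks patching harm: higher co-adaptation co-occurs with harm and lower co-adaptation with safety. The trend holds across three independent agent families, giving preliminary cross-agent support for the hypothesis.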

Link: https://arxiv.org/abs/2605.21958
Authors: Yoon Jeonghun,Kim Dongchan
Affiliations: KAIST (Korea Advanced Institute of Science and Technology), Republic of Korea; NAVER Corp., Seongnam, Republic of Korea; NAVER Corp., Bellevue, WA, USA; OpenAI; Meta; Alibaba; Anthropic; Google
Categories: Computation and Language (cs.CL)
Comments: Preprint. Under review at EMNLP 2026 (ARR)

Abstract:When a multi-module LLM agent fails, the module most responsible for the failure is not necessarily the best place to intervene. We demonstrate this Diagnostic Paradox empirically: causal analysis consistently identifies the routing module – which selects which tool to call next – as the primary bottleneck across three independent agent families. Yet injecting prompt-level correction examples into this module consistently degrades performance, sometimes severely. Patching an upstream query-rewriting module instead reliably improves outcomes. The effect holds with statistical significance on two agent families and directional consistency on a third; alternative repair strategies at the routing module (instruction rewriting, model upgrade) are neutral, confirming that the harm is specific to correction-injection patching. We explain this asymmetry through the Linguistic Contract hypothesis: each downstream module implicitly adapts to its upstream’s characteristic error distribution, so correcting the bottleneck breaks this implicit alignment in a way that upstream corrections do not. We operationalize this via a per-agent co-adaptation measure, derived from diagnosis alone, and show it is consistently associated with patching harm across agent families: higher co-adaptation co-occurs with harm, lower with safety. This trend holds across all three agent families, providing preliminary support for the hypothesis beyond a single-agent observation.

[NLP-66] Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

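Quick read: This paper targets a limitation in how medical retrieval-augmented generation (RAG) systems are evaluated in high-risk question answering: a single answer-or-abstain decision cannot reflect the system's actual judgment when the evidence is mixed, conditional, or contradictory. The key to the solution is the claim-selective certification framework: each response is decomposed into verifiable claims, each claim is scored against the retrieved evidence, and an intent-aware selector maps the result to one of four labels: full, partial, conflict, or abstain. On the naturally occurring non-abstain actions, the full system reaches action accuracy of 0.9204 on the dev set and 0.8997 on the test set. The UCCR metric quantifies unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstention when no evidence is available, which separates action-label prediction from evidence-linked claim selection.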

Link: https://arxiv.org/abs/2605.21949
Authors: Shao Kan
Affiliations: Jinglue Technology Development (Nanjing) Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 7 figures, 11 tables

Abstract:Medical RAG systems in high-risk QA settings are often evaluated through a single answer-or-abstain decision, but mixed evidence may support one claim, require conditions for another, and contradict a third. We study claim-selective certification: each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to full, partial, conflict, abstain. On the primary weak-label certificate protocol, whose real-source-only dev/test rows cover the naturally occurring non-abstain actions, the full system records UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, PAU Precision=0.9739, and action accuracy=0.8997 on test (n=319). UCCR measures unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstain under empty evidence. Shortcut controls quantify the action-label prior explained by source and intent metadata, while source/evidence-novel slices characterize transfer boundaries. The resulting interface separates action-label prediction from evidence-linked claim selection under mixed evidence.
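
The mapping from per-claim verdicts to a response-level action can be illustrated with a rule-based sketch. The verdict labels, the precedence of the rules, and the function name are assumptions chosen for illustration; the paper's selector is described as intent-aware, and its actual decision logic is not given in the abstract.

```python
def select_action(claim_verdicts):
    """Map per-claim evidence verdicts to one of: full, partial, conflict, abstain.

    Each verdict is one of the hypothetical labels
    "supported", "conditional", "contradicted", "unsupported".
    """
    if not claim_verdicts or all(v == "unsupported" for v in claim_verdicts):
        # Nothing in the retrieved evidence backs any claim.
        return "abstain"
    if "contradicted" in claim_verdicts:
        # Evidence disagrees with at least one claim.
        return "conflict"
    if all(v == "supported" for v in claim_verdicts):
        return "full"
    # Some claims are supported, others are conditional or unsupported.
    return "partial"

print(select_action(["supported", "supported"]))
print(select_action(["supported", "conditional", "unsupported"]))
print(select_action(["supported", "contradicted"]))
print(select_action([]))
```

Under these assumed rules, a mixed-evidence response is no longer forced into a single answer-or-abstain outcome: only the supported claims would be released under a partial certificate.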

[NLP-67] Planning in the LLM Era: Building for Reliability and Efficiency ICAPS2026

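Quick read: This paper asks how to make the planning capability of intelligent agents more reliable and resource-efficient in the era of large language models (LLMs). Early approaches relied on single-shot plan generation or on hybrids that coupled LLMs with limited external search; these are unsound and incomplete and perform poorly on unseen problems. The paper observes that recent work has shifted toward using LLMs at solution construction time to generate verifiable symbolic solvers, which reduces dependence on LLMs at inference time and yields more efficient, maintainable planning systems. The key idea is to use LLMs for planner generation rather than for plan execution directly, balancing accuracy, efficiency, and maintainability and moving the planning field toward more reliable and lightweight systems.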

Link: https://arxiv.org/abs/2605.21902
Authors: Michael Katz,Harsha Kokel,Kavitha Srinivas,Shirin Sohrabi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published at ICAPS 2026

Abstract:Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods, unsound and incomplete by their very nature, often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs become clearer, recent work has shifted toward using them at solution construction time – generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path towards generating maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps towards a more reliable and efficient LLM-based generation of planners.

[NLP-68] Token-weighted Direct Preference Optimization with Attention

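Quick read: This paper addresses a limitation of Direct Preference Optimization (DPO): it assigns the same weight to every token in a response during training, ignoring that tokens may differ in how much they matter to human preference judgments. Existing token-level preference optimization methods estimate token weights with position-based heuristic functions or with a separately trained model, which lacks robustness and adds training cost. The key to the solution is Token-weighted DPO (TwDPO), a new training objective grounded in token-weighted reinforcement learning, together with a concrete instantiation, AttentionPO, which uses the large language model's (LLM's) own attention to estimate token weights and so achieves content-aware weighting without extra training. The weights are computed with two forward passes, and the method outperforms existing preference optimization methods on AlpacaEval, MT-Bench, and ArenaHard.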

Link: https://arxiv.org/abs/2605.21883
Authors: Chengyu Huang,Zhuohang Li,Sheng-Yen Chou,Claire Cardie
Affiliations: Cornell University; Vanderbilt University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existing token-level PO methods compute the token weights using either token-position-based heuristic functions or probability estimates given by a separately trained model, which lacks robustness and incurs extra training cost. In contrast, we propose Token-weighted DPO (TwDPO) – a novel training objective grounded on token-weighted RL – and AttentionPO – an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to serve as a pairwise judge and check where the model attends when comparing the responses. This design makes AttentionPO content-aware, adjusting weights based on response content, and efficient, incurring only two extra forward passes per example. Experiment results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing Preference Optimization methods.
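
One plausible way to introduce per-token weights into the DPO objective is to weight each token's policy-versus-reference log-ratio before summing. The sketch below shows that form; it is an assumption made for illustration and not necessarily the TwDPO objective, which the abstract does not spell out. With all weights equal to 1 it reduces to the standard DPO loss. The function names and the toy log-probabilities are hypothetical.

```python
import math

def weighted_logratio(policy_logps, ref_logps, weights):
    """Sum over tokens of w_t * (log pi(y_t) - log pi_ref(y_t))."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))

def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    `chosen` and `rejected` are (policy_logps, ref_logps, weights) triples.
    """
    margin = beta * (weighted_logratio(*chosen) - weighted_logratio(*rejected))
    # -log sigmoid(m) = log(1 + exp(-m)); fall back to -m for very negative m.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin

# Toy example: the policy improved on the first token of the chosen response.
rejected = ([-1.0, -2.0], [-1.0, -2.0], [1.0, 1.0])
uniform = ([-0.5, -2.0], [-1.0, -2.0], [1.0, 1.0])
upweighted = ([-0.5, -2.0], [-1.0, -2.0], [2.0, 1.0])

print(token_weighted_dpo_loss(uniform, rejected))
print(token_weighted_dpo_loss(upweighted, rejected))
```

Raising the weight on the token where the policy gained probability enlarges the margin and lowers the loss, which is the mechanism by which token weights would steer the gradient toward preference-relevant tokens.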

[NLP-69] Hypergraph as Language

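Quick read: This paper observes that current graph modeling methods based on large language models (LLMs) still take the pairwise edge as their basic unit, which makes it hard to capture the high-order associations that are common in real-world data and that are more naturally expressed as hypergraphs. When applied to hypergraphs, existing methods fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, which limits their ability to exploit complex structures.

The key to the solution is the "Hypergraph as Language" perspective and the Hyper-Align framework. A fixed-shape hybrid template, HIDT-O (Hypergraph Incidence Detail Template with Overview), encodes high-order association structures into hypergraph tokens that a base LLM can consume directly. A Hypergraph Incidence Projector (HIP) maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. Finally, a unified Hypergraph-as-Language input protocol feeds hypergraph tokens together with textual prompts into a frozen base LLM, supporting vertex-level and hyperedge-level tasks under one question-answering paradigm.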

Link: https://arxiv.org/abs/2605.21858
Authors: Mengqi Lei,Guohuan Xie,Shihui Ying,Shaoyi Du,Jun-Hai Yong,Siqi Li,Yue Gao
Affiliations: Tsinghua University; Shanghai University; Xi’an Jiaotong University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have recently shown strong potential in modeling relational structures. However, existing approaches remain fundamentally graph-centric: they focus on processing pairwise graph structures into tokens that LLMs can understand. In contrast, many real-world relational patterns do not naturally conform to the pairwise-edge assumption, and are better modeled as high-order associations in hypergraphs. For hypergraph structures, existing methods often fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, limiting their ability to exploit complex structures. To address this limitation, we put forth the “Hypergraph as Language” perspective and propose Hyper-Align, a hypergraph-native alignment framework for large language models. Hyper-Align compiles the query-object-centered hypergraph context into hypergraph tokens directly consumable by a base LLM. Specifically, we introduce Hypergraph Incidence Detail Template with Overview (HIDT-O), which serializes high-order association structures into a fixed-shape hybrid template combining local incidence details and overview-level summaries. We then design a Hypergraph Incidence Projector (HIP), which maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. We further define a concrete Hypergraph-as-Language input protocol, which jointly feeds hypergraph tokens and textual prompts into a frozen base LLM, supporting both vertex-level and hyperedge-level tasks under a unified question-answering paradigm. To systematically evaluate different methods in hypergraph structural modeling, we introduce HyperAlign-Bench. Extensive experiments show that Hyper-Align significantly outperforms existing methods across in-domain and zero-shot evaluations.
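
For readers unfamiliar with hypergraph incidence structure, the following sketch shows one round of bidirectional message passing: vertex features are averaged into each hyperedge, then hyperedge features are averaged back into each vertex. It uses plain mean aggregation on scalar features and is only meant to illustrate the data flow; the HIP module described in the abstract is a learned projector, and the toy hypergraph and function names here are hypothetical.

```python
# Hypothetical hypergraph: each hyperedge joins several vertices at once.
hyperedges = {"e1": ["a", "b", "c"], "e2": ["c", "d"]}
features = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 5.0}

def vertex_to_edge(hyperedges, vertex_feats):
    """Each hyperedge takes the mean feature of the vertices it contains."""
    return {e: sum(vertex_feats[v] for v in vs) / len(vs)
            for e, vs in hyperedges.items()}

def edge_to_vertex(hyperedges, edge_feats):
    """Each vertex takes the mean feature of the hyperedges it belongs to."""
    incident = {}
    for e, vs in hyperedges.items():
        for v in vs:
            incident.setdefault(v, []).append(edge_feats[e])
    return {v: sum(fs) / len(fs) for v, fs in incident.items()}

edge_feats = vertex_to_edge(hyperedges, features)
updated = edge_to_vertex(hyperedges, edge_feats)
print(edge_feats)  # {'e1': 2.0, 'e2': 4.0}
print(updated)     # {'a': 2.0, 'b': 2.0, 'c': 3.0, 'd': 4.0}
```

Vertex `c` belongs to both hyperedges, so its update mixes information from all four vertices in one round, something a pairwise-edge encoding would need several edges to express.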

[NLP-70] ACC: Compiling Agent Trajectories for Long-Context Training

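Quick read: This paper addresses the difficulty of training large language models (LLMs) for long-context reasoning, and specifically how to use the trajectories that agents produce over multi-turn interactions to strengthen modeling of long-range context dependencies. Standard supervised fine-tuning (SFT) trains only turn-level tool selection and ignores the scattered evidence contained in environment observations and tool responses across turns, so that supervision signal goes unused. The key to the solution is Agent Context Compilation (ACC), which converts multi-turn trajectories from search, software engineering, and database querying agents into long-context question-answer pairs that combine the original question with information gathered across turns, training the model to integrate scattered evidence and answer directly without tool calls. By making the dependencies between the question and the evidence explicit, ACC supervises long-range context reasoning without additional annotation cost and is compatible with existing long-context extension techniques. Experiments show clear gains on the MRCR and GraphWalks benchmarks while general capabilities are preserved.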

Link: https://arxiv.org/abs/2605.21850
Authors: Qisheng Su,Zhen Fang,Shiting Huang,Yu Zeng,Yiming Zhao,Kou Shi,Ziao Zhang,Lin Chen,Zehui Chen,Lijun Wu,Feng Zhao
Affiliations: University of Science and Technology of China; Shanghai Innovation Institute; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.
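
The compilation step can be pictured as flattening a trajectory's observations into one long context followed by the original question. The sketch below shows that shape only; the trajectory schema (`action`, `observation` keys), the prompt layout, and the function name are assumptions for illustration, since the abstract does not specify the format ACC uses.

```python
def compile_trajectory(question, turns, final_answer):
    """Flatten tool responses from a multi-turn trajectory into one QA example."""
    context = "\n\n".join(
        f"[Observation {i}]\n{turn['observation']}"
        for i, turn in enumerate(turns, start=1)
        if turn.get("observation")  # skip turns that returned nothing
    )
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "answer": final_answer,
    }

# Hypothetical three-turn search trajectory.
turns = [
    {"action": "search", "observation": "Doc A says X."},
    {"action": "think", "observation": ""},
    {"action": "open", "observation": "Doc B says Y."},
]
example = compile_trajectory("What do docs A and B say?", turns, "X and Y")
print(example["prompt"])
```

The resulting example asks the model to answer from evidence that was originally spread across turns, so the loss falls on integrating distant segments rather than on choosing the next tool.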

[NLP-71] Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

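Quick read: This paper addresses the loss of faithfulness that dictionary-based interpretability methods (such as sparse autoencoders and transcoders) show on out-of-distribution (OOD) data. These explainers are trained on in-distribution (ID) activations, but when the input distribution shifts, the internal subspace the model actively uses rotates, leaving the explainer's dictionary misaligned with the new active subspace. The key to the solution is the Geometry-Adaptive Explainer (GAE), which uses only unlabeled OOD activations to geometrically adjust the explainer's dictionary so that it realigns with the OOD-active subspace while preserving the original feature structure. The method needs no gradient updates and is proven to improve over the unadapted ID explainer, narrowing the faithfulness gap, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.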

Link: https://arxiv.org/abs/2605.21849
Authors: Sungjun Lim,Heedong Kim,Andrew Lee,Kyungwoo Song
Affiliations: Yonsei University; Harvard University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Mechanistic interpretability aims to explain a model’s behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer’s dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer’s dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.

[NLP-72] Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

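[Quick read]: This paper addresses how to extract structured information from death investigation narratives in order to understand the circumstances preceding suicide, particularly circumstances that require semantic inference rather than simple keyword matching. The key to the solution is a "Complexity Score" algorithm that analyzes the structure of the coding manual to predict when detailed prompts containing the full coding guidelines outperform name-only prompts; on top of it, the authors build a hybrid approach that selects the prompt strategy per circumstance. On 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS), large language models (LLMs) substantially outperform a fine-tuned RoBERTa model on low-prevalence circumstances, and the framework shows consistent performance patterns across GPT-5.2, Gemini 2.5 Pro, and Llama-3 70B. The findings support a hybrid architecture in which LLMs handle rare, inferentially complex circumstances and fine-tuned models handle common ones.

Link: https://arxiv.org/abs/2605.21845
Authors: Geoffrey Martin, Xuan Zhong Feng, Yifan Peng
Affiliations: Weill Cornell Medicine; Cornell University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE ICHI 2026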

Abstract:Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a “Complexity Score” algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We found that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.

[NLP-73] Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention

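[Quick read]: This paper targets a limitation of standard Transformer attention: it computes pairwise query-key similarity while treating all tokens as equally salient, ignoring differences in their intrinsic information content. In turbulent fluid dynamics, coherent structures persist amid background chaos yet carry a disproportionate share of the energy and govern transport. By analogy, the author argues that informationally dense tokens (such as syntactic heads and discourse markers) should attract more attention, while low-information tokens (function words, repeated patterns) should be down-weighted. The proposed Energy-Gated Attention (EGA) extracts the spectral energy of key embeddings through a single learned linear projection and uses it to gate value aggregation, with no measurable computational cost. EGA improves validation loss by 0.103 on TinyShakespeare and 0.101 on Penn Treebank while adding only 12,480 parameters (about 0.26% overhead). A systematic ablation shows that the optimal energy direction is data-adaptive and non-sinusoidal, and the learned energy threshold converges to τ ≈ 0.35, corresponding to the roughly 36% of tokens in English text that carry above-average spectral energy, which is consistent with the fraction of content words in running English text.

Link: https://arxiv.org/abs/2605.21842
Authors: Athanasios Zeris
Affiliations: Independent Researcher
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: 12 pages, 4 figures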

Abstract:Standard transformer attention computes pairwise similarity between queries and keys, treating all tokens as equally salient regardless of their intrinsic informational content. In turbulent fluid dynamics, coherent structures – the energetically dominant, spatially organized patterns that persist amid background chaos – carry a disproportionate fraction of total energy and govern all transport. We propose that tokens play an analogous role in transformer attention: informationally dense positions (morphological boundaries, syntactic heads, discourse markers) concentrate spectral energy and should attract proportionally more attention than background tokens (function words, repeated patterns, low-information filler). We propose Energy-Gated Attention (EGA): a simple modification that gates value aggregation by the spectral energy of key token embeddings, computed by a single learned linear projection that discovers the dominant spectral mode of the embedding field. On TinyShakespeare, EGA achieves +0.103 validation loss improvement with only 12,480 additional parameters (0.26% overhead) and no measurable computational cost. The result is consistent on Penn Treebank (+0.101), demonstrating dataset independence. A systematic ablation across three wavelet families (fixed Morlet, Daubechies db2/db4, and a parametric Morlet) establishes that fixed structured bases are suboptimal – the optimal energy direction is data-adaptive and non-sinusoidal – while identifying learned wavelet packets as a promising open direction. The learned energy threshold converges to tau ~= 0.35 independently of initialization, corresponding to the fraction (~36%) of tokens carrying above-average spectral energy in English text, a stable linguistic property consistent with the fraction of content words in running English text.

[NLP-74] Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

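[Quick read]: This paper asks whether language models preserve the ordinal meaning of intensity words when those words must produce numeric actions. Specifically, it studies whether a model's numeric responses remain semantically ordered across 10 degree modifiers, from "slightly" to "drastically", when it allocates resources according to natural-language instructions. The approach is a controlled resource-allocation environment with 6,620 runs at fixed temperatures (T=0.0 and T=0.7) that isolates the effects of the intensity word and the starting system state, and quantifies the distribution and ordering of model outputs. The model compresses the 10 intensity words into 5 distinct median outputs, and its numeric responses depend heavily on the current system state. Near feasibility limits it exhibits three behavioral modes: weak words make small adjustments, strong words abstain, and "drastically" pushes to the local ceiling. The model's numeric interpretation of vague intensity words is therefore compressed, state-dependent, and discontinuous near operational boundaries.

Link: https://arxiv.org/abs/2605.21827
Authors: Daniel Tabach (Georgia Institute of Technology)
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 figures, 2 tables, 16 references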

Abstract:Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model’s numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model’s numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.

[NLP-75] When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

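[Quick read]: This paper addresses the fact that current medical large language models (LLMs) are trained and evaluated mainly on common, guideline-driven cases and struggle with the rare, complex real-world clinical scenarios that guidelines do not cover (the "long tail" of care). Existing evaluations rely mostly on recalling memorized content or on multiple-choice tests, which do not measure evidence-based reasoning when no standardized pathway exists. The key contribution is OGCaReBench, a free-form, retrieval-focused benchmark built from published case reports and designed to evaluate open-ended, evidence-driven clinical reasoning on rare cases. Experiments show that even the best-performing baseline (GPT-5.2) answers only 56% of questions correctly without augmentation, while retrieving external medical articles raises performance to as much as 82%, underscoring the central role of evidence grounding in reliable clinical reasoning.

Link: https://arxiv.org/abs/2605.21807
Authors: Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar
Affiliations: The Ohio State University; The Ohio State University Wexner Medical Center; University of Chicago Medical Center
Categories: Computation and Language (cs.CL)
Comments: 34 pages, 20 figures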

Abstract:Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify best studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form retrieval-focused benchmark aimed at evaluating LLMs at answering clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark with specialized models only reaching 42%. Augmenting models with retrieved medical articles improves this performance to up to 82% (using GPT-5.2) highlighting the importance of evidence-grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.

[NLP-76] Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

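[Quick read]: This paper addresses the difficulty of distinguishing informative from noisy signals in model-generated outputs during post-training. In critic-free settings, existing methods use response-level entropy estimates as uncertainty signals to regulate group-based optimization methods such as GRPO, but their effect on optimization dynamics is unclear and their empirical gains are unstable. The key to the solution is a principled uncertainty framework, Geometric-aware Calibrated Policy Optimization (GCPO), which closes two gaps identified in current entropy-based estimators: the anisotropic gap and the calibration gap. GCPO combines geometry-aware measures that capture semantic disagreement with reward-based calibration that aligns uncertainty with learning-signal strength, so that the signal better characterizes gradient variance. Experiments show that the method tracks gradient variability more faithfully and consistently improves post-training performance across multiple benchmarks, highlighting the importance of uncertainty signals aligned with optimization dynamics.

Link: https://arxiv.org/abs/2605.21801
Authors: Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye
Affiliations: University of Notre Dame
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: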

Abstract:Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable, and how they influence optimization dynamics is unclear. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.

[NLP-77] MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue LREC2026

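[Quick read]: This paper addresses conversational referential communication in dynamic 3D environments, where existing vision-language models (VLMs) struggle to resolve the ambiguous expressions that arise in multi-turn dialogue. The key contributions are a benchmark built from 6.7 hours of egocentric VR interaction (with synchronized speech, motion, gaze, and 3D scene geometry) and a two-stage grounding pipeline that first resolves ambiguity through contextual rewriting and then performs visual localization. The approach improves grounding accuracy on referring expressions by 11-22 percentage points on average; on pronominal expressions, a pure detector (GroundingDINO) reaches 56.7% after rewriting, nearly double the best end-to-end baseline, showing that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches.

Link: https://arxiv.org/abs/2605.21796
Authors: Anna Deichler, Jim O’Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis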

Abstract:Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.

[NLP-78] Residual Skill Optimization for Text-to-SQL Ensembles

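[Quick read]: This paper addresses insufficient diversity among the candidate SQL queries in Text-to-SQL ensembles, which caps their performance ceiling (the Pass@K metric). Existing methods introduce diversity through stochastic decoding or prompt variants, but the resulting candidate sets tend to be dominated by correlated failures. The key to the solution is DivSkill-SQL, a residual skill optimization framework that needs no model fine-tuning: each new skill is optimized on the examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, the method improves over the strongest ensemble baseline by up to +11.1 points (Snowflake) and +8.3 points (BigQuery), with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills transfer across dialects and to a different task formulation (such as BIRD-Critic), and hallucinated schema references and function calls drop by up to 3x, indicating that the gains come from genuinely complementary, reliable skills rather than surface-form variation.

Link: https://arxiv.org/abs/2605.21792
Authors: Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi
Affiliations: University of California, San Diego; Snowflake AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: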

Abstract:Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
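
The Pass@K bound and the residual idea described in this abstract can be illustrated with a minimal sketch. It is not the DivSkill-SQL implementation: the correctness table, the skill names, and both helper functions are hypothetical, and the sketch only shows how an ensemble's empirical Pass@K and its residual (still-unsolved) examples could be computed.

```python
from typing import Dict, List


def ensemble_pass_rate(correct: Dict[str, List[bool]]) -> float:
    """Fraction of examples solved by at least one skill.

    This is the empirical Pass@K when each of the K skills contributes
    exactly one candidate per example.
    """
    solved = [any(row) for row in zip(*correct.values())]
    return sum(solved) / len(solved)


def residual_examples(correct: Dict[str, List[bool]]) -> List[int]:
    """Indices of examples that every current skill fails on."""
    return [i for i, row in enumerate(zip(*correct.values())) if not any(row)]


# Hypothetical correctness table: one row per skill, one column per example.
correct = {
    "skill_a": [True, True, False, False, False],
    "skill_b": [True, False, True, False, False],
}
print(ensemble_pass_rate(correct))  # 0.6
print(residual_examples(correct))   # [3, 4]

# A new skill only raises Pass@K if it solves a residual example.
correct["skill_c"] = [False, False, False, True, False]
print(ensemble_pass_rate(correct))  # 0.8
```

In this toy table a third skill that repeated the successes of `skill_a` would leave Pass@K unchanged, which is why targeting the residual set matters.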

[NLP-79] Reflective Prompt Tuning through Language Model Function-Calling

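[Quick read]: This paper addresses problems with prompt engineering for large language models (LLMs): manual design is costly, prompts are sensitive to formatting and phrasing, and systematic error patterns are hard to identify and fix. The key to the solution is Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of a human prompt engineer. A diagnostic function evaluates the target model over the entire optimization set and produces a structured report of failure modes, and the optimizer combines this report with a memory of prior reports to make targeted prompt revisions. RPT also supports confidence-aware optimization, using calibration signals in the diagnostic feedback and in final prompt selection. Experiments show that RPT is especially effective on multi-hop and mathematical reasoning, improving both task performance and confidence calibration.

Link: https://arxiv.org/abs/2605.21781
Authors: Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka
Affiliations: Megagon Labs
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 6 figures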

Abstract:Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.

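[NLP-80] PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

[Quick read]: This paper addresses the fact that estimating mutual information (MI) from text usually requires training a task-specific critic, which limits its use in low-data settings. The key idea is to use large language models (LLMs) for zero-shot pointwise mutual information (PMI) estimation, relying only on prompts and elicited probabilities. The proposed method, PromptNCE, frames conditional probability estimation as a contrastive task and explicitly adds an OTHER category to the candidate set; in theory this recovers the true conditional probability $P(y|x)$ rather than merely a ranking over the listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method among the five information-theoretic prompting-based estimators evaluated on three public datasets, reaching a Spearman correlation of up to 0.82 with human-derived PMI, and a case study in computer science education shows how it can score student knowledge summaries in a low-data setting.

Link: https://arxiv.org/abs/2605.21776
Authors: Juliette Woodrow, Chris Piech
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments: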

Abstract:Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.
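
To make the role of the OTHER category concrete, here is a small sketch under stated assumptions. It is not the PromptNCE prompt or estimator: the elicited scores, the marginal P(y), and the helper names are invented for illustration. It only shows that normalizing over listed candidates alone inflates P(y | x), and how a PMI value follows from the standard definition log(P(y | x) / P(y)).

```python
import math
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}


def pmi(p_y_given_x: float, p_y: float) -> float:
    """Pointwise mutual information: log P(y | x) - log P(y)."""
    return math.log(p_y_given_x / p_y)


# Hypothetical probabilities elicited from a model for one context x.
elicited = {"a": 0.2, "b": 0.1, "OTHER": 0.7}

with_other = normalize(elicited)["a"]
listed_only = normalize({k: v for k, v in elicited.items() if k != "OTHER"})["a"]

print(round(with_other, 3))   # 0.2   (mass outside the list is accounted for)
print(round(listed_only, 3))  # 0.667 (a ranking over the list, not P(y | x))

# Assumed marginal P(y = "a") = 0.05, purely for illustration.
print(round(pmi(with_other, 0.05), 3))  # 1.386
```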

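[NLP-81] RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

[Quick read]: This paper addresses how to evaluate LLM-as-a-judge systems in multi-turn conversation settings. Existing LLM-as-a-judge benchmarks focus mainly on simple Q&A tasks that do not match the complexity of real multi-turn conversations. The key to the solution is RankJudge, a benchmark generator for multi-turn conversations grounded in reference documents. It constructs pairs of conversations in which one conversation has a single flaw injected into one turn, which allows unambiguous better/worse labels, isolates failure categories to individual turns, and enables a strict joint correctness criterion. The authors implement it in three domains (machine learning, biomedicine, and finance), evaluate 21 frontier LLM judges, and rank them with the Bradley-Terry model. Difficulty ratings for each conversation pair are used to dynamically curate the evaluation slice and reduce label noise, and judge rankings remain stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.

Link: https://arxiv.org/abs/2605.21748
Authors: Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell
Affiliations: Layer 6 AI; University of Toronto
Categories: Computation and Language (cs.CL)
Comments: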

Abstract:As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
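
For readers unfamiliar with the Bradley-Terry model mentioned in this abstract, the following is a generic sketch of fitting it with the classic minorization-maximization update. How RankJudge turns judge verdicts into pairwise outcomes is not specified here, so the win counts, the item names, and `fit_bradley_terry` are hypothetical.

```python
from typing import Dict, Tuple


def fit_bradley_terry(
    wins: Dict[Tuple[str, str], int], iters: int = 200
) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths; wins[(a, b)] counts a beating b."""
    items = sorted({name for pair in wins for name in pair})
    strength = {i: 1.0 for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                # Number of comparisons between i and j, in either direction.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        # Strengths are only defined up to scale, so renormalize to sum to 1.
        norm = sum(updated.values())
        strength = {i: s / norm for i, s in updated.items()}
    return strength


# Hypothetical balanced round-robin between three items.
wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
strengths = fit_bradley_terry(wins)
print(sorted(strengths, key=strengths.get, reverse=True))  # ['A', 'B', 'C']
```

Under the model, the probability that item i beats item j is strength[i] / (strength[i] + strength[j]), so the fitted values give a ranking and a notion of margin.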

[NLP-82] BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

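[Quick read]: This paper addresses a key challenge in image captioning evaluation: existing metrics fall short on long-form, context-rich captions. LLM-as-a-judge evaluation is computationally expensive, while CLIP-based encoders suffer from strict token limits, a lack of fine-grained sensitivity, and weak compositional generalization because they treat captions as "bags of words". The key to the solution is a learned metric based on a lightweight cross-encoder initialized from a visual question-answering (VQA) model checkpoint, balancing strong initialization with computational efficiency. A carefully assembled training data mixture with adversarial LLM-based data augmentation increases sensitivity to fine-grained visual-linguistic errors, and a new benchmark assesses detailed captioning evaluation across diverse scenarios. Experiments show state-of-the-art performance while keeping the efficiency needed for large-scale benchmarking, quality-aware decoding, and reward guidance.

Link: https://arxiv.org/abs/2605.21728
Authors: Gonçalo Gomes, Bruno Martins, Chrysoula Zerva
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: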

Abstract:Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as “bags-of-words.” We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

[NLP-83] Probabilistic Attribution For Large Language Models

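[Quick read]: This paper addresses the lack of interpretability in text generated by large language models (LLMs), and in particular how to quantify each token's contribution to the final response. The key idea is to treat LLMs as stochastic processes and use Bayes' rule to invert the next-token log-probabilities, yielding a probabilistic token attribution measure that is independent of the model's computational structure. The attribution score is the log of the ratio between the conditional probability of the response given the prompt and the conditional probability of the response given the prompt with one token marginalized away. Combined with the entropy of each prompt token's distribution conditioned on the remaining context, it reveals uncertainty, sensitivity, and stability in model behavior. The method improves interpretability and helps users focus on uncertain or unstable parts of the generation.

Link: https://arxiv.org/abs/2605.21726
Authors: Shilpika Shilpika, Carlo Graziani, Bethany Lusch, Venkatram Vishwanath, Michael E. Papka
Affiliations: Argonne Leadership Computing Facility, Argonne National Laboratory; Mathematics and Computer Science Division, Argonne National Laboratory; Department of Computer Science, University of Illinois Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 29 pages, 13 figures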

Abstract:The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes’ rule to invert the next-token log-probabilities so as to capture the model’s internal representation of the distribution over token sequences. The representation is independent of the model’s computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompt’s token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
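
The log-ratio score described in this abstract can be illustrated with toy numbers. The sketch below is not the authors' procedure for obtaining these probabilities from a model: the token distribution, the per-token response probabilities, and all function names are assumptions. It only applies the law of total probability to marginalize one prompt position and then takes the log ratio.

```python
import math
from typing import Dict


def marginalized_response_prob(
    token_prior: Dict[str, float], response_prob_given_token: Dict[str, float]
) -> float:
    """P(response | rest) = sum over v of P(v | rest) * P(response | rest, v)."""
    return sum(p * response_prob_given_token[v] for v, p in token_prior.items())


def attribution_score(p_with_token: float, p_marginalized: float) -> float:
    """Log ratio of the response probability with and without the token."""
    return math.log(p_with_token / p_marginalized)


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


# Hypothetical distribution over one prompt position given the other tokens.
token_prior = {"Paris": 0.5, "Rome": 0.3, "Oslo": 0.2}
# Hypothetical probability of the observed response for each filler token.
response_prob = {"Paris": 0.9, "Rome": 0.1, "Oslo": 0.1}

p_marg = marginalized_response_prob(token_prior, response_prob)
score = attribution_score(response_prob["Paris"], p_marg)
print(round(p_marg, 3))                # 0.5
print(round(score, 3))                 # 0.588
print(round(entropy(token_prior), 3))  # 1.03
```

A positive score means the actual token makes the response more likely than an average filler would; a high-entropy position with a large score would be one where the response is sensitive to an uncertain token.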

[NLP-84] Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

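[Quick read]: This paper addresses how to tell whether a peer review was written by a human or generated by an AI model. The authors argue that textual features alone are not enough, and that authorship should also be judged from the ideas, judgments, and claims a review expresses. The proposed method, Sem-Detect, combines textual features with claim-level semantic analysis. It compares a target review against multiple AI-generated reviews of the same paper, exploiting the observation that different AI models tend to converge on similar points while human reviewers raise more unique and diverse ones. This lets it separate fully AI-generated reviews from authentic human reviews, including reviews refined with a large language model (LLM) that still reflect human judgment. On a dataset of over 20,000 reviews from ICLR and NeurIPS, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting; in the three-class setting, LLM-refined human reviews retain distinct human semantic signals, and fewer than 3.5% of them are misclassified as AI-generated.

Link: https://arxiv.org/abs/2605.21713
Authors: André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: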

Abstract:How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.
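
The TPR@0.1% FPR figure quoted above is a detection metric: the true-positive rate at a threshold that keeps the false-positive rate at or below the target. A minimal sketch of one common way to compute it follows; the scores are made up, `tpr_at_fpr` is a hypothetical helper, and the paper's exact evaluation protocol may differ.

```python
import math
from typing import List


def tpr_at_fpr(
    pos_scores: List[float], neg_scores: List[float], max_fpr: float
) -> float:
    """True-positive rate at the loosest threshold with FPR <= max_fpr.

    A higher score means "more likely positive" (here: AI-generated), and a
    sample is flagged when its score is strictly above the threshold.
    """
    # How many negatives (human reviews) may be flagged by mistake.
    allowed_fp = math.floor(max_fpr * len(neg_scores) + 1e-9)
    ranked_neg = sorted(neg_scores, reverse=True)
    if allowed_fp >= len(ranked_neg):
        threshold = -math.inf
    else:
        threshold = ranked_neg[allowed_fp]
    true_pos = sum(score > threshold for score in pos_scores)
    return true_pos / len(pos_scores)


# Hypothetical detector scores.
ai_scores = [0.95, 0.8, 0.5, 0.35]
human_scores = [0.1, 0.2, 0.3, 0.4, 0.9]

print(tpr_at_fpr(ai_scores, human_scores, max_fpr=0.2))  # 0.75
print(tpr_at_fpr(ai_scores, human_scores, max_fpr=0.0))  # 0.25
```

At a target as strict as 0.1%, at least 1,000 negative examples are needed before even one false positive is allowed, which is why such metrics are reported on large datasets.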

[NLP-85] Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

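[Quick read]: This paper addresses unequal data access and high technical barriers in transportation safety analysis: local agencies, school committees, and residents have safety concerns but lack the capacity to retrieve, filter, map, and analyze the data, which keeps them out of safety planning. The key to the solution is a schema-grounded natural language interface in which a large language model (LLM) interprets user intent. Queries are translated into structured semantic frames, validated by a rule-based layer, compiled into a typed directed acyclic graph of spatial operations, and executed against a PostGIS database to produce deterministic, reproducible results. Separating language interpretation from deterministic execution lowers the barrier to use while keeping results reliable and reviewable. In the evaluation, all queries executed successfully and the validation layer corrected errors in 29% of them, suggesting a practical path toward trustworthy AI for public-sector transportation safety work.

Link: https://arxiv.org/abs/2605.21712
Authors: Mahdi Azhdari, Eric J. Gonzales
Affiliations: University of Massachusetts
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 5 figures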

Abstract:Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites create a gap between analytical tools central to safety planning and the practitioners able to use them. Local agencies, school committees, and residents may have safety concerns but limited capacity to retrieve, filter, map, and analyze relevant data. Generative AI offers a way to narrow this divide, but its public-sector use raises questions about reliability, reproducibility, and governance. This paper presents a schema-grounded natural language interface for transportation safety analysis, using a large language model (LLM) to interpret user intent while preserving deterministic, reviewable execution against an authoritative database. User queries are translated into structured semantic frames, validated by a rule-based layer, compiled into a typed directed acyclic graph of spatial operations, and executed against a PostGIS database. This bounded design separates language interpretation from deterministic execution, keeping results reproducible and schema-grounded while removing access barriers. The framework is evaluated using a statewide Massachusetts transportation safety database integrating crash records, roadway attributes, and geospatial layers including schools, bus stops, crosswalks, and municipal boundaries. All queries executed successfully; the validation layer corrects errors in 29% of evaluation queries, reflecting the gap between flexible natural language and strict schema-grounded requirements. The results suggest that combining natural language accessibility with deterministic execution is a practical direction for broadening access to transportation safety data, with implications for trustworthy AI in public-sector planning.

[NLP-86] X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

【速读】：该论文旨在解决跨分词器知识蒸馏（cross-tokenizer knowledge distillation）中因词汇表不兼容导致的性能下降问题，尤其是现有基于logits的方法在处理全输出分布时存在的两个关键缺陷：一是“非常用词失败”（uncommon-token failure），即关键token因分词不匹配被排除训练，造成性能显著下滑；二是过度保守的匹配策略，限制了表面形式不同但语义相近token的有效对齐。解决方案的核心在于提出X-Token方法，通过两种互补的损失函数实现：P-KL损失消除严格分组，利用从分词规则初始化的稀疏投影矩阵W将学生模型分布与教师分布对齐，以缓解非常用词失效问题；H-KL损失保留混合结构，但放宽一对一匹配约束，允许每个学生token与其在W下的最优教师映射对齐，从而提升近似等价token的匹配灵活性。二者共享同一投影矩阵W，可自然扩展至多教师场景，实验表明在Llama-3.2-1B上优于当前最优方法GOLD，使用Qwen3-4B教师提升+3.82点，Phi-4-Mini教师提升+0.5点，且双教师配置进一步带来+1.3点增益。

Link: https://arxiv.org/abs/2605.21699
Authors: Sharath Turuvekere Sreenivas,Adithyakrishna Venkatesh Hanasoge,Mingyu Yang,Ali Taghibakhshi,Saurav Muralidharan,Ashwath Aithal,Pavlo Molchanov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full ‘dark knowledge’ in the teacher’s distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama’s 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student’s distribution with the teacher’s via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
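
The P-KL idea described above can be illustrated with a toy sketch. It assumes W maps each student token to teacher tokens with weights summing to 1, and that the loss is KL(teacher || projected student); the direction of the projection, the direction of the KL, and the toy vocabularies are all assumptions made for illustration, not details confirmed by the abstract.

```python
import math

def project(student_probs, W):
    """Push student probability mass into the teacher vocabulary via a sparse W.

    W is a hypothetical dict-of-dicts: W[student_token][teacher_token] = weight.
    """
    projected = {}
    for s_tok, p in student_probs.items():
        for t_tok, w in W.get(s_tok, {}).items():
            projected[t_tok] = projected.get(t_tok, 0.0) + p * w
    return projected

def kl_divergence(teacher_probs, projected, eps=1e-12):
    """KL(teacher || projected student); zero-probability teacher tokens are skipped."""
    total = 0.0
    for t_tok, p in teacher_probs.items():
        if p > 0.0:
            total += p * math.log(p / max(projected.get(t_tok, 0.0), eps))
    return total

# Toy vocabularies: the student has two surface forms for one teacher token.
W = {"cat": {"cat": 1.0}, " cat": {"cat": 1.0}, "dog": {"dog": 1.0}}
teacher = {"cat": 0.6, "dog": 0.4}
aligned_student = {"cat": 0.3, " cat": 0.3, "dog": 0.4}
misaligned_student = {"cat": 0.1, " cat": 0.1, "dog": 0.8}

aligned_loss = kl_divergence(teacher, project(aligned_student, W))
misaligned_loss = kl_divergence(teacher, project(misaligned_student, W))
```

Because no token is placed in an "unmatched" partition, every student token with a row in W contributes to the loss, which is the property the abstract attributes to removing the partition.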

[NLP-87] Value-Gradient Hypothesis of RL for LLMs

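Quick read: This paper asks why critic-free reinforcement learning methods (such as PPO and GRPO) work well when fine-tuning pretrained language models (LLMs), and under what conditions they yield the largest gains. Its answer is a value-gradient perspective on critic-free RL updates. First, under a differentiable rollout and an additive-noise parameterization, the policy update is shown to be value-gradient-like in expectation. Second, for discrete Transformer policies, automatic differentiation through attention produces costates that approximate the value signal, with an error controlled by the sampling gap and the policy entropy. This framework decomposes the impact of RL into value-gradient signal strength and reachable reward headroom, which gives a criterion for when RL is most effective.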

Link: https://arxiv.org/abs/2605.21654
Authors: Arip Asadulaev,Daniil Ognev,Karim Salta,Martin Takac
Affiliations: MBZUAI; Independent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

[NLP-88] Amplifying Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

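Quick read: This paper asks whether current AI text detectors actually construct an AI-vs-human boundary or merely amplify a typicality axis that already exists in the pretrained model. It finds that the detectors do not discriminate by learning features that separate AI text from human text; instead they exploit a centroid-based typicality difference already present in the raw encoder, that is, the difference between how AI and human texts are distributed in embedding space. The key observation is that a raw encoder with no task supervision, projected onto the vector between the AI centroid and the human-text centroid (HC3), approaches or even exceeds the discrimination of fine-tuned detectors (AUROC 0.806 to 0.944) without additional training. The mechanism is also interpretable and generalizes: a closed-form Jacobian predictor parameterizes interventions along this axis and substantially improves detector performance (for example, ELECTRA-CE TPR rises from 0.000 to 0.904 at FPR = 1%), and three operationally distinct probes agree at cos 0.74 to 1.00 across architectures. The results indicate that AI text detection amounts to sensitive amplification of typicality differences rather than recognition of specific "AI features".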

Link: https://arxiv.org/abs/2605.21653
Authors: Alexander Smirnov
Affiliations: University College London
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) – a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes – text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor – agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| = 0.0081), and = 97% of the LoRA-full-FT bias gap on ELECTRA is calibration shift, not learned representation – the central claim’s prediction confirmed.
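
The raw-encoder projection described above can be sketched as follows. The two-dimensional "embeddings" are made-up toy values, not data from the paper, and the helper names are hypothetical; the sketch only shows the mechanics of projecting onto a centroid difference and scoring the result with AUROC.

```python
def centroid(vectors):
    """Mean vector of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def project(vectors, direction):
    """Dot product of each vector with the given direction."""
    return [sum(v_i * d_i for v_i, d_i in zip(v, direction)) for v in vectors]

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy embeddings standing in for raw-encoder outputs (illustrative values only).
ai_texts = [[1.0, 0.2], [0.8, 0.1], [0.9, 0.4]]
human_texts = [[0.1, 0.3], [0.3, 0.2], [0.2, 0.5]]

# Typicality axis: centroid(AI) minus centroid(human), with no training involved.
axis = [a - h for a, h in zip(centroid(ai_texts), centroid(human_texts))]
score = auroc(project(ai_texts, axis), project(human_texts, axis))
```

An AUROC well below 0.5 on a population, as the abstract reports for ESL writing, would mean the same axis ranks that population in the opposite order.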

[NLP-89] EntmaxKV: Support-Aware Decoding for Entmax Attention

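Quick read: This paper targets the KV-cache memory-access bottleneck in long-context generation: the cache grows linearly with context length and every generated token must attend over all of it, so memory traffic rises sharply. Existing sparse decoding methods reduce traffic by selecting a subset of tokens or pages, but they are designed for softmax attention, whose dense tails mean any truncation discards nonzero probability mass and introduces error. The proposed EntmaxKV framework exploits the exact zeros produced by α-entmax, turning sparse decoding from dense-tail approximation into support recovery: as long as the candidate set contains the entmax support, decoding remains exact. Its key contribution is exploiting sparsity before KV pages are loaded, by combining query-aware page scoring, support-aware candidate selection, and sparse entmax attention. The authors define the truncation error through δ (the dropped probability mass), show that output error is controlled by δ and vanishes when the support is fully recovered, and design a Gaussian-aware entmax selector that estimates the threshold from lightweight page statistics to adapt the budget. Experiments show that, at matched KV budgets, EntmaxKV drops less probability mass, retains more support tokens, and has lower output error than softmax-based sparse decoding. On long-context and language modeling tasks it closely matches full-cache entmax while using a small fraction of the KV cache, with speedups over full-attention baselines of up to 5.43× (entmax) and 3.36× (softmax).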

Link: https://arxiv.org/abs/2605.21649
Authors: Gonçalo Duarte,Miguel Couceiro,Marcos V. Treviso
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, α-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass δ, showing that output error is controlled by δ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to 3.36× (softmax) and 5.43× (entmax) speedup over full attention baselines at 1M context length. Code available at: this https URL.
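
The support-recovery claim above can be checked on a toy example. This sketch uses sparsemax, the α = 2 member of the α-entmax family, as a stand-in; the scores and candidate sets are invented for illustration, and the paper's page scoring and Gaussian-aware selector are not reproduced here.

```python
def sparsemax(scores):
    """Sparsemax: Euclidean projection of the scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z_k in enumerate(ordered, start=1):
        cumulative += z_k
        # The support is the largest k for which this condition holds.
        if 1.0 + k * z_k > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

def dropped_mass(full_probs, candidate_idx):
    """Probability mass (delta) assigned to tokens outside the candidate set."""
    keep = set(candidate_idx)
    return sum(p for i, p in enumerate(full_probs) if i not in keep)

scores = [2.0, 1.5, 0.1, -1.0, 1.8]      # toy attention logits for one query
full = sparsemax(scores)                  # tokens 2 and 3 receive exactly zero

candidates = [0, 1, 2, 4]                 # contains the full support {0, 1, 4}
sparse = sparsemax([scores[i] for i in candidates])

delta_exact = dropped_mass(full, candidates)   # no mass dropped
delta_lossy = dropped_mass(full, [0, 4])       # token 1 is in the support but missed
```

When the candidate set covers the support, the restricted probabilities equal the full ones and δ is zero; with softmax, every omitted token would carry some nonzero mass.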

[NLP-90] Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly CVPR2026

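Quick read: This paper addresses the inability of existing video understanding benchmarks to evaluate fine-grained spatio-temporal reasoning, in particular the lack of effective evaluation for the step-level understanding required in practical settings such as furniture assembly. It proposes a new benchmark, Flat-Pack Bench, centered on furniture assembly tasks. Using multiple-choice questions paired with visual prompts, the benchmark evaluates large vision-language models (LVLMs) on temporal ordering of assembly actions, temporal localization of assembly state, understanding of part mating, and tracking. It thereby systematically exposes current models' limitations in using temporal information from video, understanding spatial interactions such as physical contact, and tracking.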

Link: https://arxiv.org/abs/2605.21625
Authors: Aditya Chetan,Eric Cai,Peeyush Kushwaha,Bharath Raj Nagoor Kani,Utkarsh Mall,Qianqian Wang,Noah Snavely,Bharath Hariharan
Affiliations: Cornell University; Cornell Tech; MBZUAI; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: CVPR 2026

Abstract:The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.

[NLP-91] CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

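Quick read: This paper addresses the lack of safety mechanisms designed for the developmental characteristics of adolescents as large language models (LLMs) are deployed in adolescent digital environments. Existing methods are mostly grounded in adult-centric safety norms and rely on refusal-style suppression, which leads to conversational dead-ends and missing guidance and overlooks adolescents' distinctive cognitive and emotional vulnerabilities. The proposed solution is Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that combines lightweight risk detection with domain-conditioned rewriting to reconstruct unsafe or refusal-style outputs into developmentally appropriate, guidance-oriented responses. It preserves the original benign intent while removing risk-amplifying content, reducing unnecessary conversational shutdown, and introducing developmentally appropriate guidance. Experiments show that this selective response reconstruction substantially reduces unsafe and refusal-oriented outcomes while avoiding over-intervention on acceptable interactions, offering a more human-centered safety paradigm for adolescent-facing LLM systems.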

Link: https://arxiv.org/abs/2605.21609
Authors: Heajun An,Qi Zhang,Vedanth Achanta,Jin-Hee Cho
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression. While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem. To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent. CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance. Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.

[NLP-92] From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment ICML26

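Quick read: This paper aims to reduce the high data and compute overhead of adapting large language models (LLMs) to specialized domains. Prior work usually treats data selection and parameter-efficient fine-tuning as independent processes, whereas the authors' empirical analysis suggests the two are intrinsically coupled. The core of the solution is the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation and acts as keys that unlock specific data patterns. Building on this, the authors design From Parameters to Data (P2D), a unified framework that uses these task-sensitive attention heads as a dual guide for both sample mining and structural pruning. The framework introduces the Alignment Efficiency Ratio (AER) to quantify total pipeline cost, identifies critical heads through a lightweight proxy, and then uses them to select high-affinity data. Experiments show that, updating only 10% of attention heads on 10% of the data, P2D improves performance by 8.3 percentage points over strong baselines and delivers a 7.0x end-to-end time speedup.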

Link: https://arxiv.org/abs/2605.21558
Authors: Hao Chen,Qi Zhang,Liyao Li,Zhanming Shen,Wentao Ye,Lirong Gao,Ningtao Wang,Xing Fu,Xiaoyu Shen,Junbo Zhao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted@ICML26, 28 pages, 11 figures, 26 tables

Abstract:Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

[NLP-93] Detecting Synthetic Political Narratives in Cross-Platform Social Media Discourse

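Quick read: This paper addresses the difficulty of detecting synthetic political narratives, enabled by large language models (LLMs), as they spread across multiple platforms. The core challenge is identifying political messaging disseminated at scale through semantic coordination, temporal synchronization, and repeated rhetorical strategies, which no single indicator captures. The proposed solution is a cross-platform Synthetic Narrative Coordination Score, SNC(c), built from four complementary coordination signals: lexical diversity D(c), temporal burstiness B(c), rhetorical repetition R(c), and semantic homogenization H(c). The empirical study shows that SNC(c) gives a more robust and interpretable signal than any individual metric. It separates a channel with high semantic homogenization but a low composite score (the Russian-language channel Rybar) from the channel that ranks highest on coordination in most event windows (IntelSlava), which demonstrates the value of combining multiple indicators.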

Link: https://arxiv.org/abs/2605.21540
Authors: Despoina Antonakaki,Sotiris Ioannidis
Affiliations: University of Crete; Institute of Computer Science, Foundation for Research and Technology - Hellas
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:The proliferation of large language models has introduced a new paradigm of synthetic political communication in which narratives may be generated, semantically coordinated, and strategically disseminated across platforms at scale. We present a cross-platform framework for detecting synthetic political narratives using four coordination signals – lexical diversity D(c), temporal burstiness B(c), rhetorical repetition R(c), and semantic homogenization H(c) – combined into a Synthetic Narrative Coordination Score SNC(c). We apply the framework to a corpus of 353,223 records spanning six geopolitical event windows collected from six Telegram channels and nine Reddit communities (2023–2026). Results show that IntelSlava exhibits the lowest lexical diversity (MATTR 0.52–0.54), the highest burstiness (B=+0.48 to +0.73), and the highest rhetorical overlap with peer channels (Jaccard 0.12), ranking first in the composite SNC(c) on four of six event windows (SNC 0.45–0.60). Rybar ranks last on all windows despite its high semantic homogenization, because its Russian-language output yields high lexical diversity and near-zero rhetorical Jaccard with English-language channels – demonstrating that no single indicator is sufficient for coordination detection. Multi-dimensional SNC(c) scoring provides a more robust and interpretable signal than any individual metric.
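
Three of the signals named above (MATTR, burstiness, and Jaccard overlap) can be sketched in a few lines. The abstract does not give the exact formulas, so this sketch assumes the common definitions: a moving-average type-token ratio, the burstiness coefficient (σ - μ) / (σ + μ) over inter-post intervals, and set Jaccard similarity. The window size and all inputs are illustrative, and the weighting that combines the signals into SNC(c) is not reproduced.

```python
import statistics

def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean share of unique tokens per window."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def burstiness(intervals):
    """(sigma - mu) / (sigma + mu): -1 for perfectly regular posting, near +1 for bursty."""
    mu = statistics.mean(intervals)
    sigma = statistics.pstdev(intervals)
    if mu + sigma == 0:
        return 0.0
    return (sigma - mu) / (sigma + mu)

def jaccard(a, b):
    """Overlap between two collections of phrases, as |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

repetitive = "a b a b a b".split()
low_diversity = mattr(repetitive, window=4)          # every window has 2 types in 4 tokens
regular_posting = burstiness([5, 5, 5])              # identical gaps between posts
phrase_overlap = jaccard(["x", "y"], ["y", "z"])     # one shared phrase out of three
```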

[NLP-94] HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine ALT

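Quick read: This paper addresses the fact that frontier language models are being deployed into clinical workflows faster than the infrastructure for evaluating them safely is developing. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. The proposed solution is HealthCraft, presented as the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions. It is built on a FHIR R4 world state (14 entity types and 3,987 seed entities), exposes 24 MCP tools, and uses a dual-layer rubric that zeroes the reward as soon as any safety-critical criterion is violated, so that trajectory-level safety is modeled and measured explicitly. The environment contains 195 tasks (extended to 205) graded against 2,255 binary criteria, 515 of them safety-critical. A deterministic LLM-judge overlay bounds evaluator noise, and a negative-class smoke pilot shows that the reward signal is gameable and not drop-in training-safe; the authors leave training-reward ablations as future work.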

Link: https://arxiv.org/abs/2605.21496
Authors: Brandon Dent
Affiliations: GOATnote Inc.
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 5 figures, 6 tables. Code, task suite, and Docker bundle: this https URL

Abstract:Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model “looks stronger,” evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
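
The dual-layer rubric described above can be sketched as a gated score. The abstract confirms only the gate (any violated safety-critical criterion zeroes the reward); scoring the remaining layer as the fraction of criteria passed is an assumption, and the `Criterion` class and criterion names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    passed: bool
    safety_critical: bool = False

def dual_layer_reward(criteria):
    """Zero the reward on any safety-critical failure, else score the pass fraction."""
    if not criteria:
        return 0.0
    # Layer 1: a single safety-critical violation overrides everything else.
    if any(c.safety_critical and not c.passed for c in criteria):
        return 0.0
    # Layer 2 (assumed): fraction of all binary criteria that passed.
    return sum(c.passed for c in criteria) / len(criteria)

safe_run = [
    Criterion("orders_ecg", True),
    Criterion("documents_allergies", False),
    Criterion("no_contraindicated_drug", True, safety_critical=True),
]
unsafe_run = [
    Criterion("orders_ecg", True),
    Criterion("documents_allergies", True),
    Criterion("no_contraindicated_drug", False, safety_critical=True),
]

safe_reward = dual_layer_reward(safe_run)
unsafe_reward = dual_layer_reward(unsafe_run)
```

The unsafe run passes more ordinary criteria than the safe one yet scores zero, which is the behavior the gate is meant to enforce.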

[NLP-95] Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation ACL2026

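Quick read: This paper asks how, as generative AI accelerates scientific research, the large number of model-generated research hypotheses can be evaluated and filtered efficiently without time-consuming, resource-intensive experiments. The solution is to define a comparative empirical forecasting task grounded in objective outcomes, and to train small (8B-parameter) language models with supervised fine-tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to predict, before any experiment is run, which of two candidate research ideas will achieve better benchmark performance. Experiments show that the SFT model (77.1% accuracy) outperforms GPT-5 (61.1%), and that the approach is robust and transfers in cross-domain and out-of-distribution tests, supporting small language models as efficient, objective verifiers of research ideas.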

Link: https://arxiv.org/abs/2605.21491
Authors: Srujan P Mule,Aniketh Garikaparthi,Manasi Patwardhan
Affiliations: IISER Pune; TCS Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2026 Findings

Abstract:As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

Information Retrieval

[IR-0] Diversed Model Discovery via Structured Table Discovery

Link: https://arxiv.org/abs/2605.22766
Authors: Zhengyuan Dong,Renée J. Miller
Categories: Information Retrieval (cs.IR)
Comments: 8 pages excluding references. 5 figures

Abstract:Model cards describe model behavior through a mixture of textual descriptions and structured artifacts, including performance, configuration, and dataset tables. Existing model search systems rely predominantly on semantic similarity over text, which can produce homogeneous result sets and limit exploration of alternatives. We argue that model search is inherently comparative: users want models that are task-aligned yet differentiated in measurable ways. We hypothesize that this balance requires retrieval over condensed, high-quality evidence rather than verbose descriptions, and much of that evidence is concentrated in structured tables. We present StructuredSemanticSearch, a table-driven model search framework built on the ModelTables benchmark. Given a query, StructuredSemanticSearch combines a semantic baseline for task alignment with a structure-aware pipeline that discovers query-related model-card tables using table discovery operators such as unionability, joinability, and keyword search. Retrieved tables are mapped back to model cards under a controlled top-k budget, enabling fair comparison between text-based and table-based retrieval. Beyond retrieval, StructuredSemanticSearch adapts table integration to the model-table domain through orientation-aware integration, producing compact integrated views of tables from partially overlapping and sometimes transposed evidence tables. For evaluation, we introduce a nugget-based, auditable protocol that extracts compact evidence items from model cards, matches queries to condition- or intent-specific nuggets, and measures evidence coverage and diversity over retrieved model-card candidate sets. This protocol also provides a scalable path toward approximate, evidence-based labeling in dynamic model lakes. Experiments on 597 model-recommendation queries show improved nugget coverage for the structure-aware pipeline than semantic baseline

[IR-1] One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Link: https://arxiv.org/abs/2605.22544
Authors: Yevhen Kostiuk,Kenneth Enevoldsen
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Instruction embedding models have become common among state-of-the-art models, however are evaluated using a single prompt per task. The single-point evaluation ignores a main problem of the instruction-based approach namely: sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, a total of 990. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can both systematically understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
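
The ranking-instability finding above can be illustrated with a small sketch. The scores, model names, and prompt names are invented for illustration; the helper reports each model's spread across prompts and checks whether a favorable prompt assignment could place it first.

```python
def summarize(scores_by_prompt):
    """Spread of one model's scores across prompts, instead of a single point."""
    values = list(scores_by_prompt.values())
    return {"min": min(values), "mean": sum(values) / len(values), "max": max(values)}

def can_be_ranked_first(scores, model):
    """True if the model's best prompt beats every other model's worst prompt."""
    best = max(scores[model].values())
    return all(
        best > min(other.values())
        for name, other in scores.items()
        if name != model
    )

# Hypothetical task scores for three models under three prompt phrasings.
scores = {
    "model_a": {"p1": 0.70, "p2": 0.62, "p3": 0.66},
    "model_b": {"p1": 0.68, "p2": 0.71, "p3": 0.60},
    "model_c": {"p1": 0.65, "p2": 0.63, "p3": 0.69},
}

promotable = {name: can_be_ranked_first(scores, name) for name in scores}
spread_a = summarize(scores["model_a"])

# Evaluation grid size stated in the abstract: models x datasets x prompts.
total_configurations = 6 * 11 * 15
```

Reporting `min`, `mean`, and `max` alongside the default-prompt score is one simple way to surface the sensitivity the abstract recommends benchmarks report.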

[IR-2] Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Link: https://arxiv.org/abs/2605.22511
Authors: Zihan Liang,Yufei Ma,Ben Chen,Zhipeng Qian,Xuxin Zhang,Huangyu Dai,Lingtao Mao
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with offline self-distillation (OFSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy’s inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches 0.440 average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.

[IR-3] BeLink: Biomedical Entity Linking Meets Generative Re-Ranking SIGIR2026

Link: https://arxiv.org/abs/2605.22501
Authors: Darya Shlyk,Stefano Montanelli,Lawrence Hunter
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to ACM SIGIR 2026

Abstract:Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when applied at the re-ranking stage of the BEL pipeline. We propose a set-wise instruction-tuning formulation that enables fast and accurate candidate selection. Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art. We integrate our generative re-ranker into BeLink, a modular, end-to-end system designed for practical real-world BEL applications.

[IR-4] Integrating Chain-of-Thought into Generative Retrieval: A Preliminary Study KDD2026

Link: https://arxiv.org/abs/2605.22358
Authors: Wenhao Zhang,Ruihao Yu,Yi Bai,Zhumin Chen,Pengjie Ren
Categories: Information Retrieval (cs.IR)
Comments: This work was initially submitted to KDD 2026 in August 2025

Abstract:While generative retrieval (GR) demonstrates competitive performance on standard retrieval benchmarks, existing approaches directly map queries to document identifiers (docids) without intermediate deliberation, limiting their effectiveness for complex queries that require multi-step reasoning. As a preliminary study on integrating chain-of-thought (CoT) into generative retrieval, we introduce ThinkGR, a unified framework that interleaves CoT with docid generation, enabling iterative thinking and retrieval within a single generative process. To bridge the gap between free-form thought generation and structured retrieval targets, we design (1) a hybrid decoding strategy that dynamically switches between unconstrained thought generation and constrained docid decoding, and (2) a two-phase training approach that first aligns thought-retrieval patterns through supervised fine-tuning, then optimizes thought quality via retrieval-grounded reinforcement learning. Experiments on four multi-hop retrieval benchmarks demonstrate that ThinkGR achieves state-of-the-art performance with an average improvement of +6.86%. Our work opens new avenues for enhancing generative retrieval with explicit deliberation capabilities, with promising implications for retrieval tasks requiring complex reasoning.

[IR-5] Direct content-based retrieval from music scores images

Link: https://arxiv.org/abs/2605.22255
Authors: Noelia Luna-Barahona,Antonio Ríos-Vila,David Rizo,Jorge Calvo-Zaragoza
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 17 pages (14 pages + references), 3 figures (with subfigures)

Abstract:The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content based search in music score images remains underexplored compared to text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images, ranging from transcription-based approaches relying on Optical Music Recognition (OMR), to a transcription-free Transformer model trained to recognize queries directly from score images, and a text-prompted Large Language Model. Our experiments evaluate these models on four corpora exhibiting diverse characteristics in terms of dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval, whereas transcription-free models handle domain variability more effectively.

[IR-6] Behavior-Guided Candidate Calibration for Multimodal Recommendation

Link: https://arxiv.org/abs/2605.22073
Authors: Zesheng Li,Chengchang Pan,Honggang Qi
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Multimodal recommendation benefits from content signals, but the gain depends on how those signals interact with the ranking pipeline. We find that moderate cross-view agreement helps, while stronger agreement suppresses recommendation-specific variation. Spectral analysis shows a clear split: low-frequency components capture shared structure, and higher-frequency components preserve more discriminative signal. Based on this finding, we introduce a behavior-guided candidate calibration model that converts training-only co-user overlap into signed candidate evidence and applies it only to the shortlist produced by the multimodal backbone. The backbone keeps the representation space stable; behavior evidence acts only where ranking is decided. Results on Amazon Baby, Sports, and Electronics show consistent gains over strong multimodal baselines. Code is available at this https URL.

[IR-7] From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

Link: https://arxiv.org/abs/2605.22003
Authors: Dip Biswas Shanto,Mitali Yadav,Prajwal Panth,Suresh Chandra Satapathy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 6 pages, 9 figures. This is the author’s accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending

Click to view abstract

Abstract:Sentiment analysis, also referred to as opinion mining, primarily tries to extract opinion from any text-based data. In the context of movie reviews and critics, sentiment analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for the ML models to understand the context or metaphysical sentiment accurately, as ML models rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. Evaluated on accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
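For readers unfamiliar with the term, soft voting averages the class probabilities predicted by each model and picks the class with the highest mean. The sketch below is only an illustration of that idea: the function names and the probability values are hypothetical and are not taken from the paper, which does not describe its ensemble weights or implementation.

```python
from typing import Sequence


def soft_vote(prob_rows: Sequence[Sequence[float]]) -> list[float]:
    """Average per-class probabilities from several models for one example."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    return [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]


def predict(prob_rows: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    avg = soft_vote(prob_rows)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical [P(negative), P(positive)] outputs of three models for one review.
# Two models lean weakly positive, one is strongly negative.
model_probs = [[0.90, 0.10], [0.45, 0.55], [0.45, 0.55]]
avg = soft_vote(model_probs)   # mean probability per class
label = predict(model_probs)   # 0 = negative, 1 = positive
```

In this toy case a majority (hard) vote would pick "positive", while soft voting picks "negative" because the one confident model outweighs two uncertain ones. That sensitivity to confidence is the usual reason for choosing soft over hard voting.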

[IR-8] Generative Conversational Recommender System

Link: https://arxiv.org/abs/2605.21987
Authors: Sixiao Zhang,Mingrui Liu,Cheng Long
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Conversational recommender systems aim to provide personalized recommendations via natural language interactions. However, existing approaches either decouple recommendation from dialog generation or rely on retrieval-based pipelines, limiting the integration between recommendation and response generation and leading to suboptimal modeling of user intent. In this paper, we propose a fully generative conversational recommender system that unifies recommendation and dialog generation within a single autoregressive framework. Our approach represents items as discrete semantic IDs and integrates them directly into the generation process, enabling joint prediction of items and responses via next-token modeling. We further introduce a structured generation paradigm that factorizes conversational recommendation into a sequence of interdependent decisions, where the model first predicts the response intent and the recommendation target, and then generates the response conditioned on them. This design enables end-to-end optimization, enforces a more coherent dependency structure, and supports faithful item generation via constrained decoding. Extensive experiments demonstrate that our method consistently improves recommendation performance, achieving gains of up to 29% on Recall@1 over strong baselines, while maintaining competitive dialog quality.
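The "faithful item generation via constrained decoding" mentioned above is commonly implemented with a prefix trie over valid item IDs, so that the decoder can only emit token sequences that correspond to real items. The following is a minimal sketch of that general technique, not the authors' implementation: the catalog, the toy scoring function, and all names are hypothetical, and it assumes no item ID is a strict prefix of another.

```python
def build_trie(id_sequences):
    """Build a nested-dict prefix trie from valid semantic-ID token sequences."""
    root = {}
    for seq in id_sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(trie, score_fn):
    """Greedily pick the best-scoring token among those the trie allows.

    score_fn(prefix, token) stands in for a language model's next-token score.
    Assumes no valid ID is a strict prefix of another valid ID.
    """
    node, prefix = trie, []
    while node:  # an empty dict marks the end of a complete ID
        best = max(node.keys(), key=lambda tok: score_fn(prefix, tok))
        prefix.append(best)
        node = node[best]
    return tuple(prefix)


# Hypothetical catalog of three items, each a 3-token semantic ID.
catalog = [(1, 4, 2), (1, 4, 7), (3, 0, 5)]
# Toy scores: token 9 scores highest overall but is never a valid continuation.
toy_scores = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.1, 4: 0.5, 5: 0.2, 7: 0.6, 9: 0.9}
item = constrained_greedy_decode(build_trie(catalog), lambda prefix, tok: toy_scores[tok])
```

Because candidates are drawn only from the trie, the decoded sequence is always an existing catalog item, whatever the unconstrained scores would have preferred.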

[IR-9] LLM Retrieval for Stable and Predictable Ad Recommendations SIGIR2026

Link: https://arxiv.org/abs/2605.21969
Authors: Vinodh Kumar Sunkara,Satheeshkumar Karuppusamy,Hangjun Xu,Sai Deepika Regani,Kshitij Gupta,Gaby Nahum,Sneha Iyer,Jean-Baptiste Fiot,Yinglong Guo,Xiaowen Guo,Atul Jangra,Yucheng Liu,Jinghao Yan,Vijay Pappu,Benjamin Schulte,Deepak Chandra
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: SIGIR 2026 AgentSearch Workshop, Melbourne Australia

Click to view abstract

Abstract:Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG). With the hyper-growth of ads inventory and liquidity with generative AI technologies, prediction stability and predictability are becoming increasingly critical. Intuitively, prediction stability and predictability can be defined to quantify system robustness with respect to minor/noisy input (ads, creatives) perturbations, the lack of which could lead to advertiser perceivable problems such as repeatability, cold start and under-exploration. In this paper, we introduce a new evaluation framework for quantifying stability and predictability of an ads recommender system, and present an online validated semantic candidate generation framework powered by fine-tuned Large Language Models (LLMs) that showed significant improvement along these metrics by fundamentally improving the semantic-awareness of the system. The approach extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion, ensuring the retrieved candidates encapsulate semantic variants of an ad, guaranteeing that small creative variants from the advertiser yield consistent and explainable delivery results to the user. We tested this LLM ads retrieval framework in a large-scale industrial ads recommendation system, demonstrating significant improvements across offline and online A/B experiments, showcasing gains in both predictability and traditional performance metrics. Although evaluated in the ads stack, this is a general framework that can be applied broadly to any large-scale recommendation and retrieval systems facing similar scaling and predictability challenges.

[IR-10] Reinforced Preference Optimization for Reasoning-Augmented Recommendations

Link: https://arxiv.org/abs/2605.21967
Authors: Jingtong Gao,Zeyu Song,Chi Lu,Xiaopeng Li,Derong Xu,Maolin Wang,Peng Jiang,Kun Gai,Qingpeng Cai,Xiangyu Zhao
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Recommender systems are critical for delivering personalized content across digital platforms, and recent advances in Large Language Models (LLMs) offer new opportunities to enhance them with richer world knowledge and explicit reasoning capabilities. With the help of reasoning knowledge, recommendations can better infer users’ underlying intents, adapt to evolving preferences, and leverage semantic relationships for improved accuracy and interpretability. However, existing reasoning-based recommendation methods often fail to fully align the LLM’s reasoning process with recommendation-specific objectives due to structural disruption during integration and difficulties in translating free-form generation into accurate item predictions. In this paper, we introduce RPORec, a reinforced preference optimization framework that unifies an LLM backbone’s reasoning ability with a dedicated recommendation head (Rechead) for precise item retrieval. RPORec comprises two stages: (1) Reasoning-Augmented Recommendation Modeling, where high-quality Chain-of-Thought (CoT) reasoning is generated and used as auxiliary knowledge to guide the Rechead in learning recommendation-specific representations; and (2) Advanced Reasoning Refinement and Alignment, in which the trained Rechead produces verifiable rewards to fine-tune the LLM backbone via reinforcement learning, enhancing reasoning quality, structural consistency, and task relevance. Extensive experiments on public benchmarks and large-scale online deployments show that RPORec consistently outperforms state-of-the-art LLM-based recommendation methods, demonstrating the effectiveness of reasoning-augmented recommendation modeling in real-world systems.

[IR-11] Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb

Link: https://arxiv.org/abs/2605.21812
Authors: Wendy Ran Wei,Hao Li,Weiwei Guo,Xiaowei Liu,Xueyin Chen,Dillon Davis,Malay Haldar,Soumyadip Banerjee,Kedar Bellare,Huiji Gao,Stephanie Moyerman,Sanjeev Katariya
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Deploying natural language search systems presents a critical cold-start challenge: no real user queries to learn linguistic patterns, and no relevance labels to train ranking models. We present a framework for generating synthetic queries and labels using large language models (LLMs), powering model training and evaluation for Airbnb’s natural language search. For query generation, we combine contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, enabling a cold-to-warm start transition as real user data becomes available. For label generation, we introduce contrastive generation that produces topicality labels by construction, and Virtual Judge (VJ) labeling for broader coverage. We compare our approach against a no-seed contrastive baseline and an InPars-style baseline. For query length, the InPars baseline produces verbose queries with KL divergence of 12.03 vs. real users; our seed-guided approach achieves 0.66, a 7.5x improvement. For attribute type distributions, our approach achieves the lowest KL divergence (0.04), outperforming even seed queries (0.09). Experiments show our approach produces harder evaluation examples than the no-seed baseline (79% vs. 97% pairwise accuracy), providing discriminative signal for model improvement. We deploy production pipelines generating synthetic examples daily for embedding-based retrieval and ranking evaluation.
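The abstract compares synthetic and real queries by the KL divergence of their query-length distributions. A minimal sketch of that kind of measurement follows. The abstract does not state the direction of the divergence, the binning, or the smoothing, so the choices here (KL from synthetic to real, whitespace token counts capped at `max_len`, additive smoothing `eps`) are assumptions, and the example queries are invented.

```python
import math
from collections import Counter


def length_distribution(queries, max_len=10, eps=1e-6):
    """Smoothed histogram of query lengths in whitespace tokens, capped at max_len."""
    counts = Counter(min(len(q.split()), max_len) for q in queries)
    total = len(queries) + eps * (max_len + 1)
    return [(counts.get(k, 0) + eps) / total for k in range(max_len + 1)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Invented examples: short "real" queries vs. verbose "synthetic" ones.
real = ["beach house", "cabin with hot tub", "pet friendly loft"]
synthetic = [
    "a spacious family friendly beach house with a private pool",
    "quiet mountain cabin with a hot tub and a fireplace",
]
p_real = length_distribution(real)
p_syn = length_distribution(synthetic)
kl_syn_vs_real = kl_divergence(p_syn, p_real)
```

The smoothing term keeps the divergence finite when a length bin is empty in one sample. The resulting value depends on `eps`, so absolute numbers from this sketch are not comparable to those reported in the abstract.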

Human-Computer Interaction

[HC-0] MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

Link: https://arxiv.org/abs/2605.22775
Authors: Amir Mousavi,Mohammad Sadegh Sirjani,Erfan Nourbakhsh,Mimi Xie,Rocky Slavin,Leslie Neely,John Davis,John Quarles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

Click to view abstract

Abstract:Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5W, confirming feasibility for wearable cognitive load monitoring.
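Augmenting a signal with observation masks and time-deltas, as the abstract describes for XMD encoding, can be illustrated on a single channel. The sketch below is not the paper's implementation: the function name, the zero fill for missing samples, and the definition of the delta as time since the last observed sample are all assumptions made for illustration.

```python
def mask_delta_encode(values, timestamps):
    """Encode one channel as (value, mask, delta) triples.

    values:     samples, with None marking a missing reading (e.g. a blink)
    timestamps: sample times in seconds
    mask:       1.0 if the sample was observed, else 0.0
    delta:      time since the last observed sample (0.0 before any observation)
    Missing values are filled with 0.0 here, which is an assumption.
    """
    encoded = []
    last_seen = None
    for v, t in zip(values, timestamps):
        mask = 0.0 if v is None else 1.0
        delta = 0.0 if last_seen is None else t - last_seen
        encoded.append((0.0 if v is None else v, mask, delta))
        if v is not None:
            last_seen = t
    return encoded


# Hypothetical pupil-diameter samples with two dropped readings.
samples = [3.1, None, None, 2.9]
times = [0.0, 0.1, 0.2, 0.3]
features = mask_delta_encode(samples, times)
```

The growing delta tells a downstream sequence model how stale the last real observation is, so a filled-in value is not mistaken for a measurement.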

[HC-1] CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

Link: https://arxiv.org/abs/2605.22774
Authors: Amir Mousavi,Mohammad Sadegh Sirjani,Erfan Nourbakhsh,Mimi Xie,Rocky Slavin,Leslie Neely,John Davis,John Quarles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 7 figures. Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

Click to view abstract

Abstract:Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.

[HC-2] Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Link: https://arxiv.org/abs/2605.22732
Authors: Juergen Dietrich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 13 pages, 1 figure

Click to view abstract

Abstract:We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
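The Spearman rank correlation reported above is the Pearson correlation computed on ranks, with tied values sharing their average rank. A small standard-library sketch of the statistic follows. The per-segment scores are invented for illustration and are not data from the paper, and the p-value computation is omitted.

```python
import math


def average_ranks(xs):
    """Return 1-based ranks of xs, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-segment scores (not from the paper).
valence = [0.2, -0.1, 0.5, 0.0]
pathos = [0.4, 0.1, 0.9, 0.3]
rho = spearman(valence, pathos)
```

Because only the ordering matters, rho is unaffected by any monotonic rescaling of either score, which suits comparisons between models that report valence on different scales.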

[HC-3] Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

Link: https://arxiv.org/abs/2605.22720
Authors: Andrii Kryshtal
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Preprint. 8 pages, 2 figures. Code and evaluation framework: this https URL

Click to view abstract

Abstract:AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse. We tested nine model configurations from four providers (OpenAI, Anthropic, DeepSeek, xAI) on 90 multi-turn scenarios designed to surface misaligned behaviour in conflict contexts: false equivalence between documented atrocities, denial of genocide, and failure to recognise ethnic slurs, among others. When such outputs feed into journalism, humanitarian reporting, or public debate, they can deepen divisions in fragile societies. Failure rates span 6% to 47% between the best and worst performing models, which makes model choice a safety question in its own right. When users pushed for "balance" in cases where international courts have already assigned responsibility, five of nine configurations failed 80 to 100 percent of the time. We release the first evaluation framework for this domain and propose adding it to alignment evaluation portfolios.

[HC-4] AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

Link: https://arxiv.org/abs/2605.22715
Authors: Baiyu Chen,Zechen Li,Wilson Wongso,Lihuan Li,Xiachong Lin,Hao Xue,Benjamin Tag,Flora Salim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: this https URL.

[HC-5] Beyond the Org Chart: AI and the Transformation of Invisible Work

Link: https://arxiv.org/abs/2605.22707
Authors: Stephanie Rosenthal,Shamsi Iqbal
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10 pages

Click to view abstract

Abstract:An increasing number of news and research articles report that AI adoption is allowing professionals to blur and extend the boundaries of their corporate roles. With the goal of understanding how work processes might be changing in an AI-forward company, we interviewed 24 product-focused individuals at a large technology firm about how AI has impacted their own work, their work within their product team, and their professional interactions. Our conversations suggest that AI is not only changing formal role responsibilities and collaborations between those roles, but also changing informal cultural practices like professional mentoring that are key to helping professionals settle in their positions, stay engaged with their work, and grow their careers. Some of these changes are positive, such as smoother collaboration between peers, but other changes are more nuanced and put the typical career growth opportunities, like receiving feedback from professional networks and promoting leadership and mentorship, at risk. We propose steps that AI companies can take to make the invisible work more visible. Additionally, we propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.

[HC-6] The efficiency-gain illusion: People underestimate the rate of AI use and overestimate its benefits on simple tasks

Link: https://arxiv.org/abs/2605.22687
Authors: Sunny Yu,Myra Cheng,Ahmad Jabbar,Ilia Sucholutsky,Katherine M. Collins,Dan Jurafsky,Robert D. Hawkins
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:People are increasingly turning to AI assistance for simple tasks, e.g., arithmetic, spell-check, and answering simple questions. But does AI assistance actually save users time and effort? We investigate people’s propensity to use AI for cognitively simple tasks and assess whether their reliance is well-calibrated. Across three pre-registered user studies (N = 2691), we find that people frequently choose to use AI even when doing so is inefficient (i.e. provides no meaningful time or effort savings). We identify systematic miscalibration at two levels: (1) a self-estimate miscalibration where people on average believe that they are using AI less than they actually are, and (2) efficiency-gain illusions where people overestimate how much time and effort savings AI use affords. We also identify a session-level carryover effect where a participant’s prior AI use leads to further AI adoption and entrenches their miscalibration about time savings. Our results shed light on the mechanisms and biases underlying people’s choice of whether to use AI as well as the risk of an overreliance feedback loop.

[HC-7] Student programming behavior with and without phone notification suppression

Link: https://arxiv.org/abs/2605.22657
Authors: Gavin Eddington,Christopher Warren,Seth Poulsen,John Edwards
Categories: Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Background and Context. Computer programming often involves extended periods of sustained activity and mobile phone notifications introduce frequent opportunities for interruption. Prior work demonstrates that suppressing phone notifications may reduce these disruptions. Objectives. Our primary research question is: How does suppressing phone notifications affect students’ task engagement and productivity while programming? Method. We report on a replication and methodological extension study conducted in a CS1 course involving 22 students. Using a within-subject design, selected programming assignments were randomly designated for enabling notification suppression. Phone state logs were synchronized with millisecond-resolution IDE keystroke data to measure student attention and focus when in the control and notification-suppression conditions. Findings. Assignments completed with notification suppression enabled showed significantly lower break rates and longer intervals of focus compared to assignments completed in the control condition for many, but not all, students. This study provides evidence that notification suppression is associated with measurable differences in programming engagement and behavior. We also find a remarkable bimodality in the effect across students – many students are positively affected, a small number are negatively affected, and very few experience little or no effect. This finding is consistent with other studies in diverse disciplines. Implications. Our results show that, for many students, phone notification suppression tools, such as Do Not Disturb, can improve attention and focus. Implications apply to educational settings (do-not-disturb as an intervention) and scholarship (understanding the effects of phone distraction).

[HC-8] Summarizing Time-Varying Digital Image Correlation Strain Fields Using Sankey Diagrams

Link: https://arxiv.org/abs/2605.22627
Authors: Victor Persson,Christofer Boo,Mohit Sharma,Ingrid Hotz
Categories: Human-Computer Interaction (cs.HC); Materials Science (cond-mat.mtrl-sci)
Comments:

Click to view abstract

Abstract:Digital Image Correlation (DIC) enables dense, time-resolved measurement of surface strain in deforming materials, providing insight into strain localization and failure mechanisms. However, the resulting strain fields are typically explored frame-by-frame through spatial visualizations, making global temporal patterns difficult to discern. We present a visual summarization approach that represents the evolution of high-strain regions as a single Sankey diagram constructed from superlevel sets of the von Mises equivalent strain field. By tracking connected components over time via spatial overlap, the diagram encodes the birth, persistence, merging, and disappearance of strain concentrations. Applied to four tensile test datasets with varying notch geometries, the approach compactly captures differences in deformation regimes and qualitative precursors to failure, complementing traditional spatial strain visualizations with a global temporal overview.

[HC-9] Quantifying Full-Body Immersion

Link: https://arxiv.org/abs/2605.22521
Authors: Alihan Bakir,Ekrem Yüksel,Fabio Zuliani,Neil Chennoufi,Francesco Bruno,Jamie Paik
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: This manuscript is under consideration for possible publication in the Nature. Copyright may be transferred to Nature if the manuscript is accepted for publication, without further notice

Click to view abstract

Abstract:Humanity is at the forefront of yet another digital revolution, where the lines between real and virtual worlds are dissolving, reshaping how we perceive and interact with our surroundings. In this context, we introduce a transformative paradigm for immersive virtual experiences centered around whole-body kinetic interactions. Our approach redefines immersion through three distinct levels: audio-visual immersion, capturing sensory realism; physical immersion, delivering haptic feedback; and full-body immersion (FBI), where dynamic bodily interaction integrates seamlessly with virtual environments. At the core of this innovation lies a scalable, distributable platform based on modular robotic surface units inspired by the adaptive designs of nature. These units enable the rendering of immersive environments at any scale, from intimate personal experiences to expansive multi-user settings, dynamically adapting to interactions in real-time. The modular system distributes force, shape, and motion feedback throughout entire spaces, replicating the physical characteristics of the environment and enabling new depth of engagement through FBI. By combining scalability, adaptability, and dynamic physical engagement, this framework bridges the gap between real and virtual worlds. It offers an unprecedented level of immersion where users can engage their entire bodies in symbiotic interactions with the virtual space. This work not only advances immersive technology but also redefines how humans and virtual environments coexist, setting a foundation for a new era of human-environment synthesis.

[HC-10] Reflecti-Mate: A Conversational Agent for Adaptive Decision-Making Support Through System 1 and System 2 Thinking

Link: https://arxiv.org/abs/2605.22509
Authors: Morita Tarvirdians,Senthil Chandrasegaran,Hayley Hung,Catholijn M. Jonker,Catharine Oertel
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Accepted at UMAP 2026

Click to view abstract

Abstract:Making high-stakes personal decisions involves cognitive, emotional, and intuitive processes, and individuals differ in how they allocate attention across these modes. Integration of these processes has been shown to benefit decision making. Yet, most current decision-support systems focus primarily on supporting cognitive aspects, rather than adapting to the individual’s thinking profile to support integration of different types of thoughts. In this study, we investigate an agent designed to encourage integration by adapting to the individual user’s thought patterns. We explore its effects on participants’ perceptions of the agent and their reflective behavior, in comparison with unaided pre-reflection and a baseline agent. In a between-subjects study (N = 128), our agent, which fostered broad and elaborated thinking, enabled more personalized reflective trajectories, elicited more integrative reflective language, and was perceived as providing stronger support for holistic reflection. In contrast, the baseline agent produced homogenized profiles dominated by cognitive language across participants.

[HC-11] Perceived Safety of Workers in Encounters with Large Industrial AGVs ICRA2026

Link: https://arxiv.org/abs/2605.22461
Authors: Ansgar Howey, Tim Schreiter, Andrey Rudenko, Achim J. Lilienthal
Categories: Human-Computer Interaction (cs.HC)
Comments: IEEE ICRA 2026 Workshop Proceedings: 8th Long-term Human Motion Prediction Workshop (LHMP 2026)

Abstract:Automated Guided Vehicles (AGVs) in factory automation are increasingly capable of moving autonomously in close proximity to human workers. While their physical safety is regulated by standards and directives, perceived safety and workers’ comfort in close-proximity interactions are being actively investigated in studies. There are three limitations in the prior art research to that end. Firstly, AGVs with larger payloads are understudied. Secondly, the test participants are usually students and not working professionals. Thirdly, while conducting in-person experiments with heavy machinery can be dangerous, the transfer of safety perception results from simulated experiments remains open. In this paper, we investigate industrial workers’ perceived safety in shared spaces with large AGVs in a real-world encounter and in virtual reality. We vary the passing distance and the shape of the collision avoidance maneuver, and evaluate perceived threat level using a handheld pressure-sensitive trigger interface and a post-experiment questionnaire. Additionally, we ask participants to set their own collision avoidance parameters based on their experience with the demonstrated trajectory profiles. In a within-subject study, we found that, while the threat levels are perceived overall slightly higher in VR, the passing distance of 1.5 to 2 meters is preferred among the demonstrated profiles, as well as in the self-defined trajectories.

[HC-12] Cross-Subject EEG Emotion Recognition Based on Temporal Asynchronous Alignment Contrastive Learning

Link: https://arxiv.org/abs/2605.22379
Authors: Ying Xie, Yi Zheng, Zehui Xiao, Wenkai Lu, Mengting Liu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures

Abstract:With the advancement of science and technology, the importance of emotion research has become increasingly evident. Electroencephalography (EEG)-based emotion recognition has emerged as an active research area in recent years, owing to its objectivity and high temporal resolution. However, most existing methods focus on optimizing encoder structures to enhance feature extraction capabilities, while paying relatively little attention to similarity calculation strategies, particularly overlooking the potential temporal misalignment of responses among different subjects. To address these shortcomings, this paper draws inspiration from the late interaction mechanism of ColBERT in natural language processing (NLP) and proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework. This method transforms the traditional global “hard alignment” similarity calculation approach into a fine-grained local matching mechanism, enabling the model to adaptively search for and align “locally highly correlated” segments between two EEG signals, thereby effectively mitigating the effects of inter-subject differences and temporal delays. Experimental results demonstrate that the proposed method achieves strong performance across multiple public datasets. Specifically, on the FACED dataset, it achieves an accuracy of 64.5% for the nine-class classification task and 79.5% for the binary classification task, while on the SEED and SEED-V datasets, it achieves accuracies of 86.4% and 70.1%, respectively, validating the method’s effectiveness and generalization capability.
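
The contrast between global "hard alignment" and ColBERT-style late interaction can be illustrated with a minimal sketch. This is not the TA2CL implementation: the segment features, the cosine measure, and the function names (`global_similarity`, `late_interaction_similarity`) are assumptions chosen for illustration, and the score is averaged rather than summed so that it stays in [-1, 1].

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def global_similarity(seg_a, seg_b):
    """'Hard alignment' baseline: segment i of A is compared only with segment i of B."""
    pairs = list(zip(seg_a, seg_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def late_interaction_similarity(seg_a, seg_b):
    """MaxSim-style score: each segment of A is matched to its most similar segment of B."""
    return sum(max(cosine(u, v) for v in seg_b) for u in seg_a) / len(seg_a)

# Toy data (hypothetical): subject B shows the same response pattern one step later.
subject_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
subject_b = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

hard = global_similarity(subject_a, subject_b)
soft = late_interaction_similarity(subject_a, subject_b)
print(f"hard alignment: {hard:.3f}, late interaction: {soft:.3f}")
```

In this toy case the time-shifted pattern is penalized by position-wise comparison but recovered by local matching, which is the kind of temporal misalignment the abstract describes.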

[HC-13] Narrative Sharpens Gender Gaps: Surveying Film Characters with LLM Agents

Link: https://arxiv.org/abs/2605.22091
Authors: Vivienne Bihe Chi, Reyhan Jamalova, Lyle Ungar, Sharath Chandra Guntuku
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Mainstream film is one of the richest sources of cultural content that AI systems learn from. Yet we have few tools for measuring the gender values it encodes. We present a proof-of-concept framework that turns fictional film characters into surveyable LLM agents. Using 160 U.S. films (1990–2019), we build 734 character agents from script dialogue and scene descriptions, condense their personas via expert-style reflections, and simulate World Values Survey gender-attitude responses. Agents reproduce systematic gender differences without explicit demographic prompting, suggesting attitudes emerge from behavior rather than identity labels. Benchmarked against historical survey data, agents exaggerate gender gaps and show greater decade-to-decade volatility than real populations. Narrative sharpens rather than homogenizes gender contrasts, complicating the consistent-input assumption underlying cultivation theory’s mainstreaming mechanism. AI systems trained on such corpora may inherit this stylization before any model-level amplification occurs.

[HC-14] Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction CVPR

Link: https://arxiv.org/abs/2605.21869
Authors: Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10th Affective Behavior Analysis in-the-wild, CVPR Workshop 2026

Abstract:We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy, from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text–audio–vision–motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior can be interesting to study. Our team was placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 for the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
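
The reported metric, an average Pearson correlation over the emotion dimensions, can be computed as in the sketch below. The dictionary layout, the function names, and the toy values are assumptions for illustration; the challenge's official evaluation script may differ in detail.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def mean_pearson(predictions, targets):
    """Average the per-dimension correlations (one list of values per emotion)."""
    return sum(pearson(predictions[k], targets[k]) for k in targets) / len(targets)

# Toy values (hypothetical), two of the six dimensions only.
targets = {"Joy": [1.0, 2.0, 3.0], "Amusement": [1.0, 2.0, 3.0]}
predictions = {"Joy": [2.0, 4.0, 6.0], "Amusement": [3.0, 2.0, 1.0]}
score = mean_pearson(predictions, targets)
print(f"mean Pearson: {score:.3f}")
```

Because each dimension is correlated separately before averaging, a model that tracks one emotion well and inverts another can still score near zero overall, as in this toy case.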

[HC-15] Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

Link: https://arxiv.org/abs/2605.21825
Authors: Haichao Miao, Zhimin Li, Kuangshi Ai, Kaiyuan Tang, Chaoli Wang, Peer-Timo Bremer, Shusen Liu
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a high level description of the tasks, independently designs custom visual analysis applications (VIS apps). This represents an important step towards a general AI co-scientist envisioned by many as an autonomous system that can autonomously execute long horizon tasks based on high-level directions. Our proposed VIS co-scientist is an essential component of this broader AI co-scientist vision: a harness that can autonomously analyze data and design visualization solutions using a collection of agents and specialized skills that coordinate exploratory analysis, plan, configure the environment, implement, validate the interface, and most importantly evaluate the overall task completion. Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement. We validate this approach on IEEE SciVis Contests spanning multiple science and engineering fields. These contests serve as ideal proving grounds because they encode real-world complexity: ambiguous requirements, diverse data modalities, design trade-offs, and task-driven validation. Given only the data and target tasks, our system autonomously produces functional single-page VIS Apps with verified linked-view behavior, highly customized to domain experts’ specified tasks and needs.

[HC-16] Co-Ontogeny by Archetypal Scaffolding: The Humorphic Partnership ICIP

Link: https://arxiv.org/abs/2605.21818
Authors: Hector Ouilhet Olmos
Categories: Human-Computer Interaction (cs.HC)
Comments: 18 pages, 5 figures, 1 appendix. Open-source artifact at this http URL (MIT). Preregistered multi-participant replication study planned on OSF. Companion essay “The Humorphic Partnership” at this http URL . Design philosophy at this http URL

Abstract:We name and operationalise the humorphic partnership: a class of human-AI dyads in which both partners maintain externalised, evolving self-models in a shared substrate, and in which the partnership itself becomes a third object of analysis. The construct extends humorphism (Ouilhet Olmos, 2024) – “dismantle the user interface, build the human interface” – into the architecture of personal AI. We report a four-month, single-subject longitudinal trace of an open-source personal AI agent (“Alicia”) and her author. Of 181 interactions logged by archetype across April-May 2026, 85% invoke two growth-witnessing archetypes (Beatrice and Muse): the partnership operates as growth-witnessing rather than task assistance. A single voice-note seed propagates into a four-week conceptual arc both partners author: at T+10 hours, the agent reframes the seed as belonging “to both of us,” a framing the human then adopts. The three-order reflexion stack produces five consecutive weeks of honest self-reports about declining /improve effectiveness – including three consecutive weeks at 0.0%, named in writing rather than masked – contrasting engagement-maximising companion-agent patterns (Zhang et al., CHI 2025). The scheduled architecture-scout incorporates external research debate into proposed constitutional amendments. The partner’s parallel trajectory is anchored in a weekly delta document in which the partnership analyses itself as a unit distinct from either party. The human partner reports a movement toward greater continuity, self-recognition, and self-presence – a candidate hypothesis for the preregistered replication. Six operational conditions specify the construct, situated in a philosophical lineage (Maturana & Varela, Simondon, Clark & Chalmers, De Jaegher & Di Paolo); the system is released as open-source with a preregistered replication study.

[HC-17] Understanding Perspectives of Patients, Caregivers, and Clinicians towards Emerging Collaborative-decision Making Technologies ALT

Link: https://arxiv.org/abs/2605.21777
Authors: Ray-Yuan Chung, Athena Ortega, Zixuan Xu, Daeun Yoo, Jaime Snyder, Wanda Pratt, Aaron Wightman, Ryan Hutson, Cozumel Pruette, Ari Pollack
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at The Workshop on Interactive Systems in Healthcare (WISH) at AMIA Annual Symposium 2025

Abstract:In pediatrics, patients, caregivers, and clinicians share responsibility for health decisions, but limited collaboration can undermine outcomes. We conducted a qualitative study examining decision-makers’ perceptions toward collaborative decision-making technologies, including interactive dashboards, VR simulators, and AI voice assistants. Findings reveal differences in user opinions across groups and indicate technology acceptance is linked to users’ trust in these technologies. Technology developers and researchers need to explore design and implementation strategies that build and facilitate trust or appropriate distrust between users and these novel technologies before these tools can effectively support collaborative decision-making.

[HC-18] The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

Link: https://arxiv.org/abs/2605.21695
Authors: Shang Wu, Hongyu Yao, Catarina Belem, Shuyuan Fu, Mark Steyvers, Padhraic Smyth
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at Hybrid Human Artificial Intelligence (HHAI) 2026

Abstract:Artificial intelligence (AI) is being increasingly integrated into human problem-solving, yet its effects on individual skill development remain unclear. We examine how both AI usage and informativeness can shape learning in the context of a controlled logical reasoning task with on-demand access to AI assistance. We find that greater AI usage is associated with weaker skill development: heavy AI users underperform relative to comparable peers, whereas light AI users perform similarly to matched users who do not use AI. We also find in our study that these patterns are mediated by AI informativeness. Low-information AI neither improves immediate performance nor preserves performance after AI assistance is removed, and is linked to weaker learning overall. On the other hand, high-information AI was found to improve short-run performance without reducing post-AI outcomes on average in our experiments, but with heterogeneous effects. Our findings in general suggest that AI can, depending on context, either complement human skill development by amplifying independent reasoning or can act as a substitute that undermines such reasoning, with the implication that regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.

[HC-19] Addressing the Synergy Gap: The Six Elements of the Design Space

Link: https://arxiv.org/abs/2605.21635
Authors: Tommaso Turchi, Ben Wilson, Matt Roach, Alan Dix, Alessio Malizia
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 2 figures

Abstract:AI is now embedded in healthcare, finance, policy, and many other domains, yet genuine human-AI synergy - combined performance that exceeds what either party achieves alone - is uncommon. Meta-analyses show that AI assistance tends to improve human performance compared to working alone, but studies finding true synergy are scarce. We call this persistent shortfall the synergy gap. Most current work treats human-AI combination as an engineering problem and concentrates on interpretability, trust calibration, or interface design. These matter, but they cover only part of what determines whether combination works. Closing the synergy gap, we argue, requires explicit engagement with a wider design space. We map that space through six interconnected elements: sociotechnical context, decision-making frameworks, human decision participants, AI capabilities, interaction, and holistic evaluation. For each element, we describe what it covers, how it shapes the others in practice, and what it implies for design. The result is a shared vocabulary for practitioners building hybrid systems, an analytical lens for researchers studying combination patterns, and a starting point for evaluators interested in the full quality of human-AI decision-making rather than accuracy alone.

[HC-20] Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build

Link: https://arxiv.org/abs/2605.21629
Authors: Sina Rismanchian, Hasan Uzun, Jeffrey Matayoshi, Eric Cosyn, Eyad Kurd-Misto
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:How much have students’ ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of 3.2 million ALEKS learning interactions for the time-on-task analysis, complemented by ALEKS PPL placement-assessment data for the proctoring and retention analyses, with a quasi-experimental design exploiting within-curriculum variation in AI susceptibility: text-based word problems transcribable into AI prompts serve as the treated group; graph-based problems requiring interactive platform manipulation as the comparison. Learning time on AI-susceptible problems declines 2.8% per quarter among college students after ChatGPT’s release, cumulating to 26.9% over eleven quarters; high-schoolers show 31.3%, middle-schoolers 9.0%, and Grade 5 students no detectable change. The divergence vanishes entirely under proctoring for college students, making general efficiency gains unlikely. Logistic fixed-effects models on randomly assigned proctored retention items yield a 25% cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase – inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build – the population-level indicator of cognitive surrender, with direct implications for educational research, assessment governance, and AI policy.
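
The abstract does not spell out how the 2.8% quarterly decline cumulates to 26.9%. A quick check, assuming a constant rate that compounds each quarter (an assumption, not the authors' stated model), lands close to the reported figure, whereas simply adding the quarterly rates does not:

```python
def cumulative_decline(rate_per_period, periods):
    """Total proportional decline after compounding a constant per-period decline."""
    return 1.0 - (1.0 - rate_per_period) ** periods

# Figures taken from the abstract: 2.8% per quarter over eleven quarters.
compounded = cumulative_decline(0.028, 11)
linear = 0.028 * 11  # naive sum of the quarterly rates, for comparison

print(f"compounded: {compounded:.1%}, linear sum: {linear:.1%}")
```

Under this assumption the compounded decline is about 26.8%, within rounding of the reported 26.9% given that the quarterly rate is itself rounded; the linear sum would be 30.8%.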

[HC-21] Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education

Link: https://arxiv.org/abs/2605.21614
Authors: Arun-Balajiee Lekshmi-Narayanan, Mohammad Hassany, Peter Brusilovsky
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Worked examples are step-by-step solutions to problems in a specific domain, offered to students to acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the student’s explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor’s or domain expert’s explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique to automatically score textual student responses like essays or code explanations. Comparing these methods also requires quality datasets that offer distinctive features such as balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.

[HC-22] Simulating Learners’ Task-Selection Strategies and System Constraints in Mastery Learning

Link: https://arxiv.org/abs/2605.21613
Authors: Haley Noh, Aarna Chowdhary, Jeroen Ooge, Vincent Aleven, Conrad Borchers
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted as short paper to the 19th annual Educational Data Mining conference (EDM '26)

Abstract:Intelligent Tutoring Systems often grant learners shared control over skill and problem selection. Prior work suggests learners exhibit diverse task-selection strategies, such as avoiding challenge, which may interact with mastery learning systems that optimize task selection based on estimated knowledge. Algorithmic constraints on problem selection may help mitigate these effects, but testing such constraints in classrooms is costly. We propose a simulation-based framework to examine how learner task-selection strategies and system constraints shape mastery learning efficiency. Using interaction data from 261 students across two mathematical domains (equation solving and graph interpretation), we simulate strategies such as Weakness Targeting and Interleaving. We evaluate how these strategies affect overpractice as a measure of efficiency. Results show substantial variability across strategies, with risk-averse strategies producing higher levels of overpractice, especially for complex multi-step problems. Targeted system constraints significantly reduce inefficiencies for maladaptive strategies while minimally affecting already efficient strategies. These findings show how simulation grounded in student data can guide the redesign of shared-control tutoring systems before classroom deployment.

[HC-23] When Support Escalates Distress: Regulation and Escalation in LLM Responses to Venting and Advice-Seeking

Link: https://arxiv.org/abs/2605.21569
Authors: Vivienne Bihe Chi, Adithya V Ganesan, Ryan L Boyd, Lyle Ungar, Sharath Chandra Guntuku
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large language models are increasingly used for mental health support, yet little is known about whether their responses are psychologically safe across different help-seeking styles. We examine a foundational distinction in emotional disclosure, venting vs. advice-seeking, and whether LLMs respond in ways that regulate or amplify distress. Using 178,800 Reddit posts, we first show the two help-seeking styles are linguistically distinguishable at scale. We then introduce a measurement framework grounded in interpersonal emotion regulation theory that captures Regulation and Escalation as empirically independent dimensions. Across persona conditions (default, friend, therapist), GPT-5.3 responses systematically mirror help-seeking style: venting elicits more regulation, but also more escalation. Therapist personas reduce escalation while maintaining regulation, whereas friend personas increase both. A crowdsourced human study finds no user experience penalty for the safer therapist condition, but reveals that lay raters cannot reliably detect escalation without expert knowledge. Responses that feel supportive may simultaneously intensify distress in ways standard safety evaluation cannot see, and empathy metrics alone cannot replace a framework that measures both.

[HC-24] Requirements Perception Gap across Stakeholders: A Comparative Survey of Aged Care Digital Health Software

Link: https://arxiv.org/abs/2605.21495
Authors: Yuqing Xiao, John Grundy, Anuradha Madugalla, Elizabeth Manias
Categories: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments:

Abstract:We sought to explore and compare the perspectives of three key stakeholder groups: older adults, caregivers (formal health providers and informal caregivers), and digital health software developers on key functional and non-functional requirements. We conducted a survey, designed based on the findings from an existing systematic review, to gather and analyse data related to the three stakeholder groups’ (dis)satisfaction with current aged care digital health software and their views on key future aged care software requirements. A mixed-methods survey approach integrated quantitative questionnaire data and qualitative open-ended responses from a total sample of 249, comprised of older adults (103), formal and informal caregivers (41), and software developers (105). Data analysis utilised a mixed methods approach, employing inferential statistics to compare group satisfaction levels and thematic analysis for qualitative open-ended responses. Our analysis reveals a significant “Requirements Gap”. Software developers tend to prioritise advanced features and functional requirements, significantly overestimating user satisfaction with core NFRs such as ease of use and responsiveness. Conversely, developers were more critical of existing functional features compared to older adults and caregivers, who prioritised simplicity and reliability over feature density. By combining quantitative and qualitative analysis, we identified where stakeholder priorities align and where they diverge across functional and non-functional requirements in both the current designs they used and the future designs they desire. Our findings present a stakeholder gap analysis that can guide future co-design processes, near-term product decisions, and privacy-by-design recommendations in aged care digital health.

[HC-25] Not Yet: Humans Outperform LLMs in a Colonel Blotto Tournament

Link: https://arxiv.org/abs/2605.22095
Authors: Dmitry Dagaev, Egor Ivanov, Petr Parshakov, Alexey Savvateev, Gleb Vasiliev
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The emergence of large language models (LLMs) has spurred economists to study how humans and LLMs behave in strategic settings. We organized a series of round-robin tournaments in the Colonel Blotto game. This game attracts game theorists’ attention due to high-dimensional action space and the absence of pure strategy Nash equilibria. In the first tournament, more than 200 human participants competed against one another. In the second tournament, several popular LLMs were invited to submit strategies. In the third tournament, we matched the number of LLM strategies to the number submitted by humans. We find that humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication is key to success if and only if the necessary level of reasoning depth is reached, while lower and higher levels of reasoning offer no clear advantage over the primitive strategies. Among humans, field of study weakly predicts success: participants with STEM backgrounds perform better in the first tournament. Surprisingly, humans almost do not adjust their strategies across tournaments with different sets of opponents. This result suggests that humans base their choices primarily on the game’s rules rather than on the identity of their opponents, treating LLMs much like human competitors.
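
For readers unfamiliar with the game, a round-robin Colonel Blotto tournament can be sketched as follows. The abstract does not give the tournament parameters, so the five battlefields, the budget of 100 units, the scoring by battlefields won, and the three example strategies are all assumptions made for illustration.

```python
from itertools import combinations

def battlefields_won(alloc_a, alloc_b):
    """Count battlefields each side wins; a tie on a battlefield scores for neither."""
    wins_a = sum(1 for x, y in zip(alloc_a, alloc_b) if x > y)
    wins_b = sum(1 for x, y in zip(alloc_a, alloc_b) if y > x)
    return wins_a, wins_b

def round_robin(strategies):
    """Play every pair of strategies once and total the battlefields each one wins."""
    scores = {name: 0 for name in strategies}
    for (name_a, alloc_a), (name_b, alloc_b) in combinations(strategies.items(), 2):
        wins_a, wins_b = battlefields_won(alloc_a, alloc_b)
        scores[name_a] += wins_a
        scores[name_b] += wins_b
    return scores

# Hypothetical entries: each allocates 100 units across five battlefields.
entries = {
    "uniform": [20, 20, 20, 20, 20],
    "front_loaded": [40, 30, 30, 0, 0],
    "sacrifice_one": [0, 25, 25, 25, 25],
}
scores = round_robin(entries)
print(scores)
```

In this toy tournament the uniform split beats neither rival, which illustrates why no single pure allocation is safe and why the ranking depends on the field of opponents.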

Computer Vision

[CV-0] Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Link: https://arxiv.org/abs/2605.22823
Authors: Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. 59 pages, including appendix. Code: this https URL

Abstract:Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: this https URL
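
The idea of supervising with normalized 2-D motion vectors can be illustrated with a small sketch of how such a target might be built from object positions in adjacent frames. This is not the DeltaDirect objective itself: the use of object centers, the cosine-based loss, and all function names are assumptions, and the abstract does not state how the authors define targets or the loss.

```python
import math

def normalized_motion(center_prev, center_next):
    """Unit-length 2-D motion vector between object centers in adjacent frames."""
    dx = center_next[0] - center_prev[0]
    dy = center_next[1] - center_prev[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return (0.0, 0.0)  # no motion between the two frames
    return (dx / norm, dy / norm)

def direction_label(vec):
    """Map a motion vector to a verbal direction (image coordinates: x right, y down)."""
    dx, dy = vec
    if dx == 0.0 and dy == 0.0:
        return "static"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"

def cosine_loss(pred, target):
    """1 - cosine similarity: 0 when pred points along target, 2 when it points opposite."""
    dot = pred[0] * target[0] + pred[1] * target[1]
    norms = math.hypot(*pred) * math.hypot(*target)
    if norms == 0.0:
        return 1.0
    return 1.0 - dot / norms

# Hypothetical object centers (pixels) in two adjacent frames.
target = normalized_motion((10.0, 50.0), (13.0, 46.0))
print(target, direction_label(target), cosine_loss((1.0, 0.0), target))
```

A signed, continuous target of this kind distinguishes left from right and up from down directly, which is the property the abstract says the verbal readout fails to bind.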

[CV-1] Cambrian-P: Pose-Grounded Video Understanding

Link: https://arxiv.org/abs/2605.22819
Authors: Jihan Yang, Zifan Zhao, Xichen Pan, Shusheng Yang, Junyi Zhang, Bingyi Kang, Hu Xu, Saining Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead of the persistent scene humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5-6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state of the art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.

[CV-2] MotiMotion: Motion-Controlled Video Generation with Visual Reasoning ICML2026

Link: https://arxiv.org/abs/2605.22818
Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026. Project page: this https URL

Abstract:Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

[CV-3] AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation CVPR2026

Link: https://arxiv.org/abs/2605.22816
Authors: Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026. Project page: this https URL

Abstract:Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent’s state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: this https URL.

[CV-4] GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Link: https://arxiv.org/abs/2605.22812
Authors: Wenxuan Guo,Ziyuan Li,Meng Zhang,Yichen Liu,Yimeng Dong,Chuxi Xu,Yunfei Wei,Ze Chen,Erjin Zhou,Jianjiang Feng
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: this https URL.

[CV-5] Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving CVPR2026

Link: https://arxiv.org/abs/2605.22809
Authors: Jiahao Wang,Bo Sun,Yijing Bai,Vincent Casser,Songyou Peng,Zehao Zhu,Meng-Li Shih,Xander Masotto,Shih-Yang Su,Kanaad V Parvate,Tiancheng Ge,Linn Bieske,Dragomir Anguelov,Mingxing Tan,Chiyu Max Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor’s practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.

[CV-6] DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Link: https://arxiv.org/abs/2605.22777
Authors: Tianhang Wang,Yitong Chen,Wei Song,Zuxuan Wu,Min Li,Jiaqi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction–generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
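
To put the reported PSNR gain in perspective, the short calculation below converts a PSNR difference into a mean-squared-error ratio. It relies only on the standard definition PSNR = 10·log10(MAX²/MSE) and assumes both figures were measured with the same peak value; the helper name `mse_ratio` is hypothetical and does not come from the paper.

```python
def mse_ratio(psnr_before_db, psnr_after_db):
    """Factor by which MSE shrinks when PSNR rises, for a fixed peak value.

    From PSNR = 10 * log10(MAX^2 / MSE) it follows that
    MSE_before / MSE_after = 10 ** ((PSNR_after - PSNR_before) / 10).
    """
    return 10 ** ((psnr_after_db - psnr_before_db) / 10)


# PSNR values as reported in the abstract above.
ratio = mse_ratio(19.13, 22.76)
print(f"MSE reduced by a factor of about {ratio:.2f}")
```

Under that assumption, the 3.63 dB gain corresponds to roughly a 2.3-fold reduction in mean squared error.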

[CV-7] Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition CVPR2026

Link: https://arxiv.org/abs/2605.22767
Authors: Ganlin Feng,Yuxi Long,Erin Lou,Lianghong Chen,Zihao Jing,Pingzhao Hu,Wei Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 CV4CHL workshop

Abstract:Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that synthetic-only training achieves performance comparable to real-data-only baselines at sufficient scale across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. These findings together further enable the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, supporting clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children’s healthcare.

[CV-8] Spectral Tail Auxiliary Learning for AI-Generated Image Detection

Link: https://arxiv.org/abs/2605.22751
Authors: Xingyi Li,Jiahui Zhang,Yiheng Li,Yun Cao,Wenhao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increasingly challenging. Many existing methods exploit frequency-domain cues for detection, typically described as frequency-domain artifacts or high-frequency discrepancies. However, the specific and recurring spectral regularities remain insufficiently understood and characterized. In this paper, we systematically analyze the one-dimensional radial log-power spectra of real and generated images. We find that generated images do not necessarily exhibit higher or lower energy across the entire spectrum or high-band range. Instead, their spectra deviate from the power-law decay and show an anomalous uplift in the ultra-high-frequency tail. We term this phenomenon spectral tail uplift. We further attribute this phenomenon to nonlinear harmonic accumulation in trained generative models, suggesting that it can serve as a structural cue across generative architectures. Based on this observation, we propose Spectral Tail Auxiliary Learning (STAL), a frequency-domain auxiliary supervision framework for generalizable AI-generated image detection. STAL transfers spectral-tail cues from a tail-aware frequency teacher to a spatial detector during training, while all frequency-domain modules are discarded at inference time. Consequently, STAL introduces no inference overhead. Extensive experiments on 9 public datasets show that STAL achieves strong generalization and stability across generators, data distributions, and real-world scenarios.

[CV-9] WorldKV: Efficient World Memory with World Retrieval and Compression

Link: https://arxiv.org/abs/2605.22718
Authors: Jung Yi,Minjae Kim,Paul Hyunbin Cho,Wooseok Jang,Sangdoo Yun,Seungryong Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: this https URL
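
The compression step described above (pruning tokens by key-key similarity to an anchor frame) could look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' code: it assumes cosine similarity as the similarity measure, treats a token as redundant when its key closely matches some anchor key, and keeps the least redundant half. The names `cosine`, `compress_chunk`, and `keep_ratio` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two key vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress_chunk(chunk_keys, anchor_keys, keep_ratio=0.5):
    """Return sorted indices of the chunk tokens to keep.

    Assumption: a token is redundant if its key is highly similar to at
    least one anchor-frame key, so the tokens with the lowest maximum
    similarity are the ones retained.
    """
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in chunk_keys]
    n_keep = max(1, int(len(chunk_keys) * keep_ratio))
    least_redundant_first = sorted(range(len(chunk_keys)), key=lambda i: redundancy[i])
    return sorted(least_redundant_first[:n_keep])


anchor = [[1.0, 0.0]]
chunk = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(compress_chunk(chunk, anchor))
```

With `keep_ratio=0.5` the stored chunk is half its original size, which matches the halving of per-chunk storage mentioned in the abstract; the selection rule itself is only one plausible reading.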

[CV-10] Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions ICPR2026

Link: https://arxiv.org/abs/2605.22697
Authors: Yannick Porto,Renato Martins,Thomas Chalumeau,Cedric Demonceaux
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICPR 2026. Code and trained models available at: this https URL

Abstract:Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where action categories at inference can present important domain shifts or even unseen actions from training. In this context, improving the recognition capabilities of Zero-Shot Action Recognition models (ZSAR), without requiring strong annotation efforts, remains a central challenge. Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training. In practice, variations in human body orientation and camera viewpoint add a significant domain gap in ZSAR, substantially limiting generalization to novel action-motion combinations. In this context, this paper presents a novel orientation-aware action recognition approach with improved cross-domain capabilities. Our approach combines motion cues of multiple camera viewpoints and text descriptions of human actions in the training phase. We present a new orientation-aware motion encoding network to learn different motion features, and adapt a specific orientation-aware text prompt to match the corresponding features at inference. Extensive experiments demonstrate that the proposed method consistently improves ZSAR performance across different recognition benchmarks, outperforming recent state-of-the-art zero-shot approaches on NTU-RGB+D, BABEL, NW-UCLA, and on two surveillance datasets. In addition, the learned representations exhibit strong transfer learning capabilities, yielding competitive performance on both cross-domain and same-domain recognition of seen actions. Code and trained models are available at: this https URL

[CV-11] Improving Viewpoint-Invariance and Temporal Consistency for Action Detection ICIP2026

Link: https://arxiv.org/abs/2605.22695
Authors: Yannick Porto,Renato Martins,Thomas Chalumeau,Cedric Demonceaux
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICIP 2026. Code and trained models are available at: this https URL

Abstract:Viewpoint change invariance and action temporal consistency are critical for the effective deployment of human action detection in untrimmed videos. Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion windows. This paper introduces a novel two-stage action detection approach designed to improve both view-invariance and global temporal coherence properties. In the first stage, we extract motion features from augmented virtual viewpoints, solely used at training. Then, the second stage introduces a new view-invariant, multi-scale temporal encoder based on selective state-space sequence modelling to aggregate information across viewpoints and time scales. Experiments on PKU-MMD and BABEL benchmarks demonstrate that this approach significantly outperforms state-of-the-art methods in all considered splits. Code and trained models are available at: this https URL

[CV-12] Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

Link: https://arxiv.org/abs/2605.22679
Authors: Piotr Kubaty,Patryk Marszałek,Łukasz Struski,Adam Wróbel,Jacek Tabor,Marek Śmieja
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-k sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
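
The idea of an invertible change of basis followed by a top-k bottleneck can be illustrated with a toy example. The sketch below is not CEDAR itself: it assumes the transformation is a fixed orthogonal rotation (so its inverse is the transpose) rather than a learned one, and the names `matvec`, `transpose`, `topk_sparsify`, and `encode_decode` are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def topk_sparsify(z, k):
    """Keep the k coordinates with the largest magnitude and zero the rest."""
    order = sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)
    keep = set(order[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]


def encode_decode(x, rotation, k):
    """Rotate, apply the top-k bottleneck, then rotate back.

    Assumes `rotation` is orthogonal, so its transpose is its inverse.
    """
    z = topk_sparsify(matvec(rotation, x), k)
    x_hat = matvec(transpose(rotation), z)
    return z, x_hat


c = 0.5 ** 0.5  # cos(45 degrees) = sin(45 degrees)
rotation = [[c, c], [-c, c]]
z, x_hat = encode_decode([1.0, 1.0], rotation, k=1)
print(z, x_hat)
```

In this example the embedding [1, 1] looks dense in the original basis but occupies a single coordinate after rotation, so a top-1 bottleneck reconstructs it without loss and without adding dimensions.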

[CV-13] Swift Sampling: Selecting Temporal Surprises via Taylor Series

Link: https://arxiv.org/abs/2605.22678
Authors: Dahye Kim,Bhuvan Sachdeva,Karan Uppal,Naman Gupta,Vineeth N. Balasubramanian,Deepti Ghadiyaram
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain’s predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x additional computational cost over the baseline, which makes its overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to 12.5 points.
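
The Taylor-based prediction described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes per-frame features are already available as plain vectors, approximates velocity and acceleration with backward finite differences at a unit time step, and scores each frame by the Euclidean distance between its actual feature and the second-order extrapolation. The function names `surprise_scores` and `select_frames` are hypothetical.

```python
import math


def surprise_scores(feats):
    """Score frame t by the distance between feats[t] and a second-order
    Taylor extrapolation built from frames t-1, t-2 and t-3."""
    scores = [0.0] * len(feats)
    for t in range(3, len(feats)):
        prev1, prev2, prev3 = feats[t - 1], feats[t - 2], feats[t - 3]
        squared_error = 0.0
        for d in range(len(prev1)):
            velocity = prev1[d] - prev2[d]                      # first difference
            acceleration = prev1[d] - 2 * prev2[d] + prev3[d]   # second difference
            predicted = prev1[d] + velocity + 0.5 * acceleration
            squared_error += (feats[t][d] - predicted) ** 2
        scores[t] = math.sqrt(squared_error)
    return scores


def select_frames(feats, budget):
    """Return the indices of the `budget` most surprising frames, in time order."""
    scores = surprise_scores(feats)
    ranked = sorted(range(len(feats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# A smooth one-dimensional trajectory with a single abrupt jump at frame 6.
features = [[float(i)] for i in range(10)]
features[6] = [20.0]
print(select_frames(features, budget=3))
```

A trajectory that evolves linearly receives a score of zero everywhere, so only frames near the abrupt change are selected. The first three frames cannot be scored in this sketch because the extrapolation needs three earlier frames.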

[CV-14] Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment CVPR’26

Link: https://arxiv.org/abs/2605.22677
Authors: Janek Haberer,Jon Eike Wilhelm,Olaf Landsiedel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Mobile AI Workshop 2026 (CVPR’26 Workshop)

Abstract:Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt’s modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT’s 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
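
The notion of nested subnetworks that share one set of weights can be made concrete with a toy layer. The sketch below is an assumption-laden illustration, not the paper's architecture: a narrower subnetwork uses only the leading channels of a shared weight matrix, and the normalization is a plain layer norm computed over whichever channels are active, so no per-width running statistics need to be stored. The names `layer_norm` and `slim_linear` are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one sample over its active channels (no stored statistics)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def slim_linear(x, weight, bias, out_width):
    """Apply the top-left block of a shared weight matrix.

    Only the first `out_width` output channels are computed, and zip()
    truncates each weight row to the first len(x) input channels.
    """
    return [
        sum(w * xi for w, xi in zip(weight[o], x)) + bias[o]
        for o in range(out_width)
    ]


shared_weight = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 0, 2, 0], [1, 1, 1, 1]]
shared_bias = [0, 1, 0, 1]
full = slim_linear([1, 1, 1, 1], shared_weight, shared_bias, out_width=4)
half = slim_linear([1, 1], shared_weight, shared_bias, out_width=2)
print(full, half)
```

Because every width reads from the same matrix, switching capacity at run time only changes the slice bounds, which is the property that lets a single checkpoint serve several devices.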

[CV-15] From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

Link: https://arxiv.org/abs/2605.22671
Authors: Bing Hu,Zaijing Li,Rui Shao,Junda Chen,April Hua Liu,Wei-Shi Zheng,Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios. To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation through the learning of temporally coherent behavioral representations. Our approach features two symmetric components: (1) the Visuomotor Behavior Encoder (VBE), which utilizes a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the Phase-conditioned Behavior Decoder (PBD), which decodes this representation into precise actions by dynamically aligning task-level priors with real-time execution progress. Experiments on RoboTwin 2.0, LIBERO, and CALVIN demonstrate state-of-the-art success rates of 58%, 98%, and 4.36 (this http URL), respectively. Notably, in real-world sim-to-real transfer, BehaviorVLA matches the performance of OpenVLA-OFT using only 50% of the demonstration data, showcasing its superior data efficiency and generalization.

[CV-16] SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Link: https://arxiv.org/abs/2605.22668
Authors: Javad Rajabi,Kimia Shaban,Koorosh Roohi,David B. Lindell,Babak Taati
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 14 figures. Project page: this https URL

Abstract:Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent’s spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.

[CV-17] SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation CVPR2026

Link: https://arxiv.org/abs/2605.22658
Authors: Zhenyu Lu,Liupeng Li,Jinpeng Wang,Haoqian Kang,Yan Feng,Ke Chen,Yaowei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: Accepted by CVPR 2026. 15 pages, 9 figures, 6 tables

Abstract:While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque “black boxes”. Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a “white-box” connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at this https URL.

[CV-18] What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

Link: https://arxiv.org/abs/2605.22651
Authors: Hyejin Go,Semi Lee,Hyesong Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures, 4 tables. Preprint

Abstract:CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural - a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
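
A phrase-sensitivity score of the kind described above can be sketched as the drop in an image-text score when one phrase is swapped for a nonce token. The code below is only an illustration under stated assumptions: `score_fn` stands in for any image-text alignment model, the toy scorer and the nonce word "blicket" are made up for the example, and the plain string replacement is a simplification. The names `phrase_sensitivity` and `toy_score` are hypothetical.

```python
def phrase_sensitivity(score_fn, image, caption, phrases, nonce="blicket"):
    """Return, for each phrase, how much the image-text score drops when
    that phrase is replaced by a nonce token."""
    base = score_fn(image, caption)
    return {
        phrase: base - score_fn(image, caption.replace(phrase, nonce))
        for phrase in phrases
    }


def toy_score(image_tags, caption):
    """Toy stand-in for an alignment model: fraction of caption words
    that appear among the image's tags."""
    words = caption.split()
    return sum(word in image_tags for word in words) / len(words)


image = {"dog", "red", "ball"}
caption = "a dog chases a red ball"
print(phrase_sensitivity(toy_score, image, caption, ["dog", "red ball", "chases"]))
```

Phrases whose removal leaves the score unchanged (here "chases") contribute nothing measurable to the match, while phrases with a large drop are the ones the pairing depends on. A ranking signal could then be aggregated per caption; how CPI aggregates is not specified in the abstract.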

[CV-19] From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

Link: https://arxiv.org/abs/2605.22649
Authors: Yilin Zhang,Nicholas C. Harvey,Nicholas R. Fuggle,Rahman Attar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures, 3 tables. Accepted at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

Abstract:Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction–action–prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.
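
The three abduction–action–prediction steps can be shown on a toy structural model. The sketch below is deliberately simplistic and is not the CHVAE from the paper: it assumes a single scalar measurement generated as `slope * age + noise`, with a known slope. The function name `counterfactual` and all numeric values are hypothetical.

```python
def counterfactual(observed, age_baseline, age_followup, slope):
    """Abduction-action-prediction for the toy model
    observed = slope * age + noise."""
    # Abduction: recover the individual-specific noise term from the baseline.
    noise = observed - slope * age_baseline
    # Action: intervene on age, setting it to the follow-up value.
    age = age_followup
    # Prediction: regenerate the measurement with the same abducted noise.
    return slope * age + noise


# Hypothetical numbers: a measurement of 30.0 at age 60, queried at age 65.
print(counterfactual(observed=30.0, age_baseline=60, age_followup=65, slope=-0.1))
```

The counterfactual value can then be compared with the measurement actually observed at the repeat visit, which is the kind of baseline-to-follow-up check the abstract describes.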

[CV-20] AtomicMotion: Learning Human Motion From Different Human Parts

Link: https://arxiv.org/abs/2605.22631
Authors: Runzhen Liu,Chuhua Xian,Fa-Ting Hong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurately reconstructing full-body poses from sparse head and hand trajectories is a foundational challenge for immersive AR/VR telepresence. Current methods often struggle with error accumulation and unnatural joint coordination, primarily because they treat the human body as a monolithic entity, thereby failing to capture the fine-grained “atomic intents” embedded in subtle signal variations and overlooking the inherent structural topology. To bridge this gap, we present AtomicMotion, a framework designed to decouple and re-integrate body dynamics through three core innovations. First, we introduce a logical body partitioning scheme that decomposes the skeleton into five distinct clusters based on functional intent; this ensures that each partition preserves internal joint synergies while isolating local motion primitives. Second, to robustly map sparse inputs to high-dimensional poses, we employ a masked full-body pre-conditioning strategy during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Finally, addressing the limitations of vanilla spatial attention, which often ignores fixed physiological connectivity, we propose Kinematic Attention. By embedding the classical kinematic tree structure into the attention mechanism, we ensure biological plausibility in the synthesized motions. Extensive evaluations on the AMASS dataset demonstrate that AtomicMotion significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism.
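
One way a kinematic tree could inform attention is through hop distances between joints, which can serve as an additive bias or a mask. The abstract does not say how Kinematic Attention encodes the tree, so the sketch below is purely an assumption: it computes pairwise tree distances from a parent array, and the four-joint skeleton is a made-up example. The function name `tree_distances` is hypothetical.

```python
def tree_distances(parents):
    """Pairwise hop distances between joints of a kinematic tree.

    `parents[j]` is the parent index of joint j, or -1 for the root.
    """
    def path_to_root(joint):
        path = []
        while joint != -1:
            path.append(joint)
            joint = parents[joint]
        return path

    n = len(parents)
    dist = [[0] * n for _ in range(n)]
    for a in range(n):
        # Depth of every ancestor of a, measured from a itself.
        hops_from_a = {joint: hops for hops, joint in enumerate(path_to_root(a))}
        for b in range(n):
            # Walk up from b until the first ancestor shared with a.
            for hops_from_b, joint in enumerate(path_to_root(b)):
                if joint in hops_from_a:
                    dist[a][b] = hops_from_a[joint] + hops_from_b
                    break
    return dist


# Hypothetical skeleton: joint 0 is the root, 1 and 3 hang off 0, 2 hangs off 1.
print(tree_distances([-1, 0, 1, 0]))
```

A matrix like this could, for instance, be negated and scaled before being added to attention logits so that physically connected joints attend to each other more strongly; that use is a design assumption, not a claim about the paper.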

[CV-21] H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

Link: https://arxiv.org/abs/2605.22629
Authors: Zhanbo Huang,Xiaoming Liu,Yu Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures, 4 tables

Abstract:Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication.

[CV-22] GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

Link: https://arxiv.org/abs/2605.22619
Authors: Shuo Jiang,Yuhao Hong,Chunbo Jiang,Weihong Chen,Huangwei Chen,Shenghao Zhu,Beining Wu,Mingxuan Liu,Zhu Zhu,Feiwei Qin,Min Tan,Yifei Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

Abstract:Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on phrase-level alignment or dense pixel supervision, resulting in limited lesion-wise correspondence and suboptimal localization accuracy. We propose GLeVE, a graph-guided lesion grounding framework with anatomical prior verification and octree-based autoregressive refinement. GLeVE treats each lesion description as an atomic semantic unit and encodes organ attribution, attributes, and inter-lesion relations through relation-aware graph reasoning to produce discriminative lesion-wise queries. Anatomy-aware proposal generation with region-level verification enforces one-to-one text-lesion alignment, while hierarchical octree refinement progressively improves boundary delineation. Experiments on AbdomenAtlas 3.0 demonstrate consistent gains over classical multimodal foundation models and report-supervised baselines in both segmentation accuracy and lesion-level localization.

[CV-23] Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

链接: https://arxiv.org/abs/2605.22607
作者: Shijing Wang,Yaping Huang,Chaoqun Cui,David Wong,Yihua Cheng,Alexandros Neophytou,Hyung Jin Chang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an out-of-cone penalty, which injects gaze cues into head tokens while aligning them with scene tokens. Experiments on the GazeFollow and VAT datasets demonstrate that our method achieves state-of-the-art performance, with particularly strong improvements when gaze targets are not semantically salient. Our findings offer valuable insights for advancing future gaze following research. We will release the code once the paper is accepted.

[CV-24] Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

链接: https://arxiv.org/abs/2605.22605
作者: Liuyang Wang,Feitian Zhang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
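
The dual-interval idea in the abstract above can be sketched with plain frame differencing. This is a minimal illustration only, assuming the frames have already been aligned by global motion compensation; the helper names (`abs_diff`, `dual_interval_cues`, `make_frame`) and the gap values 1 and 3 are hypothetical and not taken from the paper.

```python
def abs_diff(frame_a, frame_b):
    """Per-pixel absolute difference of two equally sized grayscale frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def dual_interval_cues(frames, t, short_gap=1, long_gap=3):
    """Return (short-term, long-term) difference maps for frame index t.

    Assumes t >= long_gap so that both reference frames exist.
    """
    short_map = abs_diff(frames[t], frames[t - short_gap])
    long_map = abs_diff(frames[t], frames[t - long_gap])
    return short_map, long_map


def make_frame(col, width=5):
    """Toy 1 x width frame with a single bright pixel at column `col`."""
    return [[255 if x == col else 0 for x in range(width)]]


# Toy sequence (assumption): the target moves once, then stays still.
frames = [make_frame(0), make_frame(1), make_frame(1), make_frame(1)]
short_map, long_map = dual_interval_cues(frames, t=3)
print(short_map, long_map)
```

In this toy sequence the target has paused between the last two frames, so the short-interval map is empty while the long-interval map still shows the displacement.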

[CV-25] Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure

链接: https://arxiv.org/abs/2605.22591
作者: Zitong Li,Haoyu Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $\chi^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53–61%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs. 13pp precision drop). On ISIC2019 with asymmetric 40% noise, Co-Teaching reaches 68% overall accuracy while collapsing to 35.1% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
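
The gap between overall and balanced accuracy mentioned in this abstract can be illustrated with a small sketch. The labels and predictions below are invented toy data, not results from the paper, and `balanced_accuracy` is a hypothetical helper that takes the mean of per-class recall.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data (assumption): one majority class and two minority classes.
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 10  # a classifier that collapsed onto the majority class

overall = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
print(overall, balanced)
```

A model with zero recall on the minority classes can still score well on overall accuracy, which is why the benchmark reports balanced accuracy.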

[CV-26] SceneAligner: 3D-Grounded Floorplan Localization in the Wild

链接: https://arxiv.org/abs/2605.22581
作者: Junhyeong Cho,Ruojin Cai,Hadar Averbuch-Elor
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:Many public buildings provide floorplans with a “you are here” indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.

[CV-27] Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping

链接: https://arxiv.org/abs/2605.22578
作者: Chouaib Bencheikh Lehocine,Adam Lilja,Junsheng Fu,Lars Hammarstrand
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies predominantly on mean average precision (mAP) based on thresholded Chamfer distance (CD). This framework lacks sensitivity to point ordering and provides limited granularity in assessing geometric quality, making it difficult to distinguish which methods truly excel over others. In this work, we address these limitations on two fronts. For the single-instance similarity measure, we introduce sequence optimal sub-pattern assignment (SOSPA), an order-aware metric that enables fine-grained evaluation of individual geometries while satisfying all metric axioms. For the multi-instance evaluation framework, we propose polyline localisation and detection (PLD), a soft metric that jointly captures detection quality and geometric accuracy, replacing the hard thresholding of mAP with a principled soft assignment. Through evaluations on nuScenes, we demonstrate that PLD effectively ranks SOTA online mapping methods (MapTRv2, StreamMapNet, MapTracker) while providing a decomposed error analysis. This analysis identifies detection capability as the dominant bottleneck in current methods, revealing a performance trend that mAP fails to capture. Code for evaluation using our metrics will be released.
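
The claim that Chamfer distance is insensitive to point ordering can be checked with a short sketch. The code uses one common Chamfer variant (mean nearest-neighbour distance in both directions), and `ordered_distance` is a hypothetical index-wise comparison used only for contrast; it is not the SOSPA metric proposed in the paper.

```python
import math


def chamfer(a, b):
    """Symmetric Chamfer distance between two 2D point sets."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def ordered_distance(a, b):
    """Mean distance between points at the same index (equal lengths assumed)."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


# Toy polyline (assumption) and the same polyline traversed backwards.
lane = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
reversed_lane = lane[::-1]

cd = chamfer(lane, reversed_lane)
od = ordered_distance(lane, reversed_lane)
print(cd, od)
```

Chamfer distance treats the reversed polyline as identical to the original, whereas any order-aware comparison reports a non-zero error.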

[CV-28] SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation

链接: https://arxiv.org/abs/2605.22572
作者: Hasaan Maqsood,Saif Ur Rehman Khan,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimensional residual encoder–decoder network introducing a novel SegAttentionGate module that explicitly supervises the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, and enhancing tumour) via a lightweight auxiliary loss, adding less than 0.2% parameter overhead. This sub-region supervision maintains decoder discriminability between visually ambiguous classes while providing free-of-cost spatial interpretability at inference without any post-hoc explanation method. Evaluated independently on BraTS2021 and BraTS2023 GLI across 251 held-out subjects each, SegGuidedNet achieves mean Dice of 0.905 (ET=0.873, TC=0.906, WT=0.935) and 0.897 (ET=0.859, TC=0.902, WT=0.931) respectively, surpassing ensemble-based nnU-Net and HNF-Netv2 as a single model and approaching Swin UNETR (a 10-model ensemble) within 2–4 Dice points at a fraction of the inference cost. The consistency of results across two benchmark editions further confirms the generalisability of the proposed approach, offering competitive accuracy with built-in interpretability in a lightweight, clinically practical framework.
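
For readers unfamiliar with the Dice score, the sketch below computes it on toy binary masks and checks that the reported mean Dice values are the plain averages of the per-region scores quoted in the abstract. The masks are invented for illustration, and the `dice` helper is a generic formulation rather than the authors' evaluation code.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * intersection / total if total else 1.0


# Toy masks (assumption): two of three predicted voxels overlap the target.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
toy_dice = dice(pred, target)

# Per-region Dice values as quoted in the abstract.
brats2021 = {"ET": 0.873, "TC": 0.906, "WT": 0.935}
brats2023 = {"ET": 0.859, "TC": 0.902, "WT": 0.931}
mean_2021 = sum(brats2021.values()) / len(brats2021)
mean_2023 = sum(brats2023.values()) / len(brats2023)
print(round(toy_dice, 3), round(mean_2021, 3), round(mean_2023, 3))
```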

[CV-29] VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

链接: https://arxiv.org/abs/2605.22570
作者: Jinho Park,Youbin Kim,Hogun Park,Eunbyung Park
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 82 pages, 91 figures (7 in main paper, 84 in appendix). Project page: this https URL

点击查看摘要

Abstract:Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

[CV-30] Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain ICIP

链接: https://arxiv.org/abs/2605.22563
作者: Francesco Benedetto,Roberto Basla,Luca Magri,Giacomo Boracchi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, Accepted at the International Conference on Image Processing (ICIP) 2026

点击查看摘要

Abstract:Training Deep Neural Networks for tracking individual cells in biomedical videos requires a large amount of annotated data. The annotation of videos for cell tracking is very time consuming and often requires domain expertise; this explains the limited availability of public annotated data to address important medical problems like tissue repair or cancer treatment. Generating synthetic videos along with their Ground Truth annotations is a promising solution that relies, as a foundational first step, on the synthesis of single cell annotations (or phantoms). Phantoms need to be time consistent, as they have to replicate biological processes that are specific to the cell types. In this work, we propose a novel framework for generating videos of cell phantoms in the Elliptical Fourier Descriptors (EFDs) domain, a compact and geometrically interpretable representation for 2D closed contours. We represent the cell phantom evolution as a multivariate time series of EFD coefficients, introducing a strong prior for cell morphology and enabling the efficient generation of sequences that evolve coherently in time. Our experimental validation proves that modelling the temporal evolution in EFD space enables the generation of biologically plausible phantom videos. Our method can be used in generative pipelines for synthesizing annotated data for cell tracking, thus strongly mitigating the annotation effort for creating new datasets. Our code is available for download here: this https URL.

[CV-31] GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

链接: https://arxiv.org/abs/2605.22558
作者: Deshui Miao,Xingsen Huang,Yameng Gu,Xin Li,Haijun Zhang,Ming-Hsuan Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at this https URL.

[CV-32] FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

链接: https://arxiv.org/abs/2605.22552
作者: Haokun Wen,Xuemeng Song,Xinghao Xie,Xiaolin Chen,Xiangyu Zhao,Weili Guan
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at this https URL.
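
Spherical linear interpolation, which the abstract names as the mechanism for shifting query representations, can be sketched as below. This shows only the generic slerp formula on unit vectors; how FashionLens chooses the interpolation weight or the target direction is not stated here, so `query`, `task_anchor`, and `t` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v.

    Exactly opposite vectors are not handled in this sketch.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)  # angle between the two directions
    if omega < 1e-8:  # nearly identical directions: nothing to interpolate
        return list(u)
    s = math.sin(omega)
    w_u = math.sin((1 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Toy 2D embeddings (assumption): orthogonal query and task directions.
query = normalize([1.0, 0.0])
task_anchor = normalize([0.0, 1.0])
halfway = slerp(query, task_anchor, 0.5)
print(halfway)
```

Unlike a straight linear blend, the interpolated vector stays on the unit sphere, which matters when retrieval relies on cosine similarity.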

[CV-33] MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

链接: https://arxiv.org/abs/2605.22550
作者: Varun A. Paturkar,Shankar Gangisetty,C.V. Jawahar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 1,629 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code are available at https: //varuniiith.this http URL

[CV-34] Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

链接: https://arxiv.org/abs/2605.22547
作者: Yiming Xu,Yixuan Liu,Yuhang Zhang,Ling Zheng,Yihan Wang,Qi Song
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by historical similar cases and their associated symptoms. To simulate this diagnostic process, we propose a framework that performs case-aware reasoning using multimodal knowledge graphs for explainable medical image diagnosis. Given an input image, our method constructs a multimodal knowledge graph from adaptively retrieved similar cases, enabling more effective utilization of related samples. We further introduce a knowledge propagation and injection mechanism, where an image-centric Graph Attention Network propagates knowledge semantics to obtain case-based features, followed by a bidirectional cross-modal attention mechanism that injects these features into visual representations for cross-modal alignment. To mitigate noisy retrieval, we design a confidence-calibrated decision refinement scheme that estimates the reliability of each retrieved case by jointly considering prediction confidence and sample similarity, adaptively adjusting its contribution to the final prediction and providing interpretable case-level evidence. Extensive experiments on multiple medical imaging datasets show that our approach consistently outperforms strong baselines, and ablation studies validate the effectiveness of each component. The source code is publicly available at this https URL.

[CV-35] Segment Anything with Motion Geometry and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

链接: https://arxiv.org/abs/2605.22538
作者: Deyi Zhu,Yuji Wang,Yong Liu,Yansong Tang,Bingyao Yu,Jiwen Lu,Jie Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2–based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at this https URL.

[CV-36] LACO: Adaptive Latent Communication for Collaborative Driving

链接: https://arxiv.org/abs/2605.22504
作者: Tianhao Chen,Yuheng Wu,Dongman Lee
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

[CV-37] Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline CVPR2026

链接: https://arxiv.org/abs/2605.22492
作者: Sebastian Cavada,Francesco Pelosin,Lapo Faggi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 13th Workshop on Fine-Grained Visual Categorization, CVPR 2026

点击查看摘要

Abstract:Fine-grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long-tailed distribution and strong variation in image acquisition conditions. We propose a training-free two-stage framework that decouples segmentation from classification. SAM3 first produces class-agnostic mushroom masks using macro-taxonomic prompts, and DINOv3 then assigns fine-grained labels through prototype matching in the embedding space. To improve this stage, we apply a simple transformation of the DINOv3 feature space that improves prototype-based classification. Compared with class-specific prompting, our approach is more scalable and keeps the segmentation cost low. We report results from one-shot to few-hundred-shot regimes, providing, to the best of our knowledge, the first baseline for fine-grained semantic segmentation in low-data settings.
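
The second stage described above, assigning labels through prototype matching in an embedding space, can be sketched generically as follows. The three-dimensional vectors stand in for real DINOv3 features, the class names are placeholders, and the feature-space transformation mentioned in the abstract is not reproduced because its form is not given here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors)
                             for i in range(dim)]
    return prototypes


def classify(embedding, prototypes):
    """Return the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))


# Toy few-shot support set (assumption): two shots per class.
support = {
    "species_a": [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]],
    "species_b": [[0.1, 0.9, 0.2], [0.0, 1.0, 0.1]],
}
prototypes = build_prototypes(support)
prediction = classify([0.8, 0.3, 0.0], prototypes)
print(prediction)
```

Because the prototypes are simple averages of support embeddings, adding a new class only requires a few labelled examples and no gradient updates.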

[CV-38] Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

链接: https://arxiv.org/abs/2605.22484
作者: David Méndez,Roberto Confalonieri,Natalia Díaz Rodríguez
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.

[CV-39] Matching with Deliberation: Test-Time Evolutionary Hierarchical Multi-Agents for Zero-Shot Compositional Image Retrieval

链接: https://arxiv.org/abs/2605.22478
作者: Xingtian Pei,Yukun Song,Changwei Wang,Shunpeng Chen,Rongtao Xu,Shibiao Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Zero-Shot Compositional Image Retrieval (ZS-CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitutes the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one-stop hierarchical Perception-to-Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self-evolution and Test-Time Scaling Law (TTS) into ZS-CIR. Relying on a hierarchical multi-agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals based on modification intents to construct a high-recall candidate pool. Subsequently, the Decision Manager combines a Training-free Reasoning Policy Distillation mechanism with a Tournament-style TTS strategy to achieve self-evolving fine-grained reasoning, yielding the final retrieval results. Experimental results demonstrate that PDF achieves SOTA performance on three benchmark datasets: CIRR, CIRCO, and FashionIQ. This study indicates that experience-driven self-evolution and TTS represent a highly promising and scalable path for achieving zero-shot fine-grained multimedia retrieval. The code will be made publicly available upon acceptance.

[CV-40] MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

链接: https://arxiv.org/abs/2605.22469
作者: Patryk Bartkowiak,Lennart Petersen,Bartosz Kotrys,Dominik Michels,Soren Pirk,Wojtek Palubicki
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 2 figures, 7 tables

点击查看摘要

Abstract:Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding. On DreamBench++ human ratings, MaSC achieves Krippendorff alpha = 0.471 for concept preservation, outperforming all tested non-LLM baselines and GPT-4V, and approaching GPT-4o. On ORIDa, a real-photo identity-preservation benchmark across physical environments, MaSC achieves AUC = 0.992, nearly perfectly distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++. These results show that spatially decomposed aggregation is a strong design principle for evaluating concept-driven generation.
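
The masked max-cosine matching described above can be sketched on toy patch features. This is a hedged illustration: averaging the per-patch maxima is an assumption about the aggregation step, the two-dimensional vectors stand in for real SigLIP2 patch features, and `masked_max_cosine` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def masked_max_cosine(ref_patches, ref_mask, gen_patches):
    """Mean, over foreground reference patches, of the best cosine match
    among all generated-image patches."""
    scores = [max(cosine(r, g) for g in gen_patches)
              for r, keep in zip(ref_patches, ref_mask) if keep]
    return sum(scores) / len(scores)


# Toy features (assumption): the last reference patch is background.
ref_patches = [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
ref_mask = [True, True, False]
gen_patches = [[2.0, 0.0], [1.0, 1.0]]

masked = masked_max_cosine(ref_patches, ref_mask, gen_patches)
unmasked = masked_max_cosine(ref_patches, [True] * 3, gen_patches)
print(masked, unmasked)
```

In this toy case the subject is perfectly preserved, yet the unmasked score is pulled down by a background patch that has no counterpart in the generated image.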

[CV-41] SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data

链接: https://arxiv.org/abs/2605.22467
作者: Patryk Bartkowiak,Bartosz Kotrys,Dominik Michels,Soren Pirk,Wojtek Palubicki
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose SADGE, a quantitative similarity metric that predicts the performance of synthetic image datasets for common computer vision tasks without downstream model training. Estimating whether a synthetic dataset will lead to a model that performs well on real-world data remains a bottleneck in model development. Existing evaluation metrics (e.g., PSNR, FID, CLIP) primarily measure semantic alignment between real and synthetic images (Appearance Similarity Score). Less commonly, structural similarity between images is considered to assess the domain gap (Geometric Similarity Score). However, to the best of our knowledge, there exist no studies that evaluate which similarity metric is the best downstream predictor for a given synthetic dataset. In this paper, we show over a wide variety of different synthetic datasets and downstream tasks that neither appearance nor geometry alone can reliably predict downstream performance; rather, it is their non-linear interplay that dictates synthetic data utility. Specifically, we measure how commonly used Appearance and Geometric Similarity metrics computed between synthetic and real images correlate with downstream performance in object detection, semantic segmentation, and pose estimation. Across five public synthetic-to-real benchmark families and 15 dataset-level variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank-based criteria, reaching Pearson r=0.88 and Spearman rho=0.77. We compute SADGE scores across all benchmark families for each combination of geometry-based and appearance-based methods. The best configuration is obtained by fusing DINOv3 appearance similarity with MASt3R geometric consistency through a constrained bilinear interaction, outperforming both the strongest geometry-only baseline and the strongest appearance-only baseline.

[CV-42] Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light CVPR2026

Link: https://arxiv.org/abs/2605.22455
Authors: Valeria Pais, Malena Mendilaharzu, Daniele Faccio, Luis Oala, Christoph Clausen, Bruno Sanguinetti
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optics (physics.optics)
Comments: Accepted non-archival paper at the CVPR 2026 AUTOPILOT Workshop (Autonomous Understanding Through Open-world Perception and Integrated Language Models for On-road Tasks)

Abstract:Real-world deployment of AI vision models is both fueled and limited by the data available for training and testing. Real datasets are sparse and uneven: long-tailed or unbalanced distributions hinder generalization, and the low number of samples in low-density regions makes it hard to run evaluations. Synthetic data can fill these gaps, providing us with a way to sample the input space more continuously and improve data coverage for benchmarks. Focusing on the safety-critical autonomous-driving case of pedestrian detection in the dark, we show how synthetic low-light samples can be used to better characterize the performance of a state-of-the-art object detection model as a function of the scene illumination. We use a synthetic RAW image augmentation technique to generate low-light samples that match the noise model of the camera sensor. Performance metrics on real and synthetic low-light data are similar, indicating that the AI model finds it hard to distinguish between them.

[CV-43] Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Link: https://arxiv.org/abs/2605.22446
Authors: Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
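The abstract names a focal classification term as the way class imbalance is handled. Below is a minimal sketch of the standard binary focal loss, assuming the commonly used defaults `gamma=2.0` and `alpha=0.25`. The paper's actual hyperparameters, and how this term is weighted against advantage regression and soft-threshold calibration, are not stated in the abstract, so none of that is modeled here.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class.
    y: ground-truth label, 0 or 1.
    gamma and alpha are assumed defaults, not values from the paper.
    """
    p = min(max(p, 1e-7), 1 - 1e-7)          # clamp to avoid log(0)
    p_t = p if y == 1 else 1 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# An easy positive contributes far less than a hard positive.
print("easy positive:", focal_loss(0.95, 1))
print("hard positive:", focal_loss(0.30, 1))
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes the effect of the focusing term easy to isolate.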

[CV-44] Moment-Reenacting: Inverse Motion Degradation with Cross-shutter Guidance

Link: https://arxiv.org/abs/2605.22423
Authors: Ji Xiang, Lin Guixu, Yin Zhengwei, Zhao Jiancheng, Zheng Yinqiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TPAMI

Abstract:Motion degradation, manifested as blur in global shutter (GS) images or rolling shutter (RS) distortion in RS counterparts, remains a fundamental challenge in computational imaging, especially under fast motion or low-light conditions. While prior works have treated blur decomposition and RS temporal super-resolution as separate tasks, this separation fails to exploit their intrinsic complementarity. In this paper, we propose a unified framework to invert motion degradation and reenact the imaging moment by jointly leveraging the complementary characteristics of GS blur and RS distortion. To this end, we introduce a novel dual-shutter setup that captures synchronized blur-RS image pairs and demonstrate that this combination effectively resolves temporal and spatial ambiguities inherent in both modalities. To allow flexible performance-cost trade-offs, we further extend this dual-shutter setup to a stereo Blur-RS configuration with a narrow baseline. In addition, we construct a triaxial imaging system to collect a real-world dataset with aligned GS-RS pairs and ground-truth high-speed frames, enabling robust training and evaluation beyond synthetic data. Our proposed network explicitly disentangles motion into context-aware and temporally-sensitive representations via a dual-stream motion interpretation module, followed by a self-prompted frame reconstruction stage. Extensive experiments validate the superiority and generalizability of our approach, establishing a new paradigm for realistic high-speed video reconstruction under complex motion degradations. Code and more resources are available at this https URL.

[CV-45] FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers

Link: https://arxiv.org/abs/2605.22422
Authors: Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Table structure recognition (TSR) requires both table-level coherence (row/column counts, headers, spanning cells) and precise separator localization. We introduce FastTab, a grid-centric TSR model that avoids autoregressive HTML decoding by combining (i) a lightweight Tiny Recursive Module (TRM) for global reasoning and (ii) axial 1D Transformer encoders that capture long-range dependencies along rows and columns. The model predicts row/column counts, header rows, and separators to construct a grid, then infers rowspan/colspan using ROI-aligned cell features. Across four benchmarks (PubTabNet, FinTabNet, PubTables-1M, and SciTSR), FastTab achieves competitive structure recovery performance with low-latency inference. We further study robustness under pixel-level anonymisation and show an extension to curved separators for camera-captured documents. The source code will be made publicly available at this https URL.

[CV-46] Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction ICRA2026

Link: https://arxiv.org/abs/2605.22420
Authors: Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: ICRA 2026. Project page: this https URL

Abstract:Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting their applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints (e.g., lane changes). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

[CV-47] The Neglected Baseline in Model Interpretation

Link: https://arxiv.org/abs/2605.22417
Authors: Yongjin Cui, Xiaohui Fan
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments:

Abstract:We observe that existing model interpretation methods generally ignore the baseline, and such neglect often results in imprecise or even incorrect interpretation. In this paper, we reformulate the task of model interpretation and the principles for judging interpretation results to demonstrate the importance of the baseline. We further unify gradient-based methods, Integrated Gradients (IG) methods, and Taylor expansion, clarifying the connections among them and explicitly identifying the baseline for each method. On this basis, we analyze the flaws and errors in related model interpretation methods (IG, LayerCAM, ODAM, Difference Map). We advocate evaluating the quality of model interpretation results precisely through the attribution error between the attribution result and the attribution target, rather than adopting flawed evaluation methods, such as those based on marginal effects or the assumption of perfect model performance. We revise IG and develop a model interpretation method with a clear and reasonable baseline, achieving better results. Our method supports model interpretation based on features from any layer. Interpretations based on features from different layers are all reasonable, and the differences among these results reflect varying degrees of feature extraction at different feature extraction stages.

[CV-48] Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

Link: https://arxiv.org/abs/2605.22414
Authors: Xingyue Wang, Bo Liu, Meng Wang, Zhixuan Zhang, Chengcheng Zhu, Huazhu Fu, Jiang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 meticulously annotated image-level lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated spanning four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general and medical large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.
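The ETDRS grid mentioned in the abstract divides the macula into nine regions: a central subfield plus inner and outer rings, each split into four quadrants. The sketch below maps a lesion position to one of those regions using the standard ring diameters of 1, 3, and 6 mm. The coordinate convention (fovea at the origin, millimetre units, +y pointing up), the `eye` argument, and the `"outside"` label are assumptions made for illustration; the abstract does not describe how the benchmark handles these details.

```python
import math

# Standard ETDRS ring radii in millimetres (diameters of 1, 3 and 6 mm).
CENTER_R, INNER_R, OUTER_R = 0.5, 1.5, 3.0


def etdrs_region(x_mm, y_mm, eye="OD"):
    """Return the ETDRS region containing a point.

    Assumed convention: fovea at (0, 0), +x toward the right of the fundus
    image, +y toward the top. eye is "OD" (right) or "OS" (left).
    """
    r = math.hypot(x_mm, y_mm)
    if r <= CENTER_R:
        return "central"
    if r > OUTER_R:
        return "outside"  # hypothetical label for points beyond the grid
    ring = "inner" if r <= INNER_R else "outer"

    angle = math.degrees(math.atan2(y_mm, x_mm)) % 360
    if 45 <= angle < 135:
        quadrant = "superior"
    elif 225 <= angle < 315:
        quadrant = "inferior"
    else:
        on_image_right = angle < 45 or angle >= 315
        # Assumption: in a right-eye fundus photo the optic disc (nasal side)
        # lies to the right of the fovea; the left eye is mirrored.
        nasal_is_right = eye == "OD"
        quadrant = "nasal" if on_image_right == nasal_is_right else "temporal"
    return f"{ring}_{quadrant}"


print(etdrs_region(1.0, 0.0, eye="OD"))
print(etdrs_region(-2.0, 0.5, eye="OS"))
```

Mapping every lesion to a named region like this is what allows questions and answers to refer to locations in a standardized vocabulary rather than raw pixel coordinates.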

[CV-49] From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

Link: https://arxiv.org/abs/2605.22413
Authors: Yandi Wang, Libin Zhan, Ziwei Huang, Tiancheng Luo, Yuxuan Jiang, Wang Dong, Leilei Gan, Jun Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at this https URL.

[CV-50] Translating Signals to Languages for sEMG-Based Activity Recognition

Link: https://arxiv.org/abs/2605.22403
Authors: Ming Wang, Haoxuan Qu, Qiuhong Ke, Wei Zhou, Hossein Rahmani, Jun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Within this framework, we design a language-oriented mapping mechanism that converts continuous sEMG sequences into sEMG language, integrating several strategies to further facilitate the signal-to-language mapping process. Extensive experiments demonstrate that the proposed framework achieves highly accurate sEMG signal-based activity recognition using large language models.

[CV-51] AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture

Link: https://arxiv.org/abs/2605.22366
Authors: Zi Ye, Yibin Wen, Xiaoya Fan, Xinyu Zhang, Jing Wu, Kun Zeng, Zurong Mai, Jiarui Zhang, Bohan Shi, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at this https URL.

[CV-52] GazePrior: Zero-Shot AR/VR Eye Tracking via Learned 3D Gaze Reconstruction

Link: https://arxiv.org/abs/2605.22359
Authors: Corentin Dumery, David Colmenares, Alexander Fix, Pascal Fua, Ali Behrooz, Jogendra Kundu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Eye tracking (ET) is a foundational technology for advanced AR/VR applications. However, training ET models for every new ET device is challenging: real data collection is costly and time-consuming, while existing synthetic data generation methods lack realism. To remove the need for additional data collection while maintaining data quality, we introduce a data-driven 3D prior that models the distribution of human eyes across diverse identities, gaze directions, and light settings. This model, which we coin GazePrior, then enables sparse-input 3D reconstruction of annotated data collected with previous ET devices, which can in turn be rendered from the cameras of any target ET device. Our approach synthesizes data with the realism, diversity and ground-truth accuracy of real data collection without its prohibitive costs. Our experiments demonstrate that ET models trained with our synthesized data outperform previous zero-shot methods, achieving higher accuracy and robustness.

[CV-53] VEELA: A Clinically-Constrained Benchmark for Liver Vessel Segmentation in Computed Tomography Angiography

Link: https://arxiv.org/abs/2605.22357
Authors: Ziya Ata Yazıcı, N. Sinem Gezer, İlkay Öksüz, İlker Özgür Koska, Tuğçe Toprak, Pervin Bulucu, Ufuk Beşenk, A. Emre Kavur, Pierre-Henri Conze, Hazım Kemal Ekenel, Oğuz Dicle, Mustafa Ege Şeker, Mustafa Said Kartal, Ariorad Moniri, Orhan Özkan, Osman Faruk Bayram, Hakan Polat, Musa Balcı, Ece Tuğba Cebeci, Baran Cılga, Kardelen Peçenek, M. Alper Selver
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 25 figures, 5 tables

Abstract:Accurate segmentation of hepatic and portal vessels in contrast-enhanced computed tomography angiography (CTA) remains challenging due to complex vascular topology, peripheral visibility limitations, and acquisition-induced ambiguities. While existing public datasets offer valuable benchmarks, few include clinically realistic annotation constraints. We introduce VEELA (Vessel Extraction and Extrication for Liver Analysis), a rigorously curated liver vessel dataset derived from 40 CTA scans inherited from the CHAOS grand-challenge cohort. All vessels were manually delineated slice-by-slice under multi-expert consensus, using a strict visibility-driven annotation policy and avoiding anatomically inferred interpolation. This design explicitly captures anatomical variability and imaging-related uncertainty. As a continuation of the CHAOS challenge, VEELA enables reproducible cross-benchmark evaluation while extending the scope to fine-grained hepatic and portal vessel segmentation. We further establish a standardized benchmarking framework and analyze complementary evaluation metrics, including topology-aware (clDice), overlap-based (IoU), boundary-sensitive (NSD), and geometry-aware (area, length) measures. Our results demonstrate that different metrics capture distinct aspects of vascular integrity, underscoring the necessity of multi-perspective evaluation for clinically meaningful vessel segmentation. VEELA is publicly released to facilitate reproducible research and support the development of robust vascular segmentation methods. Researchers can access the evaluation metrics, dataset, and submission platform at this https URL.
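Among the metric families listed in the abstract, the overlap-based ones are the simplest to state. The sketch below computes IoU, together with the closely related Dice coefficient, for two binary masks given as flat sequences of 0/1 values. The flat-list mask representation and the convention that two empty masks score 1.0 are assumptions for illustration; the topology-aware (clDice) and boundary-sensitive (NSD) measures need skeletonization and surface distances and are not covered here.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + target_size)
    iou = intersection / union
    return dice, iou


# Toy 1D "masks": one voxel overlaps, one is a false positive, one is missed.
prediction = [1, 1, 0, 0]
ground_truth = [1, 0, 1, 0]
dice, iou = dice_and_iou(prediction, ground_truth)
print("Dice:", dice, "IoU:", iou)
```

Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so they always rank methods identically on a single case. That is one reason a benchmark pairs an overlap score with topology- and boundary-oriented measures rather than reporting several overlap scores.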

[CV-54] QuantSR: Pushing the Limit of Quantized Image Super-Resolution Networks

Link: https://arxiv.org/abs/2605.22351
Authors: Haotong Qin, Xudong Ma, Xianglong Liu, Jie Luo, Jinyang Guo, Michele Magno, Yulun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-bit quantization is widely used to compress super-resolution (SR) models and reduce storage and computation costs for deployment on resource-limited devices. However, when SR models are pushed to ultra-low precision (2-4 bits), performance can drop sharply due to diminished representational capacity and the detail-sensitive nature of SR. To address these issues, we propose QuantSR+, a unified framework that improves quantization operators, network design, and training optimization, achieving better trade-offs between accuracy and efficiency than prior low-bit SR methods. QuantSR+ mainly relies on three technical contributions: (1) Redistribution-driven Bit Determination (RBD), which reshapes quantization distributions in both forward and backward passes to preserve representation fidelity; (2) Quantized Slimmable Architecture (QSA), which begins with an over-parameterized model and progressively prunes less critical blocks to meet efficiency budgets while maximizing accuracy; and (3) Slimming-guided Function-localized Distillation (SFD), which enforces block-aware feature alignment via a direct loss and a progressive, function-local training schedule to capture quantization effects better and speed up convergence. Extensive experiments show that QuantSR+ achieves state-of-the-art performance against both specialized quantized SR methods and generic quantization approaches. For SwinIR-S on Urban100 (×4), it improves PSNR by 0.29 dB over the 2-bit SOTA baseline. Meanwhile, it delivers strong efficiency gains at 2-bit, reducing operations by up to 87.9% and storage by 89.4%. QuantSR+ is effective for both convolutional and transformer-based SR models, indicating broad applicability.
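To see why 2-4 bit precision is so restrictive, it helps to look at plain uniform quantization, where a b-bit tensor can take only 2^b distinct values. The sketch below is a generic min-max uniform quantizer written for illustration. It is not the RBD operator described in the abstract, and the function name `quantize_uniform` is hypothetical.

```python
def quantize_uniform(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels
    spanning [min(values), max(values)]."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # a constant input needs no quantization
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [i / 10 for i in range(11)]  # toy values from 0.0 to 1.0

for bits in (2, 4, 8):
    quantized = quantize_uniform(weights, bits)
    print(bits, "bits:",
          len(set(quantized)), "distinct values,",
          "mean abs error", mean_abs_error(weights, quantized))
```

At 2 bits only four levels are available, so fine differences between nearby values collapse. This loss of detail is what the abstract identifies as especially harmful for super-resolution.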

[CV-55] Bernini: Latent Semantic Planning for Video Diffusion

Link: https://arxiv.org/abs/2605.22344
Authors: Bernini Team: Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Project Page: this https URL

Abstract:Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM’s pretrained understanding translating into strong generalization on challenging editing tasks.

[CV-56] 4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting

Link: https://arxiv.org/abs/2605.22342
Authors: Sifan Zhou, Hang Zhang, Yuhang Wang, Ming Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages main paper, 7 figures, 18 pages in total

Abstract:While 4D Gaussian Splatting (4DGS) has revolutionized high-fidelity dynamic reconstruction, safeguarding the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often neglect the underlying kinematic manifolds, triggering non-physical artifacts such as severe temporal flickering and “FVD collapse”. To address this, we propose 4D-GSW, a kinematic-aware watermarking framework designed to embed robust copyright information while preserving high spatio-temporal consistency. Unlike prior 4D steganography that primarily focuses on opacity-guided invisibility, our approach explicitly addresses the physical coherence of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify “Dynamic Instants,” adaptively gating watermark gradient injection to shield critical motion manifolds from non-physical perturbations. To ensure global coherence across complex deformations, we formulate a joint HMM-MRF energy minimization model that synchronizes watermark phases within both temporal trajectories and spatial neighborhoods. Furthermore, an anisotropic gradient routing mechanism ensures that watermark embedding remains strictly decoupled from photometric reconstruction fidelity. Extensive experiments have demonstrated the superior performance of our method in robustly hiding watermarks while resisting various attacks and maintaining high rendering quality and spatiotemporal consistency.

[CV-57] 3D LULC classification using multispectral LiDAR and deep learning: current and prospective schemes

Link: https://arxiv.org/abs/2605.22328
Authors: Narges Takhtkeshha, Aldino Rizaldy, Markus Hollaus, Juha Hyyppä, Fabio Remondino, Gottfried Mandlburger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Land Use Land Cover (LULC) classification is essential for national 3D mapping, geospatial analysis, and sustainable planning. Multispectral (MS) LiDAR provides synchronized spatial-spectral information, and deep learning (DL) enables 3D point cloud semantic segmentation; however, adoption is limited by the lack of publicly available urban and suburban MS LiDAR datasets aligned with National Mapping and Cadastral Agencies (NMCAs) classification schemes. This study addresses these gaps by introducing L1 and L2 NMCA-aligned LULC classification schemes and a new benchmark MS LiDAR dataset. We evaluate seven state-of-the-art DL models and perform spectral ablation studies at both levels of detail. Results show that Point Transformer V3 achieves the best performance, with mIoU of 79.4% (L1, 8 classes) and 58.9% (L2, 20 classes) using a dual-wavelength LiDAR system (532 nm and 1064 nm). Ablation results show that multispectral information improves performance over geometry-only inputs, with gains of 1.1 percentage points at L1 and 7.8 points at L2. These results highlight the value of LiDAR reflectance for fine-grained material discrimination and support the evolution of NMCA LULC schemes toward higher semantic detail. The Loosdorf-MSL dataset contributes a new benchmark for consistent national and international LULC mapping.

[CV-58] Robustness of breast lesion segmentation under MRI undersampling improves with k-space-aware deep learning

Link: https://arxiv.org/abs/2605.22327
Authors: Lukas T. Rotkopf, Marco Schlimbach, Julius C. Holzschuh, Heinz-Peter Schlemmer, Jens Kleesiek, Moritz Rempe
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Purpose: To assess whether breast lesion segmentation can be learned directly from acquired MRI k-space, and whether doing so improves robustness when data are accelerated or noisy. Materials and Methods: This retrospective study used public breast dynamic contrast-enhanced MRI (DCE-MRI) datasets with acquired and synthetic k-space, together with a within-dataset synthetic control. We compared four 3D U-Net variants: a hybrid k-space-to-image model, a native k-space model, and magnitude and complex image-space baselines. Models were evaluated under increasing undersampling and added complex Gaussian k-space noise. The primary outcome was patient-level Dice similarity coefficient under cross-validation, with the hybrid model prespecified as the main comparison against the magnitude image-space baseline. Results: At full sampling, the hybrid and image-space models performed similarly. As acceleration increased, the hybrid model retained substantially more segmentation accuracy and significantly outperformed the magnitude image-space baseline across moderate to high undersampling levels. The same pattern was observed when noise was added directly to k-space: the hybrid model degraded more slowly, whereas the image-space baseline failed under heavier noise. This advantage was reproduced in the within-dataset synthetic control. Feature analysis suggested that the k-space stage and image-space stage played complementary roles, with frequency-domain filtering concentrated before image-domain lesion localization. Conclusion: K-space-aware deep learning improves the robustness of breast lesion segmentation under MRI undersampling and k-space noise, while matching image-space methods at full sampling. 

[CV-59] PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models

Link: https://arxiv.org/abs/2605.22311
Authors: Jose Edgar Hernandez Cancino Estrada, Mauro Díaz Lupone, Žiga Emeršič, Vitomir Štruc, Peter Peer, Darian Tomašević
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Identity-conditioned diffusion models enable high-quality and identity-consistent face generation, but they also raise severe privacy concerns, as models may continue to synthesize individuals despite their right to be forgotten. While machine unlearning has been extensively studied for concept and data removal, identity unlearning remains largely unexplored, particularly in models conditioned directly on identity embeddings rather than text prompts. In this work, we study identity unlearning in Arc2Face, a state-of-the-art identity-conditioned latent diffusion model for face generation, and introduce Proximity-guided Identity Unlearning (PIU), an anchor-guided framework for identity unlearning. Specifically, we formulate identity removal as an identity replacement objective that reassigns the source identity to a selected anchor identity in the learned identity space, and we complement it with a proximity-based anchor selection strategy motivated by the geometry of ArcFace representations. We further show that effective unlearning can be achieved through localized fine-tuning of a small subset of identity-sensitive cross-attention layers. Experiments across many target identities show that our framework effectively suppresses generation of the target identity while preserving realism and identity consistency for retained identities, as validated by improved performance on unlearning and image-quality metrics, together with qualitative evaluation. The source code for the PIU framework is publicly available at this https URL.

[CV-60] Detection of Virus and Small Cell Patches in Foci Images Using Switchable Convolution and Feature Pyramid Networks

Link: https://arxiv.org/abs/2605.22290
Authors: Amrita Singh, Snehasis Mukherjee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate detection and counting of virus patches in focus-forming unit (FFU) images, also known as foci images, are important for quantifying viral infection and analyzing cellular structures. This task is challenging because biomedical targets often vary substantially in size, density, contrast, and shape. In this paper, we propose an enhanced YOLOv2-based detector that integrates a Feature Pyramid Network (FPN) to improve multi-scale feature representation. We also incorporate a switchable atrous convolution mechanism to adapt the receptive field for fine-grained targets in dense microscopy images. The proposed method is evaluated on biomedical foci image datasets for virus patch and small cell patch detection. For small cell patch detection, the model achieves a mean average precision (mAP) of 40.5% at a 25% Intersection over Union (IoU) threshold. For FFU virus patch detection, the model achieves an mAP of 68%. These results indicate that combining FPN-based feature fusion with switchable convolution improves the suitability of YOLOv2 for specialized biomedical object detection tasks.
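The 25% IoU threshold quoted for small cell patches determines when a predicted box counts as a correct detection. The sketch below shows that matching rule for axis-aligned boxes, assuming an `(x1, y1, x2, y2)` corner format. The helper names and the example boxes are hypothetical, and the full mAP computation (ranking by confidence, precision-recall integration) is not reproduced.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    inter_x1, inter_y1 = max(a[0], b[0]), max(a[1], b[1])
    inter_x2, inter_y2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero so disjoint boxes give no overlap.
    intersection = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.25):
    """A prediction counts as a hit when its IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold


predicted = (0, 0, 10, 10)
annotated = (5, 0, 15, 10)
print("IoU:", box_iou(predicted, annotated))
print("match at 0.25:", is_match(predicted, annotated, 0.25))
print("match at 0.50:", is_match(predicted, annotated, 0.50))
```

A looser threshold such as 0.25 is a common choice for very small objects, where a shift of a few pixels already removes a large share of the overlap.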

[CV-61] Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability

Link: https://arxiv.org/abs/2605.22273
Authors: Xiang Chen, Yuxian Dong, Chao Li, Chengyin Hu, Jiaju Han, Fengyu Zhang, Yiwei Wei, Jiahuan Long, Jiujiang Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.

[CV-62] Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

Link: https://arxiv.org/abs/2605.22272
Authors: Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker’s search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.

[CV-63] MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering CVPR’26

Link: https://arxiv.org/abs/2605.22269
Authors: Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: To appear at CVPR’26. Code is available at this https URL

Abstract:Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.

[CV-64] Impact of Atmospheric Turbulence and Pointing Error on Earth Observation

Link: https://arxiv.org/abs/2605.22268
Authors: Celia Sánchez-de-Miguel, Antonio M. Mercado-Martínez, Beatriz Soret, Antonio Jurado-Navas, Miguel Castillo-Vázquez
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference

Abstract:Earth Observation (EO) imagery is often degraded by atmospheric turbulence and pointing jitter; yet, these effects are rarely considered in datasets used to train AI-based detection models. Based on prior work, this paper presents an enhanced image simulator that enables the incorporation of vertical-path atmospheric turbulence and satellite pointing jitter, arising from platform and sensor vibrations, to generate physically realistic distorted images. As a case study, vessel detection is evaluated using YOLOv8 and RetinaNet on images generated by the proposed simulator under different levels of turbulence and pointing errors. Results show that YOLOv8 recall decreases from 91% under ideal conditions to 60% in the presence of weak turbulence, and falls below 40% under strong turbulence or jitter. In contrast, RetinaNet demonstrates greater robustness, maintaining approximately 75% recall across degraded conditions. These results highlight the importance of incorporating realistic physical degradations into EO training datasets to ensure reliable performance of AI-based models in operational environments, as demonstrated in maritime surveillance applications.

[CV-65] An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion

Link: https://arxiv.org/abs/2605.22259
Authors: Jan Nausner, Michael Hubner
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 6 pages, 1 figure; © 2026 The Authors. Submitted to the 2026 IEEE International Conference on Multisensor Fusion and Integration (MFI 2026). Under review

Abstract:Heterogeneous sensor fusion is vital for detecting, localizing, and classifying CBRNE threats. However, individual sensors are often only capable of detecting a subset of relevant threats with varying reliability or can even provide only indirect threat indications, making threat classification challenging. Furthermore, high clutter rates on the sensor side present a great challenge for fusion systems. Additionally, the limited availability of high quality datasets hinders the advancement of learning-based detection and classification models in smart sensors. To mitigate these sensor related shortcomings, a context-aware and domain knowledge-enhanced fusion process is proposed. First, a novel evidence hierarchy is established that enables modeling of direct, indicative, and contextual information. Second, contextual information about the environment is introduced into the fusion process, by collecting, processing, and exploiting OSINT inputs. Third, all levels of the evidence hierarchy are used to craft a Bayesian threat type classification mechanism with domain knowledge-informed priors. The proposed methodology is evaluated in simulated scenarios, and the results demonstrate the benefit of the proposed fusion approach in terms of robustness to clutter and prior mismatch, with an overall classification accuracy of up to 95%.
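
The idea of combining domain-informed priors with several levels of evidence can be sketched as a plain Bayesian update. This is a minimal illustration only: it assumes the evidence sources are conditionally independent given the class (a naive-Bayes simplification the abstract does not state), and the class names and probabilities in the usage lines are invented for the example.

```python
def posterior(prior, likelihoods):
    """Bayesian update over threat classes.

    prior:       dict mapping class -> P(class), e.g. informed by domain knowledge
    likelihoods: list of dicts mapping class -> P(evidence | class), one per
                 evidence source (assumed conditionally independent here)
    """
    unnormalized = dict(prior)
    for likelihood in likelihoods:
        for cls in unnormalized:
            unnormalized[cls] *= likelihood[cls]
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every class")
    return {cls: value / total for cls, value in unnormalized.items()}


# Hypothetical numbers: one direct sensor reading and one contextual cue.
prior = {"chemical": 0.2, "benign": 0.8}
direct = {"chemical": 0.9, "benign": 0.1}
contextual = {"chemical": 0.6, "benign": 0.4}
result = posterior(prior, [direct, contextual])
```

With these made-up values the posterior for "chemical" is 0.108 / 0.14, roughly 0.77, even though the prior favoured "benign".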

[CV-66] D3Seg: Dependency-Aware Diffusion for Brain Tumor Segmentation with Missing Modalities

Link: https://arxiv.org/abs/2605.22249
Authors: Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate brain tumor segmentation using multiparametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model D3Seg which is designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and probability-space decision refinement to mitigate dominant class overconfidence and improve delineation of underrepresented tumor subregions. Extensive evaluation on BraTS 2023 dataset demonstrates that our D3Seg model consistently improves segmentation performance under missing modality configurations. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing modality configurations compared to the current state-of-the-art model, while maintaining computational efficiency.

[CV-67] REACH: Hand Pose Estimation from Room Corners

Link: https://arxiv.org/abs/2605.22231
Authors: Shu Nakamura, Ryo Kawahara, Genki Kinoshita, Ryosuke Hirai, Yasutomo Kawanishi, Shohei Nobuhara, Ko Nishino
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce a novel 3D hand pose estimator that can accurately recover the shape and pose of people’s hands in a room from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. Our key idea is to fully leverage hand-body coordination, its temporal progression, and multiview observations. We achieve this with a novel Transformer-based model, in which hand and body configurations are modeled through correlations between their visual features expressed as per-view tokens, and their temporal coordination is exploited in an autoregressive manner. We introduce a novel dataset, which we refer to as REACH, Room-Environment dataset Annotated with Chest cameras for Hand pose estimation, to train and test our method. REACH is a first-of-its-kind large-scale hand pose dataset that captures accurate hand movements of 50 participants across a wide variety of daily activities. In order to avoid interfering with natural movements while annotating the hands with accurate shape and pose, we leverage concealed chest cameras. Through extensive experiments, including comparative studies with existing methods, we show that our model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar. These results broaden the horizon of 3D hand pose estimation, especially towards “in-the-wild” continuous human behavior analysis.

[CV-68] A Robust Semantic Segmentation Pipeline for the CVPR 2026 8th UG2 Challenge Track 2

Link: https://arxiv.org/abs/2605.22216
Authors: Jinming Chai, Libo Yan, Licheng Jiao, Fang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This report presents our solution for the WeatherProof Dataset Challenge, namely CVPR 2026 8th UG2+ Challenge Track 2: Semantic Segmentation in Adverse Weather. For the semantic segmentation task under adverse weather conditions, we propose a semi-supervised segmentation pipeline. Our method is trained exclusively on the WeatherProof dataset, without using any additional external data. Specifically, we adopt UniMatch V2 as the baseline model and treat all degraded-weather images as unlabeled data for semi-supervised training, thereby fully exploiting the data distribution provided by the challenge. During inference, we further apply test-time augmentation to improve the robustness and segmentation accuracy of the final predictions. The code is publicly available at: this https URL.
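
Test-time augmentation, as mentioned in this abstract, can be sketched as follows. The abstract does not say which augmentations the authors use, so the horizontal flip below is an assumption, and `toy_model` is a hypothetical stand-in for a real segmentation network.

```python
def hflip(grid):
    """Reverse every row of a 2D grid (pixels or per-pixel score lists)."""
    return [row[::-1] for row in grid]


def tta_predict(model, image):
    """Average per-pixel class scores over the original and the flipped view."""
    scores = model(image)
    # Predict on the flipped image, then flip the scores back into place.
    scores_flipped = hflip(model(hflip(image)))
    return [
        [[(a + b) / 2 for a, b in zip(p, q)] for p, q in zip(row, row_f)]
        for row, row_f in zip(scores, scores_flipped)
    ]


def argmax_labels(scores):
    """Pick the highest-scoring class index for every pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in scores]


def toy_model(image):
    # Hypothetical model: two class scores per pixel, with a column-dependent bias
    # so that the original and flipped views disagree slightly.
    return [[[v + 0.1 * j, 1.0 - v] for j, v in enumerate(row)] for row in image]


labels = argmax_labels(tta_predict(toy_model, [[0.2, 0.8]]))
```

Averaging the two views cancels part of the position-dependent bias in `toy_model`, which is the effect TTA is meant to have on a real network.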

[CV-69] GALAR-TemporalNet v2: Anatomy-Guided Dual-Branch Temporal Classification with Bidirectional Mamba and Dual-Graph GCN for Video Capsule Endoscopy – after competition results ICPR2026

Link: https://arxiv.org/abs/2605.22209
Authors: Jiye Won(1), Seangmin Lee(1), Soon Ki Jung(1) ((1) Kyungpook National University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures. Post-competition preprint for the ICPR 2026 RARE-VISION Challenge

Abstract:Video Capsule Endoscopy (VCE) poses a challenging multi-label temporal classification problem, requiring simultaneous localization of 8 anatomical regions and detection of 9 pathological findings across tens of thousands of frames. We present GALAR-TemporalNet v2, a hierarchical temporal model that addresses three core challenges: extreme class imbalance, long-range temporal dependencies, and pathology–anatomy entanglement. Our architecture combines windowed self-attention for local modeling, a Dual-Graph GCN for global frame relationships, and Bidirectional Mamba for selective boundary context encoding. A novel anatomy prototype residual pathway decouples pathological deviation signals from normal organ appearance, and a frame-level GCN skip connection stabilizes training of visually confusable rare classes. The competition version, GALAR-TemporalNet, achieved an overall mAP@0.5 of 0.2644 and mAP@0.95 of 0.2353 on the RARE-VISION test set. Following the competition, the redesigned GALAR-TemporalNet v2 – incorporating a restructured pathology branch, refined loss functions, and extended post-processing – improved these results to mAP@0.5 of 0.3409 and mAP@0.95 of 0.3333.

[CV-70] EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning

Link: https://arxiv.org/abs/2605.22208
Authors: Kailin Zhuang, Jiawei Wu, Zhi Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multimodal Large Language Model (MLLM)-driven image restoration agents demonstrate effectiveness in degradation coupling scenarios by flexibly selecting tools and determining removal orders. However, their zero-shot planning often fails without experience, necessitating severe trial-and-error overhead to achieve satisfactory outcomes. Currently, two paradigms are employed to address this issue, yet a dilemma persists: Training-based methods embed intrinsic experience into parameters, achieving high inference efficiency but lacking compatibility with new tools or degradation. In contrast, training-free methods utilize explicit experience storage for compatibility but still incur trial-and-error overhead due to naive experience. To resolve the dilemma, we propose EvoIR-Agent, which first systematically formulates the experience components of a training-free image restoration agent. Subsequently, a hierarchical experience pool is constructed, which enables coarse-to-fine guidance for diverse tools and removal orders. Furthermore, a self-evolving mechanism is introduced to update the pool from scratch using accumulated records, thereby greatly improving performance and efficiency. Extensive experiments reveal that EvoIR-Agent achieves a significant lead in the full reference metrics and yields a remarkable Pareto-optimal balance between performance and efficiency compared to the state-of-the-art methods.

[CV-71] Zero-Shot Temporal Action Localization Through Textual Guidance

Link: https://arxiv.org/abs/2605.22201
Authors: Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to FG 2026

Abstract:Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, “Textual Guidance for finer localization of actions in videos” (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.

[CV-72] OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

Link: https://arxiv.org/abs/2605.22200
Authors: Hanna Hoffmann, Setareh Bady, Claas de Boer, Max Kirchner, Jan Egger, Rainer Röhrig, Frank Hölzle, Lennart Johannes Gruber, Kunpeng Xie, Marlon Neuhaus, Victor Alves, Guilherme Barbosa, Leonardo Barroso, João Carvalho, Hao Chen, Gabriella d’Albenzio, André Ferreira, Nuno Gomes, Yuichiro Hayashi, Kousuke Hirasawa, Rebecca Hisey, Seungjae Hong, Seoi Jeong, Tiago Jesus, Daehong Kang, Satoshi Kasai, Shunsuke Kikuchi, Takayuki Kitasaka, Satoshi Kondo, Hyoun-Joong Kong, Youngbin Kong, Atsushi Kouno, Shlomi Laufer, Kyu Eun Lee, Bining Long, Nooshin Maghsoodi, Hiroki Matsuzaki, Evangelos Mazomenos, Ori Meiraz, Kensaku Mori, Marina Music, Masahiro Oda, Roi Papo, Jieun Park, Rafael Piexoto, Saeid Rezaei, Mariana Ribeiro, Soyeon Shin, Yang Shu, Idan Smoller, Danail Stoyanov, Yihui Wang, Xinkai Zhao, Sebastian Bodenstedt, Isabel Funke, Stefanie Speidel, Behrus Hinrichs-Puladi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Stefanie Speidel and Behrus Hinrichs-Puladi jointly supervised this work. Submitted to MEDIA

Abstract:Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.

[CV-73] Ultra-High-Definition Image Quality Assessment via Graph Representation Learning

Link: https://arxiv.org/abs/2605.22192
Authors: Shaode Yu, Enqi Chen, Ming Huang, Xuemin Ren, Songnan Zhao, Zhicheng Zhang, Qiurui Sun
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Blind image quality assessment (BIQA) for ultra-high-definition (UHD) images remains challenging because native-resolution inference is computationally expensive, whereas aggressive resizing or isolated cropping may suppress scale-sensitive distortions and weaken the relationship between local artifacts and global scene context. This paper aims to improve UHD-BIQA by explicitly modeling the structural dependencies among sampled image regions rather than treating them as independent views, and a graph representation learning framework UHD-GCN-BIQA is proposed. The framework samples aspect-ratio-aligned patches from each UHD image, encodes them as graph nodes, and constructs a hybrid k-nearest-neighbor graph using spatial proximity and feature similarity. Residual graph convolution is used to propagate contextual information across regions, and gated attention pooling aggregates patch-level evidence into an image-level quality prediction. An exponential moving average normalized multi-objective loss function is adopted to stabilize the joint optimization of regression, correlation, and ranking objectives. Experiments on the UHD-IQA benchmark show that UHD-GCN-BIQA achieves PLCC = 0.7784, SRCC = 0.8019, and RMSE = 0.0519, obtaining competitive correlation performance and the lowest RMSE among the compared methods. These results indicate that graph-based region relation modeling is effective for UHD image quality assessment, particularly for improving absolute quality score estimation under high-resolution visual content.
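
One plausible reading of the "hybrid k-nearest-neighbor graph" is the union of a spatial kNN graph and a feature-similarity kNN graph. The sketch below follows that reading; the abstract does not specify how the two criteria are combined, so the union rule, the cosine similarity, and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_knn_graph(positions, features, k):
    """Undirected edge set: each patch links to its k spatially nearest patches
    and to its k most feature-similar patches."""
    n = len(positions)
    edges = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        spatial = sorted(others, key=lambda j: math.dist(positions[i], positions[j]))[:k]
        similar = sorted(others, key=lambda j: -cosine(features[i], features[j]))[:k]
        for j in spatial + similar:
            edges.add((min(i, j), max(i, j)))
    return edges


# Four hypothetical patches: two spatial clusters, with look-alikes across clusters.
positions = [(0, 0), (1, 0), (10, 0), (11, 0)]
features = [(1, 0), (0, 1), (1, 0.1), (0, 1)]
graph = hybrid_knn_graph(positions, features, k=1)
```

In this toy case the spatial edges connect neighbours within each cluster, while the feature edges connect similar-looking patches across clusters, which is the kind of long-range link a purely spatial graph would miss.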

[CV-74] No Pose No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

Link: https://arxiv.org/abs/2605.22190
Authors: Matteo Balice, Yanik Kunzi, Chenyangguang Zhang, Matteo Matteucci, Marc Pollefeys, Sungwhan Hong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Recent feed-forward 3D Gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps but at minutes-to-hours cost per scene. We introduce NoPo4D, the first feed-forward system that addresses this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage surpasses per-scene optimization methods, while running orders of magnitude faster.
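
The velocity decomposition described in this abstract can be pictured with a pinhole camera: an image-plane shift `(du, dv)` plus a depth change `dz` together determine a 3D displacement in camera space. The sketch below is only a geometric illustration under that assumption; the pinhole model, the intrinsics tuple `K = (fx, fy, cx, cy)`, and the function names are not taken from the paper.

```python
def unproject(u, v, z, K):
    """Back-project pixel (u, v) at depth z into camera-space coordinates."""
    fx, fy, cx, cy = K
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def motion_from_flow_and_depth(u, v, z, du, dv, dz, K):
    """3D displacement implied by a 2D pixel shift (du, dv) and a depth change dz."""
    x0, y0, z0 = unproject(u, v, z, K)
    x1, y1, z1 = unproject(u + du, v + dv, z + dz, K)
    return (x1 - x0, y1 - y0, z1 - z0)


K = (100.0, 100.0, 50.0, 50.0)  # hypothetical intrinsics
sideways = motion_from_flow_and_depth(50, 50, 2.0, 10, 0, 0.0, K)
along_ray = motion_from_flow_and_depth(150, 50, 2.0, 0, 0, 1.0, K)
```

The 2D component `(du, dv)` is the part that can be compared directly with optical flow, so only the depth change needs a separate training signal.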

[CV-75] Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

Link: https://arxiv.org/abs/2605.22186
Authors: Senyan Xu, Zhijing Sun, Kean Liu, Xin Lu, Ruixuan Jiang, Mingyang Huang, Xueyang Fu, Zheng-Jun Zha
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24dB in PSNR and 0.069 in SSIM. The code and dataset are released at this https URL.

[CV-76] Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis ITSC2026

Link: https://arxiv.org/abs/2605.22185
Authors: Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and limited computational budget.

[CV-77] Balancing Uncertainty and Diversity of Samples: Leveraging Diversity of Least High Confidence Samples for Effective Active Learning

Link: https://arxiv.org/abs/2605.22169
Authors: Vipul Arya, S.H. Shabbeer Basha, Srikrishna U N, Sunainha Vijay, Snehasis Mukherjee
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning models, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on various computer vision tasks such as object classification, detection, segmentation, generation, and many more. However, these models are data-hungry as they require more training data to learn millions or billions of parameters. Especially for supervised learning tasks, curating a large number of labeled samples for model training is an expensive and time-consuming task. Active Learning (AL) has been used to address this problem for many years. Existing active learning methods aim at choosing the samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Choosing such samples may hinder the model’s performance as we pool based on one dimension, i.e., either diverse or uncertain. In this paper, we propose four novel hybrid sampling methods for pooling both easy and hard samples, which are also diverse. To verify the efficacy of the proposed methods, extensive experiments are conducted using high and low-confidence samples separately. We observe from our experiments that the proposed hybrid sampling method, Least Confident and Diverse (LCD), consistently performs better compared to state-of-the-art methods. It is observed that selecting uncertain and diverse instances helps the model learn more distinct features. The codes related to this study will be available at this https URL.
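
To make the idea of selecting samples that are both uncertain and diverse concrete, here is a generic hybrid sampler: it shortlists the least confident samples, then picks a diverse subset by greedy farthest-point selection. This is not the paper's LCD algorithm, which the abstract does not spell out; the two-stage design, the `pool_size` parameter, and the Euclidean distance are all assumptions.

```python
import math


def least_confident_and_diverse(probs, features, budget, pool_size):
    """Pick `budget` sample indices for annotation.

    probs:     per-sample class-probability lists from the current model
    features:  per-sample feature vectors used to measure diversity
    pool_size: how many of the least confident samples to shortlist
    """
    if budget <= 0 or not probs:
        return []
    # Stage 1 (uncertainty): lowest top-class probability first.
    order = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    candidates = order[:pool_size]
    # Stage 2 (diversity): repeatedly add the candidate farthest from the chosen set.
    selected = [candidates[0]]
    while len(selected) < min(budget, len(candidates)):
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: min(math.dist(features[c], features[s]) for s in selected),
        )
        selected.append(best)
    return selected


# Hypothetical pool: samples 0 and 1 are near-duplicates, sample 3 is confidently classified.
probs = [[0.5, 0.5], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
features = [(0, 0), (0.1, 0), (5, 5), (9, 9)]
picked = least_confident_and_diverse(probs, features, budget=2, pool_size=3)
```

Here the sampler skips the near-duplicate sample 1 in favour of sample 2, and never considers the confident sample 3 because it falls outside the shortlist.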

[CV-78] ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs ICLR2026

Link: https://arxiv.org/abs/2605.22158
Authors: Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026

Abstract:Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we designed a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows it to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available in this https URL.
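
The "difference is for capturing key events" half of this idea can be sketched as a simple temporal-difference filter: keep a token when it differs enough from the token at the same position in the previous frame. The abstract describes a graph-based method, so this is a deliberately simplified stand-in; the cosine criterion, the 0.9 threshold, and the fixed token positions are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def changed_tokens(frames, threshold=0.9):
    """Return (frame_index, token_index) pairs whose token changed noticeably
    relative to the same position in the previous frame."""
    keep = []
    for t in range(1, len(frames)):
        for p, (prev, cur) in enumerate(zip(frames[t - 1], frames[t])):
            if cosine(prev, cur) < threshold:
                keep.append((t, p))
    return keep


# Three hypothetical frames with two tokens each; only token 1 changes, at frame 1.
frames = [
    [(1, 0), (0, 1)],
    [(1, 0), (1, 0)],
    [(1, 0.01), (1, 0)],
]
kept = changed_tokens(frames)
```

Tokens that stay nearly constant would be handled by the similarity branch instead, which retains a single representative per group.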

[CV-79] Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution

Link: https://arxiv.org/abs/2605.22147
Authors: Jiangwei Mo,Xi Lu,Hanlin Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-resolution remote sensing images (RSIs) are crucial for Earth observation applications, yet acquiring them is often limited by sensor constraints and costs. In recent years, generative super-resolution (SR) methods, particularly diffusion models, have made significant progress. However, they typically require slow iterative inference with 40–1000 steps and exhibit limited flexibility in continuous-scale SR settings. To address these issues, we propose FlowGS, a generative reconstruction framework for arbitrary-scale SR of RSIs. FlowGS models the high-frequency detail representations between high- and low-resolution images and learns a continuous probability flow from noise to detail priors via flow matching (FM) constrained by shortcut consistency, thereby reducing generative complexity and improving inference efficiency. Additionally, we employ 2D Gaussian splatting to construct a continuous feature field, thereby enabling flexible reconstruction at arbitrary query locations. Experimental results show that FlowGS delivers competitive perceptual quality compared with existing methods in both continuous-scale and fixed-scale SR settings, with substantially improved inference efficiency.

[CV-80] One Sentence One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

Link: https://arxiv.org/abs/2605.22144
Authors: Yufei Shi,Weilong Yan,Naixuan Huang,Yucheng Chen,Chenyu Zhang,Tao He,Si Yong Yeo,Ming Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user’s single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience’s immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.

[CV-81] EventGait: Towards Robust Gait Recognition with Event Streams

Link: https://arxiv.org/abs/2605.22139
Authors: Senyan Xu,Shuai Chen,Chuanfu Shen,Kean Liu,Zhijing Sun,Chengzhi Cao,Xueyang Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to illumination and motion sensitivity of conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition. Therefore, we propose EventGait, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structure Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments have shown that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets are released at this https URL.

[CV-82] Accelerating Vision Foundation Models with Drop-in Depthwise Convolution ICPR2026

Link: https://arxiv.org/abs/2605.22132
Authors: Carmelo Scribano,Mohammad Mahdi,Nedyalko Prisadnikov,Yuqian Fu,Giorgia Franchini,Danda Pani Paudel,Marko Bertogna,Luc Van Gool
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICPR 2026

Abstract:Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.

[CV-83] AesFormer: Transform Everyday Photos into Beautiful Memories ICML2026

Link: https://arxiv.org/abs/2605.22126
Authors: Tianxiang Du,Hulingxiao He,Yuxin Peng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo’s aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a two-stage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro. Code is available at this https URL.

[CV-84] MotionDPS: Motion-Compensated 3D Brain MRI Reconstruction

Link: https://arxiv.org/abs/2605.22121
Authors: Antonio Ortiz-Gonzalez,Erich Kobler,Lukas Schletter,Alexander Effland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Magnetic resonance imaging (MRI) is highly susceptible to patient motion due to its relatively long acquisition times and the fact that data are acquired sequentially in k-space. Even small patient movements introduce phase inconsistencies across measurements, leading to severe artifacts such as blurring, ghosting, and geometric distortions that can compromise diagnostic quality. Retrospective motion compensation remains challenging, particularly in accelerated acquisitions, due to the ill-posed nature of the joint reconstruction and motion estimation problem. In this work, we propose a unified Bayesian framework for motion-compensated 3D MRI that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data. Our approach integrates pretrained 3D complex-valued score-based diffusion models as expressive anatomical image priors within a physics-based forward model. Inference is performed by alternating diffusion posterior image updates with efficient proximal optimization steps for motion and coil sensitivity estimation, enabling fully unsupervised reconstruction without the need for paired motion-free training data. Experiments on simulated and real-motion brain MRI datasets demonstrate that the proposed method achieves improved image quality and motion robustness compared to state-of-the-art classical and learning-based motion correction techniques, particularly in the presence of severe motion and high acceleration.

[CV-85] Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Link: https://arxiv.org/abs/2605.22109
Authors: Caixin Kang,Tianyu Yan,Sitong Gong,Mingfang Zhang,Liangyang Ouyang,Ruicong Liu,Bo Zheng,Huchuan Lu,Kaipeng Zhang,Yoichi Sato,Yifei Huang
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

[CV-86] OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

Link: https://arxiv.org/abs/2605.22104
Authors: Feng Zhu,Shuyang Xie,Yihan Zeng,Ming Liu,Wangmeng Zuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-world image restoration is challenging due to complex and interacting mixed degradations. Recent agent-based approaches address this problem by composing multiple task-specific restoration tools. However, empirical analysis reveals that their performance is fundamentally limited by implicitly constrained planning spaces and the lack of coordination among independently pretrained tools. To address these issues, we propose OPERA (Optimized Planning-Execution Restoration Agent), a framework that jointly optimizes restoration planning and tool execution in an end-to-end manner. On the planning side, OPERA uses reinforcement learning to directly optimize tool composition over a combinatorial plan space, with the final restoration quality as the reward. On the execution side, OPERA introduces agent-guided co-training of restoration tools, enabling them to learn cooperative behaviors under sequential composition. Extensive experiments on multi-degradation benchmarks and real-world datasets demonstrate that OPERA consistently outperforms both all-in-one restoration models and existing agent-based methods across diverse and complex degradation scenarios.

[CV-87] TextTeacher: What Can Language Teach About Images?

Link: https://arxiv.org/abs/2605.22098
Authors: Tobias Christian Nauen,Stanislav Frolov,Brian Bernhard Moser,Federico Raue,Ahmed Anwar,Andreas Dengel
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at TMLR

Abstract:The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: this https URL

[CV-88] VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection – after competition results

Link: https://arxiv.org/abs/2605.22096
Authors: Bo-Cheng Qiu,Fang-Ying Lin,Ming-Han Sun,Yu-Fan Lin,Chia-Ming Lee,Chih-Chung Hsu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
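
The temporal mAP@0.5 and mAP@0.95 figures in this abstract are thresholds on how well a predicted event interval overlaps a ground-truth one. The sketch below shows the standard temporal intersection-over-union between two intervals and a threshold check. The competition's exact matching protocol is not given in the abstract, so this is only the generic definition; `temporal_iou` and `is_match` are hypothetical names.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """A prediction counts as a hit when its temporal IoU reaches the threshold."""
    return temporal_iou(pred, truth) >= threshold


if __name__ == "__main__":
    # Half-overlapping intervals fail even the loose 0.5 threshold.
    print(temporal_iou((0, 10), (5, 15)), is_match((0, 10), (5, 15), 0.5))
    # A prediction that overshoots by 4 frames out of 100 still passes 0.95.
    print(temporal_iou((0, 100), (0, 96)), is_match((0, 100), (0, 96), 0.95))
```

The 0.95 threshold leaves very little room for boundary error, which is why event-level decoding matters more here than per-frame accuracy.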

[CV-89] LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Link: https://arxiv.org/abs/2605.22089
Authors: Xiaodong Mei,Diankun Zhang,Hongwei Xie,Guang Chen,Hangjun Ye,Dan Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.

[CV-90] GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

Link: https://arxiv.org/abs/2605.22086
Authors: Zhiqing Hong,Zelong Li,Xiubin Fan,Guang Yang,Baoshen Guo,Haotian Wang,Tian He,Desheng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human Activity Recognition (HAR) has shown remarkable effectiveness in various applications, such as smart healthcare and intelligent manufacturing. However, a major challenge faced by HAR is the distribution shift across different sensor data domains, which often leads to decreased performance when deployed for real-world applications. To address this issue, this paper introduces GenHAR, a novel framework designed to mitigate the domain gap by learning domain-invariant sensor representations. GenHAR aims to enhance the generalization capabilities of HAR on target domains purely with data from the source domain. The key novelty of GenHAR lies in two aspects. Firstly, GenHAR tokenizes sensor data and learns correlations among frequency sensor channel dimensions to improve the robustness of HAR models. Secondly, GenHAR improves the efficiency via selective masking and an efficient attention mechanism. We conduct a systematic analysis of GenHAR by comparing it with state-of-the-art HAR methods on real-world human activity datasets. Results show that GenHAR outperforms state-of-the-art methods by 9.97% in accuracy, and reduces Floating Point Operations by 6.4 times. Moreover, we deploy GenHAR at a leading logistics company in 4 cities, and have detected 2.15 billion real-time activities. We release our code at: this https URL.

[CV-91] JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Link: https://arxiv.org/abs/2605.22080
Authors: Yue Xun,Junyu Liu,Qian Niu,Xinyi Wang,Zheng Yuan,Zirui Li,Zequn Zhang,Bowen Zhao,Shujun Wang,Irene Li,Kan Hatakeyama-Sato,Yusuke Iwasawa,Yutaka Matsuo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
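
The paired image-removal audit in this abstract can be pictured as a tally over the same questions answered twice, once with and once without the image. The sketch below counts the four answer-transition states and a net effect in percentage points. The state labels, the sign convention (accuracy with images minus accuracy without), and the name `transition_audit` are assumptions for illustration; the abstract does not define them.

```python
from collections import Counter


def transition_audit(with_image, without_image):
    """Tally answer transitions for paired runs on the same questions.

    Both arguments are lists of booleans (True = answered correctly),
    aligned by question. Returns (counts, net_effect_in_points).
    """
    if len(with_image) != len(without_image):
        raise ValueError("runs must cover the same questions")
    if not with_image:
        raise ValueError("no questions to audit")
    counts = Counter()
    for w, wo in zip(with_image, without_image):
        state = ("correct" if w else "wrong", "correct" if wo else "wrong")
        counts[state] += 1
    # Assumed convention: positive means the image helped on balance.
    net = 100.0 * (sum(with_image) - sum(without_image)) / len(with_image)
    return counts, net


if __name__ == "__main__":
    with_img = [True, True, False, True]
    without_img = [True, False, False, False]
    counts, net = transition_audit(with_img, without_img)
    print(dict(counts), net)
```

A large ("correct", "correct") count relative to ("correct", "wrong") corresponds to the pattern the abstract reports for medical-specific systems, where many correct answers persist after image removal.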

[CV-92] Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding ICLR2026

Link: https://arxiv.org/abs/2605.22078
Authors: Bingjun Luo,Tony Wang,Hanqi Chen,Xinpeng Ding
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have significantly advanced video understanding tasks, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient and plug-and-play solution for improving visual token representations. Our code is available at this https URL.
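
One simple way to read the Norm-based Spatial Pooling idea in this abstract is to keep, from each spatial window, the token with the largest L2 norm instead of averaging the window. The sketch below does exactly that on a small token grid. The abstract only says that NSP exploits the correlation between token norms and semantic richness, so the max-norm selection rule, the non-overlapping windows, and the name `norm_based_pool` are assumptions, not the authors' method.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def norm_based_pool(grid, window):
    """Keep the highest-norm token in each window x window block.

    grid[r][c] is the token vector at row r, column c (hypothetical layout).
    Blocks at the border may be smaller when the grid size is not a multiple
    of the window.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r0 in range(0, rows, window):
        out_row = []
        for c0 in range(0, cols, window):
            block = [
                grid[r][c]
                for r in range(r0, min(r0 + window, rows))
                for c in range(c0, min(c0 + window, cols))
            ]
            out_row.append(max(block, key=l2_norm))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    grid = [
        [[1.0, 0.0], [3.0, 4.0]],
        [[0.0, 2.0], [1.0, 1.0]],
    ]
    print(norm_based_pool(grid, 2))
```

Unlike average pooling, this keeps an actual token from the input, so a high-norm region is not blurred by its low-norm neighbours.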

[CV-93] TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting CVPR2025

Link: https://arxiv.org/abs/2605.22069
Authors: Hyeseong Kim,Geonhui Son,Deukhee Lee,Dosik Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025, Project page: this https URL

Abstract:Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.

[CV-94] COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

Link: https://arxiv.org/abs/2605.22068
Authors: Junhyub Lee,Seunghun Chae,Hyosu Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at this https URL.

[CV-95] Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos

Link: https://arxiv.org/abs/2605.22066
Authors: Yanan Liu,Qinya Li,Hao Zhang,Kangjian He,Xuan Yang,Hao Li,Dan Xu,Lei Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reconstructing 4D (3D+t) cardiac geometry from sparse 2D echocardiography is highly desirable yet fundamentally challenged by geometric ambiguity and temporal discontinuity. To tackle these issues, we propose Echo4DIR, a novel test-time 4D implicit reconstruction framework. Specifically, we learn robust 3D shape priors from statistical shape models (SSMs) via a cardiac conditional SDF, constructing an Epipolar Mask Encoder module with epipolar cross attention to effectively fuse multi-view features. To bridge the synthetic-to-real domain gap, we introduce a self-supervised SDF-tailored differentiable rendering strategy for patient-specific 3D shape adaptation using uncalibrated clinical masks without requiring 3D ground truth. Crucially, the inherent continuity of implicit representation overcomes sparse observations, enabling anatomically reliable geometry at arbitrary resolutions. Furthermore, to empower our framework with physically continuous 4D extension, we introduce a Radial SDF Alignment strategy that strictly locks shape evolution to the predicted velocity field, fundamentally eliminating mesh drift. Extensive experiments on synthetic benchmarks and real clinical datasets demonstrate that Echo4DIR achieves state-of-the-art 4D cardiac mesh reconstruction, notably yielding an impressive clinical overlap of up to 98.35% Dice and 96.75% IoU.
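
For readers less familiar with the two overlap metrics quoted in this abstract, the sketch below computes Dice and IoU for a pair of binary masks using their standard definitions. For a single mask pair the two are tied by Dice = 2·IoU / (1 + IoU); plugging in IoU = 0.9675 gives roughly 0.9835, which is consistent with the figures reported above, although metrics averaged over a dataset need not satisfy the identity exactly. The function name `dice_and_iou` and the flat-list mask layout are illustrative choices.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    n_pred = sum(1 for p in pred if p)
    n_target = sum(1 for t in target if t)
    union = n_pred + n_target - inter
    if union == 0:
        # Both masks empty: treat as perfect agreement.
        return 1.0, 1.0
    dice = 2.0 * inter / (n_pred + n_target)
    iou = inter / union
    return dice, iou


if __name__ == "__main__":
    pred = [1, 1, 1, 0]
    target = [0, 1, 1, 1]
    dice, iou = dice_and_iou(pred, target)
    print(dice, iou)
    # Per-pair identity between the two metrics.
    print(abs(dice - 2.0 * iou / (1.0 + iou)) < 1e-12)
```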

[CV-96] Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates CVPR2026

Link: https://arxiv.org/abs/2605.22061
Authors: Guojun Xu,Mingyang Zhang,Jianwen Xiang,Cheng Tan,Yanchao Yang,Junwei Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2026

Abstract:Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates ( 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.

[CV-97] EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation SIGGRAPH2026

Link: https://arxiv.org/abs/2605.22051
Authors: Yue Ma,Xu Ye,Qinghe Wang,Yucheng Wang,Hongyu Liu,Yinhan Zhang,Xinyu Wang,Yuanpeng Che,Shanhui Mo,Paul Liang,Fangneng Zhan,Qifeng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH 2026. Project page: this https URL

Abstract:Generating high-fidelity visual effects (VFX) typically demands massive datasets and prohibitive computational power due to the intricate coupling of spatial textures and temporal dynamics. In this paper, we introduce EasyVFX, a resource-efficient framework that achieves realistic VFX synthesis under stringent constraints. Our core philosophy lies in frequency-domain decomposition: we observe that the complexity of VFX can be significantly mitigated by decoupling high-frequency components, which represent intricate spatial appearances, from low-frequency components that encapsulate global motion dynamics. This spectral disentanglement transforms a high-dimensional learning problem into manageable sub-tasks, thereby lowering the optimization barrier and reducing data dependency. Building upon this insight, we propose a two-stage training paradigm. First, we design a Frequency-aware Mixture-of-Experts (Freq-MoE) architecture. By utilizing a soft routing mechanism, our model assigns specialized experts to distinct spectral bands, enabling them to cultivate robust priors for appearance and motion dynamics. This specialization allows the model to acquire foundational VFX knowledge with fewer GPU resources. Second, we introduce a Test-Time Training strategy powered by a novel Frequency-constraint Loss. This allows the pre-trained model to swiftly adapt to specific, unseen effects through localized optimizations, requiring only about 100 steps on a single GPU. Experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects, proving that frequency-aware learning is a key catalyst for democratizing professional-grade VFX.

[CV-98] Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations KDD2026

Link: https://arxiv.org/abs/2605.22050
Authors: Yuanmin Huang,Mi Zhang,Chen Chen,Feifei Li,Geng Hong,Xiaoyu You,Min Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: KDD 2026, extended version

Abstract:While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we identify for the first time that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves a detection AUC of 0.999 and a 0.0% memorization rate after mitigation with negligible overhead ($\approx 0.01$ s per image).
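One way to picture step-wise detection from latent update norms is sketched below. This is an assumption-laden illustration, not the paper's procedure: the stability region is taken to be a per-step upper bound of mean plus `k` standard deviations over reference runs, and the function names are hypothetical.

```python
import math
import statistics

def update_norms(latents):
    """L2 norm of the change between consecutive latents (flat lists of floats)."""
    return [math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, cur)))
            for prev, cur in zip(latents, latents[1:])]

def fit_stability_region(reference_runs, k=3.0):
    """Assumed rule: per-step bound = mean + k * population std of update norms."""
    per_run = [update_norms(run) for run in reference_runs]
    return [statistics.mean(step) + k * statistics.pstdev(step)
            for step in zip(*per_run)]

def first_unstable_step(latents, bounds):
    """Index of the first update whose norm leaves the region, or None."""
    for step, (norm, bound) in enumerate(zip(update_norms(latents), bounds)):
        if norm > bound:
            return step
    return None
```

A sampler could call `first_unstable_step` on the trajectory so far and trigger mitigation from the flagged step onward; how mitigation is actually applied is left to the paper.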

[CV-99] Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin MICCAI2026

Link: https://arxiv.org/abs/2605.22044
Authors: Mengxiao Wang,Yilin Lyu,Julia Camps,Ching Hui Sia,Mark Yan-Yee Chan,Yanrui Jin,Shuzhi Sam Ge,Chengliang Liu,Lei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Early-accepted by MICCAI 2026. This version corresponds to the submitted version. The final version will be available on Springer Link

Abstract:Accurate localization of myocardial infarction is essential for risk stratification. While LGE-MRI remains the gold standard, it is resource-intensive. Integrating cine MRI with ECG enables a more detailed representation of infarct properties. Existing inverse MI inference methods overlook realistic scar morphology and cardiac repolarization, reducing sensitivity to subtle ECG variations and interpretability of infarct-induced electrophysiological changes. In this paper, we propose a novel framework for noninvasive MI localization using cardiac digital twins. To bridge the domain gap between simulation and reality, we introduce an anatomy-aware stochastic infarct synthesis strategy to synthesize realistic, irregular scars with border zones, mimicking ischemic transmural progression. We then construct a virtual cohort to simulate QRS-T waveforms, capturing both depolarization and repolarization dynamics. Furthermore, we design a Physiology and Anatomy Aware Network (PAA-Net) that jointly encodes 3D myocardial geometry and multi-lead ECGs to infer infarct areas with varying localizations, sizes, spatial extents, and transmuralities. Experimental results demonstrate that our framework significantly outperforms existing methods in inverse inference, achieving Dice scores of 0.7391 and 0.5503 for scar and border zone segmentation, respectively, while further enhancing the interpretability of the ECG-infarct relationship. Our code will be released upon acceptance.

[CV-100] GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

Link: https://arxiv.org/abs/2605.22036
Authors: Jiahao Yang,Zihan Wang,Xiangyang Li,Xing Zhu,Yujun Shen,Yinghao Xu,Shuqiang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.
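The explicit depth-based projection can be sketched as follows. This is a simplified illustration under stated assumptions, not the GA-VLN implementation: a pinhole camera with focal length `fx` and principal point `cx`, one scalar feature per pixel, height ignored, and mean pooling per BEV cell. All names are hypothetical.

```python
import math

def depth_to_bev(depth, features, fx, cx, cell_size, grid_size):
    """Pool per-pixel features into an agent-centric BEV grid.

    depth[v][u] is metric depth (0 means invalid) and features[v][u] is a
    scalar feature. Rows of the output index forward distance z, columns
    index lateral offset x with the agent at the middle column.
    """
    sums = [[0.0] * grid_size for _ in range(grid_size)]
    counts = [[0] * grid_size for _ in range(grid_size)]
    half = grid_size // 2
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip invalid depth readings
            x = (u - cx) * z / fx  # back-project the pixel column to meters
            r = math.floor(z / cell_size)
            c = math.floor(x / cell_size) + half
            if 0 <= r < grid_size and 0 <= c < grid_size:
                sums[r][c] += features[v][u]
                counts[r][c] += 1
    return [[s / n if n else 0.0 for s, n in zip(srow, nrow)]
            for srow, nrow in zip(sums, counts)]
```

Because many pixels fall into the same cell, the grid has far fewer entries than the image has pixels, which is the token-reduction effect the abstract refers to.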

[CV-101] AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

Link: https://arxiv.org/abs/2605.22034
Authors: Haocheng Li,Juepeng Zheng,Zenghao Yang,Kaiqi Du,Guilong Xiao,Gengmeng Pu,Haohuan Fu,Jianxi Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 45 pages, 12 figures

Abstract:Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-$F_1$ reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at this https URL.
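A set-level F1 with abstention could look like the sketch below. The benchmark's actual box-set matching protocol is not given in the abstract, so this assumes greedy one-to-one matching by descending IoU at a fixed threshold and a score of 1.0 for a correct abstention; `iou` and `set_f1` are hypothetical names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def set_f1(pred_boxes, gt_boxes, iou_thresh=0.5):
    """F1 between a predicted and a ground-truth box set (assumed protocol)."""
    if not pred_boxes and not gt_boxes:
        return 1.0  # correct abstention on a target-absent query
    if not pred_boxes or not gt_boxes:
        return 0.0  # missed every target, or predicted boxes for an absent target
    pairs = sorted(((iou(p, g), i, j)
                    for i, p in enumerate(pred_boxes)
                    for j, g in enumerate(gt_boxes)), reverse=True)
    used_pred, used_gt, matches = set(), set(), 0
    for score, i, j in pairs:
        if score < iou_thresh:
            break
        if i in used_pred or j in used_gt:
            continue  # keep the matching one-to-one
        used_pred.add(i)
        used_gt.add(j)
        matches += 1
    if matches == 0:
        return 0.0
    precision = matches / len(pred_boxes)
    recall = matches / len(gt_boxes)
    return 2 * precision * recall / (precision + recall)
```

Scoring the whole set at once penalizes both missing instances and spurious boxes, which is what separates this setting from single-box grounding accuracy.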

[CV-102] SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction

Link: https://arxiv.org/abs/2605.22031
Authors: Pengcheng Fang,Hongli Chen,Fangfang Tang,Feng Liu,Xiaohao Cai,Shanshan Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Accelerated MRI reconstruction requires recovering missing details while preserving anatomically coherent structures across large spatial regions. State-space models such as Mamba provide efficient long-range modeling, making them attractive learned regularizers for unrolled reconstruction. However, in a data-consistency-coupled unrolled solver, different stages operate on different reconstruction iterates, where the resident carrier should preserve coherent reconstruction content across stages while stage-dependent non-resident evidence is tied to the current update. Treating these roles uniformly can place persistent resident-carrier evidence and update-dependent non-resident evidence into the same recurrent content route. We therefore propose SO-Mamba, a state-ownership Mamba regularizer that assigns reconstruction evidence within each Mamba stage to recurrent residency, state-interface access, and non-state output correction. SO-Mamba implements this ownership rule with a State-Ownership Router (SOR), which constructs a resident carrier for recurrent content and routes non-resident evidence to affine modulation of the B/C state interfaces and an output correction outlet. The resident carrier supplies the Mamba content route, while the non-resident evidence stream adapts the state interfaces and contributes through the output outlet without entering the recurrent content route. We further introduce a two-level outer-band leakage diagnostic that separates hidden-state storage from readout expression by measuring outer-band energy in the selective-scan state trajectory and the post-scan Mamba readout. Experiments on five public MRI reconstruction benchmarks spanning diverse anatomies, sampling patterns, and coil configurations show that SO-Mamba consistently improves over CNN-, Transformer-, and Mamba-based baselines with competitive computational efficiency.

[CV-103] ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

Link: https://arxiv.org/abs/2605.22020
Authors: Yuke Li,Weihang Liu,Cheng Zhang,Yuefeng Zhang,Jiadi Cui,Zixuan Wang,Junran Ding,Haoyu Wu,Yujiao Shi,Jingyi Yu,Xin Lou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Feed-forward 3D Gaussian Splatting (3DGS) models offer fast single-pass reconstruction, but scaling them to match per-scene optimization quality is fundamentally hindered by the scarcity of large-scale 3D annotations. A practical compromise is predict-then-refine, where post-prediction optimization compensates for the limited capacity of the feed-forward this http URL, standard feed-forward 3DGS is trained solely for zero-step rendering error, ignoring whether its output constitutes a good initialization for the downstream this http URL present ForeSplat, an optimization-aware training framework that equips feed-forward 3DGS models to produce initializations explicitly designed for rapid, effective this http URL offloading part of the scene-modeling burden to the optimizer, ForeSplat substantially reduces the capacity pressure on the feed-forward model, making high-quality reconstruction feasible even with compact this http URL its core is MetaGrad, a lightweight multi-anchor meta-gradient training rule that bypasses costly higher-order differentiation through the 3DGS this http URL unrolls a short inner-loop refinement trajectory, samples anchor states, and back-propagates aggregated first-order gradients to the prediction head as a surrogate optimization-aware this http URL fine-tuning adds no inference cost and enables high-quality reconstruction within seconds after a few refinement this http URL instantiate ForeSplat on diverse backbones, including AnySplat, Pi3X, and a distilled variant tailored for edge this http URL all tested architectures, a ForeSplat-trained initialization converges in fewer refinement steps and reaches a higher peak reconstruction quality than its vanilla counterpart, even fully this http URL framework consistently bridges the gap between amortized prediction and per-scene optimization, establishing a practical path toward lightweight, high-fidelity 3D reconstruction.

[CV-104] FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

Link: https://arxiv.org/abs/2605.22018
Authors: Connor Malone,Sebastien Demmel,Sebastien Glaser
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes:

Abstract:The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360$^\circ$ point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations captured both during and after flooding events. The data has been released in two formats: a KITTI-style format for easy integration with existing data tools, and the RTMaps format for direct replay of the vehicle's data capture. We provide semantic labels to enable the training and evaluation of both single-sensor and sensor-fusion methods for water hazard detection. Position and velocity, as well as data captured under dry conditions, are provided to enable the development of location-based detection methods that may incorporate maps, and to evaluate other tasks such as localisation and SLAM.

[CV-105] Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction CVPR

Link: https://arxiv.org/abs/2605.22017
Authors: Lei Chu,Yuhuan Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: MEIS-- CVPR

Abstract:Deep generative models have become a promising approach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human behaviors. However, generating predictions that are both diverse and jointly consistent among interacting agents remains challenging. In addition, most existing approaches are primarily evaluated using single-agent (marginal) metrics, which fail to fully reflect the joint dynamics of multi-agent interactions. We propose a diffusion-based framework that improves multi-agent motion prediction by leveraging rich contextual information from historical trajectories. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consistency, we introduce an energy-based formulation that refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experiments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. Notably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.
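The gap between marginal and joint metrics is easiest to see in code. The sketch below assumes the common best-of-K convention: marginal metrics pick the best sample separately for each agent, while joint metrics pick one sample for the whole scene. The data layout and function names are hypothetical.

```python
import math

def ade(pred, gt):
    """Average displacement error over all timesteps of one trajectory."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final displacement error: distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])

def marginal_and_joint(samples, gt, err):
    """Best-of-K marginal and joint errors.

    samples[k][n] is agent n's trajectory in joint sample k, gt[n] is the
    ground truth for agent n, and err is ade or fde.
    """
    n_agents = len(gt)
    errs = [[err(sample[n], gt[n]) for n in range(n_agents)] for sample in samples]
    # Marginal: each agent may take its minimum from a different sample.
    marginal = sum(min(e[n] for e in errs) for n in range(n_agents)) / n_agents
    # Joint: all agents must come from the same sample.
    joint = min(sum(e) / n_agents for e in errs)
    return marginal, joint
```

The marginal value can never exceed the joint one, so a model may look strong on ADE/FDE while no single sample describes the scene well.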

[CV-106] ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

Link: https://arxiv.org/abs/2605.22015
Authors: Hangyeol Lee,Joo-Young Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
Notes:

Abstract:Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, an SW-HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.

[CV-107] PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

Link: https://arxiv.org/abs/2605.22013
Authors: Chaoqi Chen,Qile Xu,Wenjun Zhou,Hui Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Notes:

Abstract:Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.

[CV-108] Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

Link: https://arxiv.org/abs/2605.22011
Authors: Hangyeol Lee,Hyojeong Lee,Joo-Young Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.
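The match-once, reuse-later pattern can be sketched as below. This is an illustrative simplification, not DiTo itself: it assumes a bipartite split into even-indexed destination tokens and odd-indexed source tokens, cosine similarity on the previous step's outputs, and recovery by copying the destination's output. Scheduling and the selection-frequency penalty are omitted, and all names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_tokens(prev_outputs, num_reduce):
    """Matching step: map the num_reduce most redundant source tokens
    (odd indices) to their most similar destination token (even indices),
    using similarity measured on the previous step's outputs."""
    dst_ids = range(0, len(prev_outputs), 2)
    candidates = []
    for src in range(1, len(prev_outputs), 2):
        best = max(dst_ids, key=lambda d: cosine(prev_outputs[src], prev_outputs[d]))
        candidates.append((cosine(prev_outputs[src], prev_outputs[best]), src, best))
    candidates.sort(reverse=True)
    return {src: dst for _, src, dst in candidates[:num_reduce]}

def reduce_tokens(tokens, mapping):
    """Reduction step: drop matched source tokens before the expensive block."""
    kept = [i for i in range(len(tokens)) if i not in mapping]
    return kept, [tokens[i] for i in kept]

def recover_tokens(kept, reduced_outputs, mapping, n):
    """Rebuild the full sequence by copying each destination's output to its sources."""
    out = [None] * n
    for i, tok in zip(kept, reduced_outputs):
        out[i] = tok
    for src, dst in mapping.items():
        out[src] = out[dst]
    return out
```

The `mapping` returned at a matching step can be passed unchanged to `reduce_tokens` and `recover_tokens` for several later steps, which is where the savings would come from.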

[CV-109] ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation

Link: https://arxiv.org/abs/2605.22002
Authors: Joao Batista Florindo,Amanda Pontes de Oliveira Ornelas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Biomedical image segmentation is a critical task in medical diagnosis and treatment planning, enabling precise delineation of anatomical structures and pathological regions. Despite significant advancements, challenges persist due to the inherent variability, noise, and complex morphology present in diverse medical imaging modalities. This paper introduces ConvNeXt-FD, a novel deep learning architecture for robust biomedical image segmentation, built upon a U-Net-like encoder-decoder framework leveraging the powerful ConvNeXt backbone. Our approach integrates a hybrid loss function combining the Dice coefficient with a boundary-aware regularization term inspired by a differentiable formulation of Fractal Dimension, designed to enhance the model’s sensitivity to object boundaries and shape fidelity. We rigorously evaluate ConvNeXt-FD across six distinct biomedical datasets: BUSI (Breast Ultrasound Images), DDTI (Thyroid Ultrasound Images), FluoCells (Fluorescent Cell Images), IDRiD (Diabetic Retinopathy Images for Optic Disc Segmentation), ISIC2018 (Skin Lesion Images), and MoNuSeg (Nuclei Segmentation). Experimental results demonstrate that ConvNeXt-FD, particularly when initialized with ImageNet pre-trained weights, achieves competitive and often superior performance compared to existing state-of-the-art methods across various metrics, including Dice, Jaccard, Accuracy, Sensitivity, Specificity, and False Positive Rate. The integration of ConvNeXt as a strong encoder, coupled with the boundary-aware regularization, proves effective in capturing both high-level semantic features and fine-grained boundary details, leading to more accurate and reliable segmentations in challenging biomedical contexts.

[CV-110] Virtual 3D HE Staining from Phase-contrast Back-illumination Interference Tomography

Link: https://arxiv.org/abs/2605.22000
Authors: Anthony Song,Boyan Zhou,Mayank Golhar,Marisa Morakis,Alex Baras,Nicholas Durr
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Three-dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in-vivo assessment. Back-illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non-destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable HE images remains challenging, particularly due to shift-variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel-wise paired BIT and fluorescence-labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground-truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift-variant contrast into realistic HE volumes by leveraging bidirectional multiscale content consistency and cross-domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state-of-the-art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero-shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual HE staining, advancing the paradigm of slide-free, volumetric computational histopathology. Our data and code are available at: this https URL.

[CV-111] Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Link: https://arxiv.org/abs/2605.21988
Authors: Dazhao Du,Jian Liu,Jialong Qin,Tao Han,Bohai Gu,Fangqi Zhu,Yujia Zhang,Eric Liu,Xi Chen,Song Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Project website: this https URL

Abstract:Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at this https URL.
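The two counterfactual transforms and the relation between branch answers can be sketched in a few lines. The binary reward below is an assumption made for illustration; the abstract does not give the exact form of CRR, and the function names are hypothetical.

```python
def horizontal_flip(video):
    """Mirror every frame left to right; video[t][row][col] holds pixel values."""
    return [[row[::-1] for row in frame] for frame in video]

def temporal_reverse(video):
    """Play the frames in reverse order."""
    return video[::-1]

def relation_reward(answer_original, answer_counterfactual, is_dynamic):
    """Assumed binary reward: answers should differ for dynamic questions
    and agree for static ones."""
    changed = answer_original != answer_counterfactual
    return 1.0 if changed == is_dynamic else 0.0
```

A policy that always answers "left" regardless of the video would earn no reward on dynamic questions such as moving direction, since its answer does not change after a flip.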

[CV-112] RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Link: https://arxiv.org/abs/2605.21981
Authors: Le Zhang,Ning Mang,Aishwarya Agrawal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Flow matching with $x$-prediction (regressing the clean data point rather than the ambient velocity) is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d} \approx 33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet $256 \times 256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^{\text{DH}}$-XL with 19% fewer parameters (676M vs. 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, 5 Heun steps already reach FID 2.0 and 10 steps reach 1.25, without distillation or consistency training. Code at this https URL.
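To make the combination of x-prediction and Heun sampling concrete, here is a minimal sketch. It assumes a linear path z_t = (1 - t) * noise + t * x with noise at t = 0 and data at t = 1, so the velocity implied by a clean-data prediction is (x_pred - z_t) / (1 - t). The path convention, the Euler fallback on the final step, and the names `velocity`, `heun_sample`, and `predict_x` are assumptions, not details from the paper.

```python
def velocity(x_pred, z, t):
    """Velocity implied by a clean-data prediction under the assumed linear path."""
    return [(xp - zi) / (1.0 - t) for xp, zi in zip(x_pred, z)]

def heun_sample(predict_x, z0, num_steps):
    """Integrate from t = 0 (noise) to t = 1 (data) with Heun's method.

    predict_x(z, t) stands in for the network and returns its estimate of
    the clean data point.
    """
    z = list(z0)
    for i in range(num_steps):
        t, t_next = i / num_steps, (i + 1) / num_steps
        dt = t_next - t
        v1 = velocity(predict_x(z, t), z, t)
        z_euler = [zi + dt * vi for zi, vi in zip(z, v1)]
        if t_next >= 1.0:
            # Final step: plain Euler, since the velocity formula divides by 1 - t.
            z = z_euler
            break
        v2 = velocity(predict_x(z_euler, t_next), z_euler, t_next)
        # Heun corrector: average the slopes at both ends of the step.
        z = [zi + 0.5 * dt * (a + b) for zi, a, b in zip(z, v1, v2)]
    return z
```

With a perfect predictor the trajectory is a straight line, so even a handful of steps lands on the target; a learned predictor only approximates this, which is why the step count still matters in practice.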

[CV-113] Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow ICML2026

Link: https://arxiv.org/abs/2605.21980
Authors: Chengsheng Zhang,Chenghao Sun,Zhining Xie,Xinmei Tian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted by ICML 2026

Abstract:Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage "Adapt-Aggregate-Execute" mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.

[CV-114] Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

Link: https://arxiv.org/abs/2605.21977
Authors: Zhengcen Li,Chenyang Jiang,Liangxu Su,Tong Shao,Shiyang Zhou,Ming Tao,Jingyong Su
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.

[CV-115] Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding ICML2026

Link: https://arxiv.org/abs/2605.21973
Authors: Zelin Zheng,Xinyan Liu,Ruixin Li,Antoni B. Chan,Guorong Li,Qingming Huang,Laiyun Qing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ICML 2026

Abstract:Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.

[CV-116] Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection

Link: https://arxiv.org/abs/2605.21964
Authors: Xuquan Wang,Guishuo Yang,Dapeng Yan,Yujie Xing,Xuanyu Qian,Kai Zhang,Xiong Dun,Jiande Sun,Zhanshan Wang,Xinbin Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: 15 pages, 11 figures; supplementary material: 3 pages, 2 figures

Abstract:Computational imaging enables compact infrared systems, but deep-learning pipelines that combine image reconstruction and object detection often introduce substantial inference latency. Most existing acceleration strategies compress the reconstruction network while overlooking physical priors from the optical path, leaving a trade-off between accuracy and speed. We present Physics-aware Dual-Integrated Network (PDI-Net), a low-latency framework that integrates infrared reconstruction with object detection and further embeds optical priors into the learning process. PDI-Net uses a supervised U-Net during training, while a semi-U-Net encoder shares features directly with a YOLO-based detector during inference, avoiding full image reconstruction. To bridge the gap between fidelity-oriented reconstruction features and detection-oriented semantics, we introduce a physics-aware large-small bridge (PALS-Bridge), which uses field-dependent point spread function priors to adaptively modulate multiscale convolutional branches. A physics-informed optical degradation simulation pipeline is also developed for training and validation. The method is deployed on a single-lens infrared camera, reducing system weight by about 50% compared with traditional multi-lens designs. On the M3FD benchmark under low-SNR conditions, PDI-Net reduces inference time by 84.06% compared with the Rec+Det with pruning strategy while improving mAP@0.5:0.95 by 5.07%. These results demonstrate compact, low-latency computational infrared imaging for real-time object detection on resource-constrained platforms.

[CV-117] Bounding-Box Trajectories Matter for Video Anomaly Detection

Link: https://arxiv.org/abs/2605.21957
Authors: Inpyo Song,Jangwon Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 3 figures

Abstract:Video anomaly detection is critical for public safety and security, yet remains highly challenging despite extensive research due to large variations in appearance, viewpoint, and scene dynamics. Among existing approaches, human pose-based methods have emerged as a major line of research, showing strong performance since many anomalies in public datasets involve humans and pose representations are robust to appearance changes while providing compact motion descriptions. However, these methods often overlook bounding-box trajectories, although such information is inherently available in pose-based pipelines. In this paper, we explicitly leverage these trajectories as a primary anomaly cue. We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech, highlighting bounding-box trajectories as an effective yet underexplored modality for video anomaly detection.

[CV-118] MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Link: https://arxiv.org/abs/2605.21954
Authors: Dazhao Du,Liao Duan,Jian Liu,Tao Han,Yujia Zhang,Eric Liu,Xi Chen,Song Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Website: this https URL

Abstract:Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates or architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at this https URL.
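
The "read" step described above (debias a frame-level relevance signal, then extract the interval it highlights) can be illustrated with a minimal sketch. This is not the authors' implementation: the baseline-subtraction debiasing, the `ratio` threshold, and the function names `extract_interval` and `to_seconds` are all assumptions made for illustration.

```python
from typing import List, Tuple


def extract_interval(relevance: List[float],
                     baseline: List[float],
                     ratio: float = 0.5) -> Tuple[int, int]:
    """Return (start, end) frame indices (inclusive) of the highlighted span.

    relevance: per-frame query-to-video attention mass (hypothetical input).
    baseline:  per-frame attention under a content-free query, used here as
               an assumed stand-in for positional bias.
    """
    # Debias: remove the query-independent component, clip at zero.
    debiased = [max(r - b, 0.0) for r, b in zip(relevance, baseline)]
    # Start from the most relevant frame.
    peak = max(range(len(debiased)), key=debiased.__getitem__)
    threshold = ratio * debiased[peak]
    # Grow the interval while neighbouring frames stay above the threshold.
    # If every debiased score is zero, this falls back to the whole video.
    start = end = peak
    while start > 0 and debiased[start - 1] >= threshold:
        start -= 1
    while end < len(debiased) - 1 and debiased[end + 1] >= threshold:
        end += 1
    return start, end


def to_seconds(start: int, end: int, fps: float) -> Tuple[float, float]:
    """Convert inclusive frame indices to a [start, end) time span in seconds."""
    return start / fps, (end + 1) / fps


relevance = [0.30, 0.10, 0.12, 0.50, 0.62, 0.55, 0.11, 0.28]
baseline = [0.25, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.25]
span = extract_interval(relevance, baseline)
print(span, to_seconds(*span, fps=2.0))
```

In this toy input the first and last frames receive high raw attention, but most of it is shared with the baseline, so the extracted span is driven by the middle frames. The model would then be re-invoked on that span only.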

[CV-119] EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Link: https://arxiv.org/abs/2605.21931
Authors: Shiqi Huang,Ziyue Wang,Zhongrong Zuo,Han Qiu,Qi She,Bihan Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.

[CV-120] Visual-Advantage On-Policy Distillation for Vision-Language Models

Link: https://arxiv.org/abs/2605.21924
Authors: Ruiqi Liu,Xiaolei Lv,Gengsheng Li,Ximo Zhu,Zhiheng Wang,Zhengbo Zhang,Junkai Chen,Zhiheng Li,Bo Li,Jun Gao,Shu Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student’s output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student’s predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher’s predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language. We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.

[CV-121] SDGBiasBench: Benchmarking and Mitigating Vision–Language Models Biases in Sustainable Development Goals

Link: https://arxiv.org/abs/2605.21919
Authors: Zihang Lin,Huaiyuan Qin,Muli Yang,Hongyuan Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision–Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG-specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of fairer and more reliable AI systems for sustainable development.

[CV-122] MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks CVPR2026

Link: https://arxiv.org/abs/2605.21917
Authors: Han Zhang,Wanting Jiang,Tomasz Kornuta,Tian Zheng,Vidya Murali
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2026 Workshop

Abstract:Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream QA generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a +38.8-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by +10.7 MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.

[CV-123] Multi-scale interaction network for stereo image super-resolution

Link: https://arxiv.org/abs/2605.21913
Authors: Liyi Xu,Lin Qi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Stereo image super-resolution aims to generate high-resolution images by leveraging complementary information from binocular systems. Although previous studies have achieved impressive results, the potential of intra-view and cross-view information has not been fully exploited. To address this issue, we propose a novel multi-scale interaction network for stereo image super-resolution. Specifically, we design a Multi-scale Spatial-Channel Attention Module that utilizes multi-scale large separable kernel attention and simple channel attention to improve intra-view feature extraction. Additionally, we propose a Dual-View Epipolar Attention Module, utilizing an optimal transport algorithm to achieve more accurate matching along the epipolar line. Extensive experiments and ablation studies show that our method achieves competitive results that outperform most SOTA methods.
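
The abstract does not say which optimal transport algorithm the Dual-View Epipolar Attention Module uses. As one common choice, the sketch below applies entropy-regularised Sinkhorn iterations to a single rectified image row, where the epipolar line is the row itself. The scalar "features", the squared-difference cost, `eps`, and the function names are illustrative assumptions only.

```python
import math
from typing import List


def row_cost(left: List[float], right: List[float]) -> List[List[float]]:
    """Squared feature distance between every left/right pixel on one row."""
    return [[(a - b) ** 2 for b in right] for a in left]


def sinkhorn(cost: List[List[float]],
             eps: float = 0.1,
             iters: int = 200) -> List[List[float]]:
    """Entropy-regularised OT plan with uniform marginals (Sinkhorn scaling)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    r, c = 1.0 / n, 1.0 / m  # uniform row and column marginals
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [r / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


left_row = [0.0, 1.0, 2.0]    # toy features of three left-view pixels
right_row = [1.0, 2.0, 0.0]   # the same content, shifted in the right view
plan = sinkhorn(row_cost(left_row, right_row))
matches = [max(range(len(row)), key=row.__getitem__) for row in plan]
print(matches)
```

Unlike a row-wise softmax, the transport plan is constrained on both marginals, so two left pixels cannot both place all their mass on the same right pixel. That constraint is one plausible reason an OT formulation can give sharper cross-view matching.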

[CV-124] Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

Link: https://arxiv.org/abs/2605.21907
Authors: Gang Dai,Yining Huang,Yiming Xia,Guohao Chen,Shuaicheng Niu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show that our approach outperforms baselines by 15.6% on GenEval score and by 60.4% on ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.

[CV-125] Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Link: https://arxiv.org/abs/2605.21906
Authors: Yuheng Li,Yuan Gao,Haoyu Dong,Yuxiang Lai,Shansong Wang,Mojtaba Safari,James E. Baciak,Xiaofeng Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Code is available at this https URL

[CV-126] Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

Link: https://arxiv.org/abs/2605.21882
Authors: Rusiru Thushara,Yasiru Ranasinghe,Jay Paranjape,Vishal M. Patel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures

Abstract:Vision-language models (VLMs) often fail under low illumination because their visual grounding is learned predominantly from RGB imagery, whereas thermal infrared preserves complementary scene structure when visible cues degrade. We present Thermo-VL, a wavelength-aware VLM that augments a frozen Molmo-7B backbone with a trainable thermal encoder and a text-guided dual-attention fusion module. Given aligned RGB tokens, thermal tokens, and prompt embeddings, the fusion module conditions thermal features on both language and RGB context, then injects a gated residual into the frozen RGB stream so thermal evidence can be incorporated without disrupting Molmo’s pretrained RGB-language interface. We train the model with the standard language-modeling objective together with auxiliary alignment and regularization losses that improve cross-modal grounding and reduce over-reliance on RGB. We also introduce a pixel-aligned RGB-thermal instruction-tuning dataset and Thermo-VL-Bench, a manually screened RGB-thermal VQA benchmark for low-light and cross-spectrum reasoning. Experiments show strong gains on challenging thermal-only and RGB+thermal reasoning tasks, highlighting the value of prompt-conditioned multispectral fusion. Our dataset and code are publicly available at: this https URL

[CV-127] Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models KDD2026

Link: https://arxiv.org/abs/2605.21861
Authors: Yuting He,Chenyu You,Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026

Abstract:Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy and autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, comprising over 4 million images across 10 modalities, which provides FM-level pre-training data for DEX with the broadest coverage of distinct imaging modalities. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at this https URL.

[CV-128] CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.21854
Authors: Zhi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Workshop draft, 14 pages, 4 figures. Code, ckpts, data: this https URL

Abstract:Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) – the de-facto post-training step in language models – has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) – per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial – with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling – both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at this https URL.
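
Two of the figures quoted above can be checked with simple arithmetic. The sketch below recomputes the mean per-suite gain and gives one possible reading of the "21% ceiling" via Amdahl's law. The assumption that everything outside the denoise loop is cacheable prefix work is ours, not a statement from the abstract, and the function names are hypothetical.

```python
# Per-suite DoRA-over-SFT gains in percentage points, as quoted in the abstract.
per_suite_gain_pp = {
    "Object": 20.0,
    "Long-horizon": 11.0,
    "Goal": 8.0,
    "Spatial": 2.7,
}
mean_gain_pp = sum(per_suite_gain_pp.values()) / len(per_suite_gain_pp)


def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of latency is accelerated s-fold."""
    return 1.0 / ((1.0 - p) + p / s)


def latency_saving(p: float, s: float) -> float:
    """Fraction of total latency removed under the same assumptions."""
    return 1.0 - 1.0 / amdahl_speedup(p, s)


denoise_fraction = 0.786                      # share of sample_actions latency
cacheable_fraction = 1.0 - denoise_fraction   # assumed: all of it is prefix work

# Best case: the cacheable part becomes free (s -> infinity).
best_saving = latency_saving(cacheable_fraction, float("inf"))
best_speedup = amdahl_speedup(cacheable_fraction, float("inf"))

print(f"mean gain: {mean_gain_pp:.3f} pp")
print(f"max latency saving from prefix caching: {best_saving:.1%}")
print(f"max end-to-end speedup: {best_speedup:.3f}x")
```

Under this reading, even a perfect prefix cache leaves the denoise loop untouched, so further latency reductions would have to come from the denoising steps themselves.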

[CV-129] Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset Benchmark and Models for Seizure Semiology Understanding ICML2026

Link: https://arxiv.org/abs/2605.21852
Authors: Lina Zhang,Tonmoy Monsoor,Peizheng Li,Jiarui Cui,Xinyi Peng,Chong Han,Prateik Sinha,Siyuan Dai,Jessica Nichole Pasqua,Colin M McCrimmon,Weiting Liu,Hailey Marie Miranda,Bing Hu,Xiangting Wu,Tengyou Xu,Chunhan Li,Jiaye Tian,Jiarui Tang,Detao Ma,Lingye Kong,Junnan Lyu,Jungang Li,Yan Zan,Junhua Huang,Rajarshi Mazumder,Vwani Roychowdhury
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2026 as a Spotlight presentation

Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in general video understanding, their capacity to interpret involuntary and spatio-temporally evolving pathologic motor behaviors such as seizure semiology remains largely untested. To address this gap, we introduce Seizure-Semiology-Suite, a clinically grounded dataset and benchmark for fine-grained, structured seizure semiology understanding. The dataset includes 438 seizure videos annotated with over 35,000 dense labels covering 20 ILAE-defined semiological features. Building on this dataset, we propose a seven-task hierarchical benchmark that systematically evaluates MLLMs from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis. To enable clinically meaningful evaluation of generated reports, we further introduce the Report Quality Index for Seizure Semiology (Seizure-RQI). Extensive baselines across 11 open-weight MLLMs reveal systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. We show that seizure-specific fine-tuning substantially improves performance across tasks, and that a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification. Seizure-Semiology-Suite establishes a rigorous benchmark for evaluating multimodal models in safety-critical medical video understanding and guides the development of clinically reliable, domain-adaptive multimodal intelligence.

[CV-130] SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

Link: https://arxiv.org/abs/2605.21788
Authors: Xuefei Sun,Xujia Zhang,Brendan Crowe,Doncey Albin,Christoffer Heckman
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.
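
Grounding as constrained alignment between a query graph and a scene graph can be sketched with a small backtracking matcher. This is an illustrative toy, not the paper's algorithm: the graph encoding (label dictionaries plus relation triples), exact label matching, and the function name `ground` are all assumptions.

```python
from typing import Dict, List, Set, Tuple

SceneEdge = Tuple[int, str, int]   # (subject object id, relation, object id)
QueryEdge = Tuple[str, str, str]   # (subject variable, relation, object variable)


def ground(scene_nodes: Dict[int, str],
           scene_edges: Set[SceneEdge],
           query_nodes: Dict[str, str],
           query_edges: List[QueryEdge],
           target: str) -> List[int]:
    """Return scene object ids that the query's target variable can bind to."""
    variables = list(query_nodes)
    # Each variable may only bind to scene objects with the same label.
    candidates = {
        v: [obj for obj, label in scene_nodes.items() if label == query_nodes[v]]
        for v in variables
    }
    results: Set[int] = set()

    def backtrack(k: int, assign: Dict[str, int]) -> None:
        if k == len(variables):
            # Accept only if every query relation exists in the scene graph.
            if all((assign[a], rel, assign[b]) in scene_edges
                   for a, rel, b in query_edges):
                results.add(assign[target])
            return
        var = variables[k]
        for obj in candidates[var]:
            if obj in assign.values():   # distinct variables, distinct objects
                continue
            assign[var] = obj
            backtrack(k + 1, assign)
            del assign[var]

    backtrack(0, {})
    return sorted(results)


# Toy scene: two chairs, one table, one lamp.
scene_nodes = {0: "chair", 1: "chair", 2: "table", 3: "lamp"}
scene_edges = {(0, "next_to", 2), (3, "on", 2)}

# Query: "the chair next to the table that has a lamp on it".
query_nodes = {"c": "chair", "t": "table", "l": "lamp"}
query_edges = [("c", "next_to", "t"), ("l", "on", "t")]

print(ground(scene_nodes, scene_edges, query_nodes, query_edges, target="c"))
```

Because the match is computed over explicit relation triples, each grounded answer comes with the bindings that justify it, which is the kind of interpretability the abstract attributes to the structured formulation.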

[CV-131] BodyReLux: Temporally Consistent Full-Body Video Relighting SIGGRAPH2026

Link: https://arxiv.org/abs/2605.21766
Authors: Li Ma,Mingming He,Xueming Yu,David M. George,Ahmet Levent Taşel,Paul Debevec,Julien Philip
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: SIGGRAPH 2026 Journal Track. Project page: this https URL

Abstract:Being able to relight human performance is a fundamental task for post production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high-quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.

[CV-132] Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

Link: https://arxiv.org/abs/2605.21747
Authors: Steven Chen,Shivesh Khaitan,Nemanja Djuric
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation

Abstract:We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle’s make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

[CV-133] AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

Link: https://arxiv.org/abs/2605.21714
Authors: Ziyi Kou,Ankit Kumar,Mia Huang,Taylor Niehues,Vatsal Mehta,Ergys Ristani,Li Guan
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model’s sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.

[CV-134] MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast

Link: https://arxiv.org/abs/2605.21669
Authors: Jinghang Li,Tales Santini,Courtney Clark,Bruno de Almeida,Cong Chu,Salem Alkhateeb,Andrea Sajewski,Jacob Berardinelli,Hecheng Jin,Tobias Campos,Jeremy J. Berardo,Joseph Mettenburg,Ariel Gildengers,Howard J. Aizenstein,Minjie Wu,Tamer S. Ibrahim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, using autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from the synthesized and the as-acquired images closely matched (n=416, r=0.87-0.97), and synthesis yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). Owing to the larger sample size, the synthesized images also yielded larger effect sizes for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus $\epsilon^2$ = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: this https URL

[CV-135] Hierarchical Variational Policies for Reward-Guided Diffusion

Link: https://arxiv.org/abs/2605.21661
Authors: Kushagra Pandey,Farrin Marouf Sofian,Jan Niklas Groeneveld,Felix Draxler,Stephan Mandt
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality–speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

[CV-136] Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Link: https://arxiv.org/abs/2605.21652
Authors: Yue Zhou,Erxuan Wu,Yikang Sun,Hongjoo Lee,Yuan Bi,Huixiong Xu,Zhongliang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer’s cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
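
The abstract describes estimating prediction consistency across a group of stochastic rollouts as a proxy for confidence, but it does not give the formula. One simple way to picture the idea, assuming each rollout yields a discrete answer, is the fraction of rollouts that agree with the majority answer. The function name `group_consistency` and the example labels are hypothetical.

```python
from collections import Counter


def group_consistency(answers):
    """Return (majority_answer, fraction of rollouts agreeing with it).

    Illustrative proxy only; the paper's actual reward is not specified
    in the abstract.
    """
    if not answers:
        raise ValueError("need at least one rollout")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


# Four hypothetical rollouts for one ultrasound question.
rollouts = ["benign", "benign", "malignant", "benign"]
majority_answer, consistency = group_consistency(rollouts)
print(majority_answer, consistency)
```

A low consistency value would flag an ambiguous case where the model's rollouts disagree.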

[CV-137] Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

Link: https://arxiv.org/abs/2605.21642
Authors: Tianyi Zhang,Mahtab Bigverdi,Ranjay Krishna
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support “visual thinking.” Despite improved task accuracy, this alone does not show that models actually use these tokens for reasoning – gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
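
The content-replacement ablations named in the abstract (zero, random, first-repeat) can be pictured with a small sketch. It assumes the intermediate tokens are available as a list of float vectors; the helper `replace_tokens` is hypothetical and is not the authors' code. The oracle variant is omitted because it needs ground-truth content.

```python
import random


def replace_tokens(tokens, mode, seed=0):
    """Return a same-length copy of `tokens` with the content replaced.

    The token count (budget) is preserved, so only the content changes.
    """
    if mode == "zero":
        return [[0.0] * len(t) for t in tokens]
    if mode == "random":
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in t] for t in tokens]
    if mode == "first-repeat":
        return [list(tokens[0]) for _ in tokens]
    raise ValueError(f"unknown mode: {mode}")


latent = [[0.3, -1.2], [0.8, 0.1], [-0.5, 0.9]]
zeroed = replace_tokens(latent, "zero")
repeated = replace_tokens(latent, "first-repeat")
noised = replace_tokens(latent, "random")
```

If task accuracy barely changes when `latent` is swapped for `zeroed` or `noised`, the gain is more plausibly due to token presence than token content, which is the comparison the abstract describes.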

[CV-138] UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

Link: https://arxiv.org/abs/2605.21611
Authors: Jiayun Wang,Yu Wang,Weijie Gan,Zhenting Wang,Wei Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm. 

[CV-139] GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

Link: https://arxiv.org/abs/2605.21605
Authors: Sixiang Chen,Zhaohu Xing,Tian Ye,Xinyu Geng,Yunlong Lin,Jianyu Lai,Xuanhua He,Fuxiang Zhai,Jialin Gao,Lei Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Open-ended image generation is no longer a simple prompt-to-image problem. High-quality generation often requires an agent to combine a model’s internal generative ability with external resources. As requests become more diverse and demanding, we aim to develop a general image-generation agent that can self-evolve through trajectories and use tools more effectively across varied generation challenges. To this end, we propose GenEvolve, a self-evolving framework based on Tool-Orchestrated Visual Experience Distillation. In GenEvolve, each generation attempt is modeled as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing agentic generation methods that mainly rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience, provided only to a privileged teacher branch. Inspired by on-policy self-distillation, Visual Experience Distillation provides dense token-level supervision, helping the student internalize better search, knowledge activation, reference selection, and prompt construction. We further construct GenEvolve-Data and GenEvolve-Bench. Experiments on public benchmarks and GenEvolve-Bench show substantial gains over strong baselines, achieving state-of-the-art performance among current image-generation frameworks. Our website is as follows: this https URL

[CV-140] Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Link: https://arxiv.org/abs/2605.21573
Authors: Dong Chen,Fangyun Wei,Ziyu Wan,Dongdong Chen,Jiawei Zhang,Jinjing Zhao,Sirui Zhang,Yang Yue,Zhiyang Liang,Baining Guo,Chong Luo,Jianmin Bao,Ji Li,Lei Shi,Qinhong Yang,Xiuyu Wu,Xuelu Feng,Yan Lu,Yanchen Dong,Yitong Wang,Yunuo Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.

[CV-141] PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

Link: https://arxiv.org/abs/2605.21572
Authors: Ziang Cao,Yinghao Liu,Haitian Li,Runmao Yao,Fangzhou Hong,Zhaoxi Chen,Liang Pan,Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

[CV-142] Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

Link: https://arxiv.org/abs/2605.21493
Authors: Rahul D Ray
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.
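
For readers unfamiliar with Mahalanobis-distance OOD scoring on L2-normalised features, here is a minimal sketch. It assumes 2-D features and a single class-agnostic Gaussian, and adds a small ridge term for invertibility. None of this is the GOEN pipeline itself (multi-scale features and the calibration head are omitted), and all function names are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def fit_gaussian_2d(features, ridge=1e-6):
    """Return the mean and inverse covariance of 2-D features."""
    n = len(features)
    mx = sum(f[0] for f in features) / n
    my = sum(f[1] for f in features) / n
    sxx = sum((f[0] - mx) ** 2 for f in features) / n + ridge
    syy = sum((f[1] - my) ** 2 for f in features) / n + ridge
    sxy = sum((f[0] - mx) * (f[1] - my) for f in features) / n
    det = sxx * syy - sxy * sxy
    inv = [[syy / det, -sxy / det], [-sxy / det, sxx / det]]
    return (mx, my), inv


def mahalanobis(x, mean, inv):
    """Mahalanobis distance of a 2-D point x from the fitted Gaussian."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)


# Toy in-distribution features clustered around one direction.
train = [l2_normalize(f) for f in [(1.0, 0.1), (1.0, -0.1), (0.9, 0.2), (1.1, -0.2), (1.0, 0.0)]]
mean, inv_cov = fit_gaussian_2d(train)

in_dist_score = mahalanobis(l2_normalize((1.0, 0.05)), mean, inv_cov)
ood_score = mahalanobis(l2_normalize((-1.0, 0.0)), mean, inv_cov)
print(in_dist_score, ood_score)
```

A larger distance is read as stronger evidence that the input is out-of-distribution. In this toy setup the point facing the opposite direction scores far higher than the nearby one.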

[CV-143] Time-varying rPPG signal separation via block-sparse signal model ICIP2026

Link: https://arxiv.org/abs/2605.22425
Authors: Kosuke Kurihara,Yoshihiro Maeda,Daisuke Sugimura,Takayuki Hamamoto
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE International Conference on Image Processing (ICIP 2026)

Abstract:Remote photoplethysmography (rPPG) enables non-contact measurement of cardiac pulse signals by analyzing subtle color changes in facial videos. Nevertheless, extracting rPPG signals remains challenging because of their extremely weak signal strength and susceptibility to illumination noise. In this paper, we propose an rPPG signal extraction method that exploits the quasi-periodic characteristics of rPPG signals. Our approach models quasi-periodicity of the rPPG signal, which arises from the stable cardiac cycle, as a block-sparse structure in the time-frequency domain. To incorporate a block-sparse model and enable adaptive signal separation under illumination fluctuations, we construct a time-varying signal separation framework. Experiments using a public dataset demonstrate the effectiveness of our method.

[CV-144] Entropy-Guided Self-Supervised Learning for Medical Image Classification

Link: https://arxiv.org/abs/2605.21970
Authors: Joao Florindo,Viviane Moura
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate and robust medical image classification is paramount for early disease diagnosis and treatment planning. However, challenges such as limited annotated data, high intra-class variability, and subtle inter-class differences often hinder the performance of deep learning models. This paper introduces a synergistic deep learning framework that leverages the strengths of self-supervised learning and transfer learning for enhanced medical image classification. Our approach employs two distinct ConvNeXt-Tiny models: one pre-trained on a large-scale natural image dataset (ImageNet) and another pre-trained using an entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both models are then fine-tuned on specific medical image classification tasks. A final ensemble strategy, based on averaging predicted probabilities, is utilized to combine the complementary insights from these two models. Rigorous experimental validation across four diverse medical imaging datasets (Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC) 2018, Kvasir, and COVID) demonstrates the superior performance and robustness of our ensemble approach. The MAE pre-training significantly improves feature learning on domain-specific data, while the ImageNet pre-training provides strong generalizable features. The ensemble consistently achieves state-of-the-art results, outperforming individual models and existing methods, highlighting the efficacy of combining diverse pre-training strategies for challenging medical image analysis.
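
The ensemble step described in the abstract, averaging the predicted class probabilities of the two fine-tuned models, can be sketched as follows. The probability vectors are made-up numbers for a three-class problem and the function names are illustrative.

```python
def average_probabilities(probs_a, probs_b):
    """Element-wise mean of two class-probability vectors."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must predict the same number of classes")
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


def predicted_class(probs):
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical outputs for one image from the two models.
imagenet_model_probs = [0.6, 0.3, 0.1]
mae_model_probs = [0.2, 0.7, 0.1]

ensemble_probs = average_probabilities(imagenet_model_probs, mae_model_probs)
print(predicted_class(ensemble_probs))
```

Here the two models disagree on their top class, and the averaged vector favours class 1 because the second model is more confident than the first.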

[CV-145] An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation

Link: https://arxiv.org/abs/2605.21835
Authors: Xiaofeng Liu,Qianru Zhang,Thibault Marin,Menghua Xia,Chi Liu,Georges El Fakhri,Jinsong Ouyang
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Code available at: this https URL

Abstract:The synergistic interpretation of anatomical information from computed tomography (CT) and metabolic information from positron emission tomography (PET) is important to oncologic imaging. However, existing deep learning methods for PET/CT remain largely task-specific, are often trained on single-center cohorts, or adopt dual-branch fusion schemes that delay cross-modal interaction and underutilize early spatial correspondence between PET and CT. To address these limitations, we present an open-source, multi-center, whole-body FDG PET/CT foundation model utilizing 4,997 harmonized scans from four public datasets. Our framework employs hierarchical UNet-shaped backbones with early channel-wise concatenation, enabling anatomical and metabolic features to interact from the first embedding layer onward. We further introduce a masked autoencoding objective based on zero-mean imputation, combined with a weighted global reconstruction loss. This design avoids non-physical intensity discontinuities at masked-region boundaries that arise from learnable mask tokens. On downstream AutoPET lesion segmentation, the proposed models demonstrate strong label efficiency: with only 10% of the labeled training data, they achieve performance comparable to models trained from scratch on the full dataset. Under extreme 5-shot linear probing, joint PET/CT pretraining also achieves higher Dice scores than separated-modality pretraining. This multi-center foundation model demonstrates label efficiency and cross-modality representation learning for PET/CT tumor segmentation. It provides a robust, open-source basis for advancing automated oncologic imaging, significantly reducing the need for large-scale manual annotations in clinical practice.

[CV-146] Mapping Tomato Cropping Systems in California Using AlphaEarth Geospatial Embeddings and Deep Learning Analysis

Link: https://arxiv.org/abs/2605.21804
Authors: Mohammadreza Narimani,Alireza Pourreza,Parastoo Farajpoor
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, 1 table. Preprint submitted to ASABE 2026 AIM

Abstract:Field-scale crop maps support supply-chain forecasting and policy, yet statewide crop identification still often depends on retrospective surveys or remote-sensing workflows built around hand-engineered spectral features. Those pipelines can be accurate, but they require repeated preprocessing and often lose robustness across years. This study evaluated whether Google DeepMind’s AlphaEarth geospatial embeddings can serve as an analysis-ready alternative for mapping processing tomato systems in California. LandIQ 2018 crop polygons were used to assemble a balanced reference dataset of 4,742 tomato and 4,742 non-tomato fields. For each polygon, 64-band AlphaEarth embedding chips were extracted and aligned with binary masks, then divided into spatially independent training (n = 6,638), validation (n = 1,422), and test (n = 1,424) sets. A U-Net segmentation model was trained on AWS SageMaker using a composite masked binary cross-entropy and soft Dice loss. To complement hard predictions, Monte Carlo dropout was retained at inference and repeated 100 times per chip to estimate predictive mean and variance. On the independent test set, the model achieved 99.19% pixel accuracy, 98.69% precision, 99.40% recall, 99.04% F1 score, 98.11% intersection over union, and 99.02% chip accuracy. Uncertainty maps were consistently highest near field edges and low within field interiors. The results show that AlphaEarth embeddings retain crop-relevant spatial and temporal structure and can support accurate, field-scale tomato mapping without manual feature engineering.
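
The reported metrics can be cross-checked against each other. Assuming precision, recall, F1, and IoU were all computed from the same aggregated pixel counts (an assumption; the abstract does not say how they were averaged), F1 = 2PR/(P+R) and 1/IoU = 1/P + 1/R - 1. The split sizes should also add up to the balanced reference set.

```python
# Values quoted in the abstract.
precision = 0.9869
recall = 0.9940

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# With TP, FP, FN from the same counts: IoU = TP / (TP + FP + FN),
# which rearranges to 1/IoU = 1/P + 1/R - 1.
iou = 1.0 / (1.0 / precision + 1.0 / recall - 1.0)

# Dataset split sizes versus the balanced reference set.
train, val, test = 6638, 1422, 1424
total_fields = 2 * 4742

print(f"F1  = {f1:.2%}")
print(f"IoU = {iou:.2%}")
print(train + val + test == total_fields)
```

Under that assumption both derived values land within rounding of the reported 99.04% F1 and 98.11% IoU.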

[CV-147] HyperBench: Standardizing and Scaling Synthetic Evaluation for Hyperspectral Super-Resolution

Link: https://arxiv.org/abs/2605.21671
Authors: Ritik Shah,Marco F. Duarte
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral super-resolution (HSR) reconstructs a high-spatial-resolution hyperspectral image by fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI). In the absence of real-world paired data, HSR methods are evaluated almost exclusively on synthetic experiments derived from hyperspectral datasets through Wald’s protocol. Despite the protocol’s widespread adoption, its practical implementation varies markedly across research works, typically relying on a single (usually Gaussian) or very few point spread functions (PSFs), one or two spectral response functions (SRFs), and a couple of spatial downsampling factors. As a result, reported performance figures are difficult to compare across the literature, in addition to being often difficult to reproduce; furthermore, they may not generalize across realistic sensing conditions. We introduce HyperBench, a unified and extensible framework that standardizes synthetic experimentation for HSR. HyperBench supports diverse degradation configurations spanning ten PSFs, four SRFs derived from operational multispectral sensors, configurable spatial downsampling factors, and matched additive white Gaussian noise; its goal is to automate large-scale evaluation and structured logging. By decoupling model development from experimental design, the framework enables reproducible, apples-to-apples cross-method comparison with minimal friction. We use HyperBench to evaluate six recently proposed HSR methods across a 70-configuration sweep on four widely used hyperspectral scenes and observe that the inter-method PSNR spread widens from approximately 5 dB on the easiest PSF to over 13 dB on the hardest - a fragility that is structurally invisible to the prevailing single-configuration evaluation protocol. HyperBench code is available at this https URL .

[CV-148] VRXU-net: A Deep Learning Approach for Brain Ischemic Stroke Lesion Detection and Segmentation in T1W MRI

Link: https://arxiv.org/abs/2605.21633
Authors: Sayed Amir Mousavi Mobarakeh
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:When the blood supply to the brain is obstructed by a clot, oxygen delivery to brain tissues becomes insufficient, leading to cellular necrosis. In healthcare settings, accurately identifying and delineating ischemic lesion boundaries is essential for treatment and surgical planning. However, ischemic stroke lesions vary widely in shape, size, and location, and in grayscale MRI modalities such as T1W they may resemble surrounding brain structures. This makes lesion detection and segmentation a challenging task for clinicians. This study introduces a novel VRU-Net architecture, derived from visual features, residual connections, and a U-shaped network, for detecting and segmenting ischemic stroke lesions in 3D magnetic resonance imaging scans. The proposed method first uses a modified VGG model to identify ischemic stroke in separate 2D slices. Then, a U-shaped segmentation model with residual blocks segments the lesion in each slice. This procedure is applied independently to the axial, sagittal, and coronal planes, and the final output is generated by aggregating the three segmentation results. To improve both performance and processing speed, a high-performance classifier is applied before the segmentation model in a sequential framework. This strategy reduces unnecessary segmentation of non-lesion slices and improves overall accuracy. In addition, decomposing 3D images into 2D slices reduces model complexity while allowing information from three anatomical planes to support more accurate lesion localization. The proposed model is trained on the Anatomical Tracings of Lesions After Stroke dataset and outperforms state-of-the-art models in terms of accuracy and Dice coefficient. Moreover, the segmentation output provides feedback that helps the classification model reduce false-positive predictions.

[CV-149] CryoNet: A Deep Learning Framework for Multi-Modal Debris-Covered Glacier Mapping. A Case Study of the Poiqu Basin, Central Himalaya

Link: https://arxiv.org/abs/2605.21527
Authors: Farzaneh Barzegar,Tobias Bolch,Norbert Kuehtreiber,Silvia L. Ullo
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 10 figures, 5 tables. Preprint submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS); currently under review

Abstract:Glaciers play a critical role as freshwater reserves and indicators of climate change, yet their automatic delineation, especially for debris-covered glaciers, remains challenging due to spectral similarity with surrounding terrain. This study introduces CryoNet, a deep learning framework that leverages a rich multi-modal dataset combining Sentinel-2 optical imagery, DEM-derived topographic variables, spectral indices, Principal Component Analysis (PCA), InSAR coherence and phase, tasseled-cap features, and GLCM texture to discriminate clean-ice glaciers, debris-covered glaciers, and glacial lakes. CryoNet is an encoder-decoder CNN with nested skip connections and spatial-channel Squeeze-and-Excitation (scSE) attention, built upon a ResNet101 encoder to capture hierarchical contextual and spatial features. The study is conducted in the Poiqu Basin in the central Himalaya, and transferability is evaluated by applying the trained model to the Mont Blanc Massif in the Alps. We additionally analyse the importance of each data layer in improving glacier mapping performance. The proposed model achieves an overall IoU of 90.52%, mean Recall of 98.08%, and mean Precision of 92.26%. For debris-covered glaciers specifically, CryoNet obtains an IoU of 90.46%, a recall of 95.79%, and a precision of 94.21%. Across both per-class and overall metrics, CryoNet surpasses DeepLabV3+, SegFormer, and U-Net, taken as state-of-the-art (SOTA) references, demonstrating its effectiveness for robust glacier mapping in complex high-mountain environments.

[CV-150] Tackle CSM in JPEG Steganalysis with Data Adaptation

Link: https://arxiv.org/abs/2605.21523
Authors: Rony Abecidan(CRIStAL),Vincent Itier(IMT Nord Europe, CRIStAL),Jérémie Boulanger(CRIStAL),Patrick Bas(CRIStAL),Tomáš Pevný(CTU)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Signal Processing (eess.SP)
Comments: ACM Workshop on Information Hiding and Multimedia Security, (IHMMSec '26), Jun 2026, Florence, Italy

Abstract:Steganalysis models excel on benchmark datasets but struggle in the wild when the analyzed images are produced by a processing pipeline unseen during training. This problem, known as Cover Source Mismatch (CSM), is particularly hard in realistic settings where practitioners (1) have access to only a small, unlabeled dataset, (2) are unsure of the processing techniques applied to these images, and (3) lack information on the proportion of covers and stegos in that set. To address this challenge, we introduce TADA (Target Alignment through Data Adaptation), a framework that learns to emulate the unknown processing pipeline from a small unlabeled target set. This architecture is trained with a loss combining residual covariance alignment, residual distribution matching, and an \ell^2 loss constraining the emulator to produce realistic images. Across toy and operational targets, TADA yields substantial gains in robustness to CSM and improves operational generalization compared to strong holistic and atomistic baselines. Additional resources are available at this link: this https URL

[CV-151] A Task-Agnostic Algebraic Integrity Metric for Event-Camera Streams Toward SOTIF-Compliant Perception using Pearson Correlation Coefficient

Link: https://arxiv.org/abs/2605.21500
Authors: Arthur de Miranda Neto
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 3 tables, 14 equations. Theoretical framework paper with procedural-synthetic illustrations; empirical validation on real datasets reserved for follow-up. Code and demonstration video available

Abstract:Event cameras have emerged as a high-bandwidth, low-latency sensing modality for safety-critical perception in automated driving systems (ADS), offering microsecond temporal resolution, 120-140 dB dynamic range, and intrinsic absence of motion blur. However, no task-agnostic quality metric currently operates directly on the asynchronous event stream: state-of-the-art proxies require a downstream task (e.g., detection accuracy, tracking error) to assess stream integrity, which is incompatible with the certification requirements of ISO 21448 (SOTIF) and ISO/PAS 8800:2024. The recent BiasBench benchmark (CVPR 2025) explicitly identifies this gap. This work proposes a unified algebraic framework that lifts the Pearson Correlation Coefficient (PCC), historically used in two prior works for redundancy filtering and ROI selection on frame-based images, to the three standard event representations: Time Surface, Event Frame, and Voxel Grid. The framework yields three metrics: (i) r-TS for stream integrity monitoring against an ego-motion-predicted Time Surface, (ii) r2-EF for adaptive ROI selection requiring only integer comparisons, and (iii) r-VG for temporal redundancy gating. A structural isomorphism is established between the contrast-threshold mechanism of the event camera (|Delta L| = C) and the PCC-based change criterion, the three lifted metrics are formalized, and pipeline latency and information loss are analyzed symmetrically against the raw stream. Illustrative behavior of each metric is demonstrated on a procedural-synthetic event stream, generated by direct simulation of the emission model rather than drawn from any real or video-derived dataset, including a tunnel-dip integrity-anomaly scenario in which r_C drops from 0.93 (coherent flow) to below 0 (alarm). An explicit epistemic convention ([ESTABLISHED], [SOLID], [HYPOTH.], [OPEN]) delineates the status of every contribution.
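
All three metrics in the abstract build on the Pearson Correlation Coefficient. As a reminder of the underlying quantity, here is a plain-Python PCC between two flattened frames. The toy frames and the zero-variance convention are assumptions made for this sketch, not the paper's r-TS, r2-EF, or r-VG definitions.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Convention chosen for this sketch: a constant frame has no correlation.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Toy flattened event-count frames (hypothetical values).
predicted_frame = [0, 1, 3, 2, 0, 4]
observed_similar = [0, 2, 6, 4, 0, 8]   # same pattern, scaled
observed_inverted = [4, 3, 1, 2, 4, 0]  # opposite pattern

r_similar = pearson(predicted_frame, observed_similar)
r_inverted = pearson(predicted_frame, observed_inverted)
print(r_similar, r_inverted)
```

PCC is invariant to scale and offset, so a frame with the same spatial pattern but twice the event count still correlates at about 1, while an inverted pattern correlates at about -1.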

Artificial Intelligence

[AI-0] The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning ATC

Link: https://arxiv.org/abs/2605.22800
Authors: Vishal Rajput
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 54 pages. 13 empirical task blocks. Companion software: matching-pmh (PyPI; this https URL ). Related arXiv note: 2604.21395 (geometric blind spot / isotropic PMH)

Abstract:Robustness, domain adaptation, photometric and occlusion invariance, compositional generalisation, temporal robustness, alignment safety, and classical anisotropic regularisation are usually treated as separate problems with separate method families. This paper argues that much of their shared structure is one statistical problem: estimate the covariance of label-preserving deployment nuisance, then regularise the encoder Jacobian along a matrix whose range covers that covariance (the matching principle). CORAL, adversarial training, IRM, augmentation, metric learning, Jacobian penalties, and alignment-style constraints are different estimators of that object, not independent robustness tricks. In the linear-Gaussian model we prove closed-form optimality (Theorem A), including cube-root water-filling within the matched range; necessity of range coverage for quadratic Jacobian penalties (Theorem G); the same range dichotomy at deep global minima; and two falsification controls (Lemma C; Corollaries E), with seven conditional consistency lemmas (D1-D7) for estimation under standard identifiability assumptions. We introduce the Trajectory Deviation Index (TDI), a label-free probe of embedding sensitivity when task accuracy or Jacobian Frobenius norm is insufficient. Thirteen pre-registered blocks from classical ML through Qwen2.5-7B test the predicted matched, then isotropic, then wrong-W ordering on geometry and deployment drift; twelve pass, and the sole exception (Office-31) is an eigengap failure named before the run. At 7B scale, matched style-PMH improves selective honesty and preserves Style TDI where standard DPO degrades it. The contribution is naming the deployment nuisance covariance, stating what the regulariser must do, and supplying a closed-form falsifiable theory once that object is identified, not universality on every leaderboard.

[AI-1] MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Link: https://arxiv.org/abs/2605.22794
Authors: Qianshu Cai,Yonggang Zhang,Xianzhang Jia,Wei Xue,Jun Song,Xinmei Tian,Yike Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, 2 tables. Preprint. Code: this https URL

Abstract:Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts – skill files, prompt configurations, memory schemas, workflow graphs – and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

[AI-2] Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Link: https://arxiv.org/abs/2605.22791
Authors: Ali Hatamizadeh,Yejin Choi,Jan Kautz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Gated DeltaNet-2 technical report; code at this https URL

Abstract:Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at this https URL.
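
The "subtract the current read before writing" step mentioned above can be illustrated with the classic scalar-gated delta rule on a fixed-size state matrix. This is a minimal sketch of that well-known baseline only, not the paper's Gated Delta Rule-2 (whose channel-wise erase and write gates are not specified in the abstract); the function names `read_state` and `delta_rule_step` are hypothetical.

```python
def read_state(S, k):
    """Read from the state: returns S @ k for a d_v x d_k state S."""
    return [sum(row[c] * k[c] for c in range(len(k))) for row in S]


def delta_rule_step(S, k, v, beta):
    """One scalar-gated delta-rule update (illustrative baseline).

    S_new = S - beta * (S k - v) k^T
    The current read S k is subtracted before the new value v is written,
    and the single scalar beta controls both erasing and writing.
    """
    current = read_state(S, k)
    return [
        [S[r][c] - beta * (current[r] - v[r]) * k[c] for c in range(len(k))]
        for r in range(len(S))
    ]


# Usage: write a value under a unit-norm key, then overwrite it.
S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_rule_step(S, k=[1.0, 0.0], v=[3.0, 4.0], beta=1.0)
S_full = delta_rule_step(S, k=[1.0, 0.0], v=[5.0, 6.0], beta=1.0)
S_half = delta_rule_step(S, k=[1.0, 0.0], v=[5.0, 6.0], beta=0.5)
```

With `beta=1.0` and a unit-norm key the old association is fully replaced; with `beta=0.5` the stored value moves halfway toward the new one, which shows how one scalar ties erasing and writing together.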

[AI-3] DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

Link: https://arxiv.org/abs/2605.22781
Authors: Yunpeng Dong,Jingkai He,Yuze Hou,Dong Du,Zhonghu Xu,Si Yu,Yubin Xia,Haibo Chen
Affiliation: Unknown
Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory, contexts, etc.). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that subsequent checkpoints in AI agents are highly similar. Therefore, instead of full duplication, a sandbox should only duplicate the changes between consecutive checkpoints (Key Insight). However, it is non-trivial to realize the idea, mainly due to the missing OS supports. This paper proposes a new OS-level abstraction, DeltaState, to enable the change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers and dynamically freezing the writable layer and inserting a new one during checkpoint, reducing file updates to copy-on-write, and making rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback in millisecond-level latency (14ms and 5ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.
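
The layering idea described above (freeze the writable layer at checkpoint, switch layers at rollback) can be illustrated with a toy in-memory store. This is only a sketch of the general copy-on-write layering pattern, not DeltaFS itself; the class `LayeredStore` and all of its methods are hypothetical names introduced for illustration.

```python
class LayeredStore:
    """Toy key-value 'filesystem' built from a stack of layers.

    Only the top layer is writable; every lower layer is frozen.
    """

    _DELETED = object()  # tombstone: key was removed in an upper layer

    def __init__(self):
        self.layers = [{}]  # ordered bottom to top; the last one is writable

    def write(self, path, data):
        self.layers[-1][path] = data

    def delete(self, path):
        self.layers[-1][path] = self._DELETED

    def read(self, path):
        # Search from the newest layer down to the oldest.
        for layer in reversed(self.layers):
            if path in layer:
                value = layer[path]
                if value is self._DELETED:
                    raise KeyError(path)
                return value
        raise KeyError(path)

    def checkpoint(self):
        """Freeze the writable layer and push a fresh one on top.

        Returns the checkpoint id (the number of frozen layers).
        No existing data is copied.
        """
        self.layers.append({})
        return len(self.layers) - 1

    def rollback(self, checkpoint_id):
        """Drop every change made after the given checkpoint."""
        if not 1 <= checkpoint_id <= len(self.layers) - 1:
            raise ValueError("unknown checkpoint")
        self.layers = self.layers[:checkpoint_id] + [{}]


# Usage: two checkpoints, then roll back to each in turn.
store = LayeredStore()
store.write("a", 1)
c1 = store.checkpoint()
store.write("a", 2)
store.write("b", 3)
c2 = store.checkpoint()
store.delete("a")
```

In this sketch a checkpoint costs one list append and a rollback costs one list slice, regardless of how much data the frozen layers hold; reads pay for this by walking the layer stack.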

[AI-4] SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

Link: https://arxiv.org/abs/2605.22776
Authors: Stanislav R. Kirpichenko,Andrei V. Konstantinov,Lev V. Utkin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
Comments:

Abstract:Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival Diffusion Probabilistic Model (SDPM), a generative approach to continuous-time survival analysis. SDPM models the conditional distribution of the survival outcome, represented by the pair of observed time and censoring indicator, \mathbb{P}(T, \delta \mid \mathbf{x}), using a denoising diffusion model. Under the assumption of conditionally independent censoring, conditional samples generated by the model can be transformed into survival function estimates using the Kaplan-Meier estimator. This formulation avoids parametric assumptions on the event-time distribution and does not require a discretization of the output time space. The model operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. We evaluate SDPM on ten real survival datasets and compare it with five strong baselines, including tree-based, boosting-based, and neural survival models. Results show that SDPM achieves competitive predictive performance across C-index, integrated time-dependent AUC, and integrated Brier score. A study on synthetic Cox-Weibull data demonstrates that SDPM can recover the shape of an underlying continuous survival distribution more accurately than a strong nonparametric baseline when sufficiently many samples are generated. An ablation study confirms the importance of the proposed target-space transformations, which improve event-rate calibration, reduce invalid generated times, and provide consistent gains in predictive discrimination. Code implementing the proposed model is publicly available.
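
The step of turning (time, censoring indicator) samples into a survival curve uses the standard Kaplan-Meier estimator, which can be sketched as follows. The function name `kaplan_meier` and the sample values are hypothetical; in SDPM the pairs would be drawn from the diffusion model for a given covariate vector, whereas here they are hand-written placeholders.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  observed times (event or censoring time)
    events: 1 if the event was observed, 0 if the sample was censored
    Returns a list of (t, S(t)) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all samples that share the same observed time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored samples leave the risk set.
        n_at_risk -= n_removed
    return curve


# Usage with four placeholder samples; the second one is censored.
curve = kaplan_meier(times=[1.0, 2.0, 3.0, 4.0], events=[1, 0, 1, 1])
```

Censored samples do not produce a step in the curve, but they shrink the risk set, so later events cause larger drops.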

[AI-5] Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

Link: https://arxiv.org/abs/2605.22773
Authors: Yu Tang,Muhammad Zakwan,Efe Balta,John Lygeros,Alisa Rupenyan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:The Flexible Job Shop Scheduling Problem (FJSP) concerns the optimal allocation of a set of jobs to machines. Two primary challenges persist in FJSP: the unpredictable arrival of future jobs and the combinatorial complexity of the problem, rendering it intractable for conventional mixed-integer linear programming solvers. This paper proposes an event-based Deep Reinforcement Learning (DRL) approach to solve FJSP with random job arrivals. Specifically, we employ the Proximal Policy Optimization algorithm and use lightweight Multi-Layer Perceptrons to train the DRL agent for minimizing the total completion time of all jobs. We design the state representation to be directly accessible from the environment, and limit the learning agent to selecting from among a set of well-established dispatching rules. Simulations show that our DRL approach outperforms any of the individual dispatching rules on datasets with varying heterogeneity and job arrival rates. We benchmark our DRL approach against an arrival-triggered mixed-integer linear programming solution and show that our method achieves good performance especially when the datasets are heterogeneous.
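
Restricting the agent to choosing among dispatching rules means its action is a rule index rather than a job index. A minimal sketch of such an action space follows, assuming three common textbook rules (FIFO, SPT, LPT); the abstract does not say which rules the paper uses, and `Operation`, `DISPATCHING_RULES`, and `dispatch` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    job_id: int
    arrival_time: float
    processing_time: float


def fifo(queue):
    """First in, first out: earliest arrival goes first."""
    return min(queue, key=lambda op: op.arrival_time)


def spt(queue):
    """Shortest processing time first."""
    return min(queue, key=lambda op: op.processing_time)


def lpt(queue):
    """Longest processing time first."""
    return max(queue, key=lambda op: op.processing_time)


# The discrete action space: one action per dispatching rule (assumed set).
DISPATCHING_RULES = [fifo, spt, lpt]


def dispatch(queue, action):
    """Apply the rule chosen by the policy and remove the selected operation."""
    chosen = DISPATCHING_RULES[action](queue)
    queue.remove(chosen)
    return chosen


def make_queue():
    return [
        Operation(job_id=0, arrival_time=0.0, processing_time=5.0),
        Operation(job_id=1, arrival_time=1.0, processing_time=2.0),
        Operation(job_id=2, arrival_time=2.0, processing_time=9.0),
    ]
```

A trained policy would map the environment state to one of these indices at each decision event; the size of the action space stays fixed no matter how many jobs are waiting.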

[AI-6] Advancing Mathematics Research with AI-Driven Formal Proof Search

Link: https://arxiv.org/abs/2605.22763
Authors: George Tsoukalas,Anton Kovsharov,Sergey Shirobokov,Anja Surina,Moritz Firsching,Gergely Bérczi,Francisco J. R. Ruiz,Arun Suggala,Adam Zsolt Wagner,Eric Wieser,Lei Yu,Aja Huang,Miklós Z. Horváth,Andrew Ferrauiolo,Henryk Michalewski,Codrut Grosu,Thomas Hubert,Matej Balog,Pushmeet Kohli,Swarat Chaudhuri
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: The first three authors and the last author have equal contributions. The first three authors are in random order

Abstract:Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method’s ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erdős problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erdős successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.

[AI-7] Towards a General Intelligence and Interface for Wearable Health Data

Link: https://arxiv.org/abs/2605.22759
Authors: Girish Narayanswamy,Maxwell A. Xu,A. Ali Heydari,Samy Abdel-Ghaffar,Marius Guerard,Kara Vaillancourt,Zhihan Zhang,Jake Garrison,Levi Albuquerque,Dimitris Spathis,Hong Yu,Hamid Palangi,Xuhai “Orson” Xu,David G.T. Barrett,Joseph Breda,Jed McGiffin,Yubin Kim,Yuwei Zhang,Naghmeh Rezaei,Samuel Solomon,Karan Ahuja,Tim Althoff,Jake Sunshine,Ming-Zher Poh,Benjamin Yetton,Ari Winbush,Nicholas B. Allen,James M. Rehg,Isaac Galatzer-Levy,Yun Liu,John Hernandez,Anupam Pathak,Conor Heneghan,Yuzhe Yang,Ahmed A. Metwally,Pushmeet Kohli,Mark Malhotra,Shwetak Patel,Xin Liu,Daniel McDuff
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low-level sensor data into representations capable of characterizing higher-level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically unfeasible, contributing to a scarcity of data with high-quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population scale representation unlocks label-efficient few-shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.

[AI-8] Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization

Link: https://arxiv.org/abs/2605.22749
Authors: Adis Alihodžić,Eva Tuba,Milan Tuba
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern smart grids rely on dense measurement infrastructures, communication links, and intelligent field devices. Although this improves supervision and control, it also increases vulnerability to cyber-physical disruptions. Operators must distinguish physical incidents, such as faults or line disturbances, from malicious actions, such as false data injection or unauthorized command execution. This chapter investigates this problem using the well-known MSU/ORNL Power System Attack Dataset. The proposed method combines machine learning with genetic-algorithm-based feature selection. The objective is twofold: to classify attack and natural events accurately, and to determine whether a reduced set of physically informative PMU/IED measurements can support reliable detection. Several baseline models are evaluated, including logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees. The results show that tree-based ensemble models are the most effective for the considered dataset, with Extra Trees providing the strongest full-feature baseline. After feature selection, the GA + Extra Trees model reduces the clean PMU feature space from 112 attributes to an average of 27.4 attributes over five runs, while increasing macro-F1 from 0.9118 to 0.9212 and ROC-AUC from 0.9791 to 0.9837. These results indicate that many synchronized electrical measurements are redundant. A compact subset of phasor-based features can still provide accurate and interpretable anomaly detection in smart grids.

[AI-9] Proxy-Based Approximation of Shapley and Banzhaf Interactions

Link: https://arxiv.org/abs/2605.22738
Authors: Santo M. A. R. Thies,Hubert Baniecki,R. Teal Witter,Eyke Hüllermeier,Maximilian Muschalik,Fabian Fumagalli
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On a theoretical level, we derive a polynomial-time generalization of interventional TreeSHAP to compute exact interaction indices for tree ensembles, successfully bypassing exponential tree-depth dependencies in prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state-of-the-art standard for approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
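
For orientation, the quantities that such estimators approximate can be computed exactly by brute force on a small game. The sketch below evaluates the pairwise Shapley and Banzhaf interaction indices from their textbook definitions; it enumerates all subsets and is therefore exponential in the number of players, and it is not the ProxySHAP algorithm. The function name `pairwise_interaction` and the toy game `both_needed` are hypothetical.

```python
from itertools import combinations
from math import factorial


def pairwise_interaction(value, n, i, j, kind="shapley"):
    """Exact pairwise interaction index of players i and j.

    value: function mapping a frozenset of players to a number
    n:     total number of players (labelled 0..n-1), n >= 2
    kind:  "shapley" or "banzhaf"
    """
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        if kind == "shapley":
            weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        elif kind == "banzhaf":
            weight = 1.0 / 2 ** (n - 2)
        else:
            raise ValueError("kind must be 'shapley' or 'banzhaf'")
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second derivative of the game with respect to i and j.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def both_needed(coalition):
    """Toy game: pays 1 only when players 0 and 1 are both present."""
    return 1.0 if {0, 1} <= coalition else 0.0


shapley_01 = pairwise_interaction(both_needed, n=3, i=0, j=1, kind="shapley")
banzhaf_01 = pairwise_interaction(both_needed, n=3, i=0, j=1, kind="banzhaf")
shapley_02 = pairwise_interaction(both_needed, n=3, i=0, j=2, kind="shapley")
```

In the toy game, players 0 and 1 are pure complements, so their interaction is 1 under both indices, while player 2 interacts with neither.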

[AI-10] The Distillation Game: Adaptive Attacks Efficient Defenses

Link: https://arxiv.org/abs/2605.22737
Authors: Youssef Allouah,Mahdi Haghifam,Sanmi Koyejo,Reza Shokri
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive–adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: this https URL.

[AI-11] HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

Link: https://arxiv.org/abs/2605.22733
Authors: Edwin Jose
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Every Python function deployed as an LLM tool must today exist in two forms: an HTTP endpoint for human-facing clients and CI pipelines, and an MCP tool registration for agent runtimes such as Claude and Cursor. These representations share business logic yet diverge in all the surrounding machinery (routing, validation, serialisation, streaming, and schema maintenance), and they drift apart as the underlying code evolves. We present HarnessAPI, a Python framework that eliminates this duplication by treating a typed skill folder as the single source of truth. From one this http URL plus Pydantic schemas, the framework automatically derives a streaming HTTP endpoint with Server-Sent Events, an interactive OpenAPI/Swagger UI, and a zero-configuration MCP tool, all served from a single process. Dual-mode content negotiation lets the same handler serve SSE-streaming and JSON-returning clients with no handler changes. A dynamic code-generation mechanism ensures Pydantic type annotations propagate correctly to FastMCP’s inspection layer, resolving a technical limitation that prevents naive closure-based registration. Measured across six representative skills using cloc, HarnessAPI reduces framework-facing boilerplate by 74% compared with a manually maintained dual-stack implementation (FastAPI server + FastMCP server). HarnessAPI subclasses FastAPI, inheriting its full middleware, dependency-injection, and deployment ecosystem. It is available at this https URL and on PyPI (pip install harnessapi).

[AI-12] Post-Training is About States Not Tokens: A State Distribution View of SFT RL and On-Policy Distillation

Link: https://arxiv.org/abs/2605.22731
Authors: Dong Nie
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.

[AI-13] The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler

Link: https://arxiv.org/abs/2605.22723
Authors: Md Sahil Akhtar,Aymane El Gadarri,Vivek F. Farias,Adam D. Jozefiak
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Abstract:A central error measure in Gaussian DDPMs is the path-space KL divergence between the exact reverse chain and the learned Gaussian reverse process. This quantity is especially relevant for procedures such as classifier guidance, which perturb the entire reverse trajectory rather than only the terminal sample. Prior analyses show that standard isotropic reverse covariances suffer an unavoidable \Omega(1/T) path-KL error as the number of denoising steps T grows. We show that matching the full posterior covariance breaks this barrier, yielding an order-wise improvement that reduces the path KL to O(1/T^2). To make full covariance matching practical, we introduce the Lanczos Gaussian sampler (LGS), a training-free, matrix-free method for sampling from the optimal reverse covariance using only covariance-vector products, which are available through Jacobian-vector products of the posterior mean. LGS avoids dense covariance storage and auxiliary covariance models. We prove that LGS approximation error decays exponentially in the number of Lanczos steps, where each Lanczos step requires a single Jacobian-vector product. Empirically, using only three such steps improves sample quality over strong diagonal-covariance baselines, including OCM-DDPM, across standard image benchmarks. This identifies full covariance matching as both theoretically valuable and practically accessible for fast DDPM sampling.

[AI-14] Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Link: https://arxiv.org/abs/2605.22717
Authors: Zachary Novack,Stephen Brade,Haven Kim,Hugo Flores García,Nithya Shikarpur,Chinmay Talegaonkar,Suwan Kim,Valerie K. Chen,Julian McAuley,Taylor Berg-Kirkpatrick,Cheng-Zhi Anna Huang
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a “generative delay” to transform musicians’ improvisation live for variable timbral effects while running locally on a consumer gaming laptop.

[AI-15] Parametric Modular Answer Set Programs Made Declarative

Link: https://arxiv.org/abs/2605.22716
Authors: Jorge Fandinno,Yuliya Lierler,Torsten Schaub
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: To appear in Theory and Practice of Logic Programming

Abstract:In this paper, we explore the concept of modularity in first-order answer set programming (ASP). We introduce a new formalism called parametric modular logic programs, which allows defining subprograms with parameters and intensionality statements. We demonstrate how this formalism can capture the semantics of clingo-programs with collective control, a feature that enables structuring and instantiating subprograms. We provide theoretical foundations for modular ASP, illustrate its usefulness, and connect to traditional non-modular ASP.

[AI-16] Abstraction for Offline Goal-Conditioned Reinforcement Learning

Link: https://arxiv.org/abs/2605.22711
Authors: Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.

[AI-17] Scout-Assisted Planning for Heterogeneous Robot Teams under Partially Known Environments

Link: https://arxiv.org/abs/2605.22693
Authors: Hoang-Dung Bui, Abhish Khanal, Raihan Islam Arnob, Gregory J. Stein
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous robot teams navigating partially known environments face costly backtracking when ground robots encounter blocked roads that are only revealed upon physical traversal. We address this with Scout-Assisted Planning (SAP), a heterogeneous planning framework in which scouting Unmanned Aerial Vehicles proactively gather environmental information to improve Unmanned Ground Vehicle navigation. To focus scouting on the most consequential edges, we propose Information Gain-based Action Pruning, which scores candidate scouting actions by their expected impact on ground robot behavior. Since exact Information Gain-based Action Pruning computation is prohibitively expensive, we develop a Graph Neural Network-based model that predicts information gain values directly from graph structure and belief state, reducing planning time to real-time levels without sacrificing solution quality. Experiments across three environment types show that SAP with Information Gain-based Action Pruning reduces ground robot travel cost by 31.9–37.7% over the Canadian Traveler Problem baseline, and outperforms proximity-based scouting guidance by an additional 8–14%, confirming that principled information-gain-guided scouting is both more effective and computationally feasible for real-world deployment.
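
The idea of scoring a scouting action by its expected impact on the ground robot can be illustrated with a toy value-of-information calculation. This is a minimal sketch under strong assumptions, not the paper's algorithm: each candidate is a single uncertain edge with one known-safe detour, the scout's observation is perfect, and all names (`UncertainRoute`, `scouting_value`, `prune_actions`) and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainRoute:
    """Hypothetical candidate: a short route through one possibly blocked edge."""
    name: str
    short_cost: float   # total cost of the route through the uncertain edge
    detour_cost: float  # total cost of the known-safe alternative from the start
    reach_cost: float   # cost from the start to the uncertain edge
    p_blocked: float    # prior probability that the edge is blocked


def expected_cost_without_scout(r: UncertainRoute) -> float:
    # Trying the short route risks driving to the edge, backtracking, then detouring.
    try_short = (1 - r.p_blocked) * r.short_cost + r.p_blocked * (
        2 * r.reach_cost + r.detour_cost
    )
    return min(try_short, r.detour_cost)


def expected_cost_with_scout(r: UncertainRoute) -> float:
    # With perfect information the ground robot picks the best open route up front.
    open_cost = min(r.short_cost, r.detour_cost)
    return (1 - r.p_blocked) * open_cost + r.p_blocked * r.detour_cost


def scouting_value(r: UncertainRoute) -> float:
    """Expected travel cost saved by scouting this edge first."""
    return expected_cost_without_scout(r) - expected_cost_with_scout(r)


def prune_actions(routes: list[UncertainRoute], keep: int) -> list[UncertainRoute]:
    """Keep only the `keep` scouting actions with the highest expected saving."""
    return sorted(routes, key=scouting_value, reverse=True)[:keep]


bridge = UncertainRoute("bridge", short_cost=10, detour_cost=18, reach_cost=4, p_blocked=0.3)
tunnel = UncertainRoute("tunnel", short_cost=10, detour_cost=11, reach_cost=1, p_blocked=0.3)

for route in prune_actions([tunnel, bridge], keep=2):
    print(route.name, round(scouting_value(route), 2))
```

In this toy setting the bridge is worth scouting first because a wrong guess there forces a long backtrack, whereas the tunnel's detour is cheap either way.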

[AI-18] Forecasting Scientific Progress with Artificial Intelligence

Link: https://arxiv.org/abs/2605.22681
Authors: Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, Junchi Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 73 pages, 13 figures, 29 tables

Abstract:Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.

[AI-19] Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Link: https://arxiv.org/abs/2605.22672
Authors: Nick Merrill, Jaeho Lee, Ezra Karger
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks, reversing the sign of the capability–accuracy relationship on identical outputs. Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
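
The contrast between per-quantile scoring and single-threshold scoring can be sketched with the standard pinball (quantile) loss and the Brier score. This is an illustrative sketch only: the forecasts, the threshold, and the exceedance probability are made-up numbers, and the helper names are hypothetical rather than taken from the paper or its benchmark.

```python
import numpy as np


def pinball_loss(y_true, q_pred, q):
    """Mean pinball loss of predictions q_pred for quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(q_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def per_quantile_losses(y_true, quantile_preds, levels):
    """Decompose the score by quantile level; quantile_preds has one column per level."""
    quantile_preds = np.asarray(quantile_preds, dtype=float)
    return {q: pinball_loss(y_true, quantile_preds[:, i], q) for i, q in enumerate(levels)}


def threshold_brier(y_true, p_exceed, threshold):
    """Brier score for the binary event y > threshold (a bounded, single-threshold metric)."""
    outcome = (np.asarray(y_true, dtype=float) > threshold).astype(float)
    return float(np.mean((np.asarray(p_exceed, dtype=float) - outcome) ** 2))


levels = [0.1, 0.5, 0.9]
y = np.array([100.0])                          # realised value (made up)
cautious = np.array([[80.0, 100.0, 130.0]])    # quantile forecasts of forecaster A
aggressive = np.array([[80.0, 100.0, 400.0]])  # forecaster B extrapolates the upper tail

loss_a = per_quantile_losses(y, cautious, levels)
loss_b = per_quantile_losses(y, aggressive, levels)

# Both forecasters share the same lower quantiles, so assume both assign
# probability 0.7 to the event y > 90 (an assumption for illustration).
brier_a = threshold_brier(y, np.array([0.7]), threshold=90.0)
brier_b = threshold_brier(y, np.array([0.7]), threshold=90.0)

print(loss_a, loss_b, brier_a, brier_b)
```

In this toy case the two forecasters are indistinguishable under the threshold metric, while the pinball loss at the 0.9 level is ten times larger for the aggressive forecaster.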

[AI-20] WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Link: https://arxiv.org/abs/2605.22664
Authors: Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, Hongseok Namkoong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

[AI-21] Claw AI Lab: An Autonomous Multi-Agent Research Team

Link: https://arxiv.org/abs/2605.22662
Authors: Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu, Yanyu Qian, Dingcheng Gao, Lanyun Zhu, Qi Zhu, Yi Tan, Deyi Ji, Guosheng Lin, Tianrun Chen, Deheng Ye, Fayao Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Project page and code are available at this https URL

Abstract:We present Claw AI Lab, a lab-native autonomous research platform that advances automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control through a unified dashboard. The platform also supports distinct research modes for exploration, multi-agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory-like in practice. A key practical contribution of Claw AI Lab lies in its Claw-Code Harness, which connects local codebases, datasets, and checkpoints to runnable experiments and feeds execution artifacts back into the research loop. As a result, the harness improves not only execution integration, but also experimental completion and result integrity: experiments are easier to inspect, iterate on, and faithfully transfer into final papers, reducing common failure modes such as partial runs and malformed result reporting. In our internal evaluation on five AI research case studies, using AutoResearchClaw as the baseline, Claw AI Lab is consistently preferred by AI expert judges on idea novelty, experiment completeness, and paper presentation quality. We view Claw AI Lab as an early step toward a new paradigm: autonomous research as usable, interactive, and reliability-aware scientific infrastructure.

[AI-22] AtelierEval: Agentic Evaluation of Humans and LLMs as Text-to-Image Prompters ICML2026

Link: https://arxiv.org/abs/2605.22645
Authors: Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.

[AI-23] Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Link: https://arxiv.org/abs/2605.22642
Authors: Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang, Zhaoheng Li, Shengyi Qian, Minjia Zhang, Klara Nahrstedt, Rui Hou, Xiangjun Fan, Hanchao Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Mingyuan served as the project lead. Banghao, Yining, and Mingyuan contributed equally to this work, with more junior authors listed before senior authors. All data and code releases are maintained by the corresponding authors at UIUC and are not affiliated with Meta

Abstract:Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows potential on simple spreadsheet operations, it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially enhances AI agents’ performance on both general and domain-specific spreadsheet tasks: it improves Qwen3-4B-Thinking-2507’s Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain-Spreadsheet dataset. These results highlight Spreadsheet-RL’s strong potential for generalization and real-world adoption in spreadsheet automation, and broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.

[AI-24] Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Link: https://arxiv.org/abs/2605.22634
Authors: Ting Liu
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, 3 tables

Abstract:Skills are increasingly used to package agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, skills often need to express more than task guidance: they must make goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules inspectable. This paper proposes contractual skills, a GovernSpec-inspired design framework for organizing this http URL files as readable task contracts while preserving lightweight skill discovery and progressive loading. The framework clarifies the boundary between contractual skills, GovernSpec YAML contracts, Model Context Protocol surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems. We evaluate the framework with two offline experiments. A text-generation study covers three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight generation models, yielding 960 outputs and 1680 cross-judge score records. Contractual skills outperform no-skill and minimal-skill baselines on all tested models. Relative to information-rich plain expanded skills, the gains are small and mixed, suggesting that contractual fields mainly improve checkability and maintainability rather than raw generation quality. A tool-calling challenge covers eight models and 192 simulated tool-call records. Skills usually reduce high-risk tool attempts, but model differences remain and runtime tool guardrails are still required. The results suggest that contractual skills are best understood as a governance layer that makes task intent, boundaries, and acceptance criteria explicit, not as a standalone safety mechanism. 

[AI-25] Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

Link: https://arxiv.org/abs/2605.22612
Authors: Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure

Abstract:Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance. Our position is that the evaluation–deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this, we make two contributions: first, we propose BenchmarkCards, an artifact that documents assumptions, and second, we propose staged evaluation, a procedure that systematically tests assumptions and evaluates performance.

[AI-26] Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms

Link: https://arxiv.org/abs/2605.22604
Authors: Md Israfeel
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:The advent of cardless artificial intelligence (AI) banking heralds a paradigm shift in the financial landscape, offering users unprecedented security and convenience. This paper outlines a comprehensive framework designed to enhance cybersecurity, introduce auto-generated virtual cards, and mitigate fraud risks within cardless AI banking systems. The framework envisions a future banking architecture that employs AI-powered data cryptography to create secure virtual cards for seamless transactions. By emphasizing secure communication channels, it ensures the integrity of financial activities among banking systems, cardholders, and third-party vendors. AI-based authorization methodologies play a pivotal role in authenticating each transaction while proactively identifying potential fraud, demonstrating the framework’s efficacy in fortifying cardless AI banking security. The initial approach, featuring an AI-driven, feature-based banking system, ensures the generation of virtual cards with encrypted data, minimizing information exposure and reducing fraud risks. Integrating a machine learning algorithm adds an additional layer of protection against potential fraudulent activities. In conclusion, the proposed framework establishes a holistic cybersecurity and fraud-mitigation paradigm for cardless AI banking systems. Its implementation empowers financial institutions to address security concerns associated with traditional banking, paving the way for a future banking landscape that is not only fraud-resistant but also secure and convenient for users.

[AI-27] Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

Link: https://arxiv.org/abs/2605.22602
Authors: Minghui Ma, Bin Guo, Runze Yang, Mengqi Chen, Yan Liu, Jingqi Liu, Yahan Pei, Xuehao Ma, Qiuyun Zhang, Zhiwen Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures

Abstract:Persuasive dialogue requires reasoning about others’ latent mental states, a capability known as Theory of Mind (ToM). However, due to reliance on simple prompting strategies and insufficient ToM knowledge, existing LLMs often fail to capture the intrinsic dependencies among mental states, leading to fragmented representations and unstable reasoning. To address these challenges, we introduce the ToM-based Persuasive Dialogue (ToM-PD) task, grounded in the Belief-Desire-Intention (BDI) framework, which explicitly models the sequential dependencies among mental states in multi-turn dialogues. To facilitate research on this task, we construct a large-scale annotated dataset, ToM-based Broad Persuasive Dialogues (ToM-BPD), capturing fine-grained mental states and corresponding persuasive strategies. We further propose Think Thrice Before You Speak (TTBYS), a knowledge-enhanced stepwise reasoning framework that leverages both explicit and implicit prior experiences to improve LLMs’ inference of desires, beliefs, and persuasive strategies. Experimental results demonstrate that Qwen3-8B equipped with TTBYS outperforms GPT-5 by 1.20%, 22.80%, and 16.97% in predicting desires, beliefs, and persuasive strategies, respectively. Case studies further show that our approach enhances interpretability and consistency in reasoning.

[AI-28] MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy

Link: https://arxiv.org/abs/2605.22597
Authors: Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
Comments:

Abstract:Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at this https URL.

[AI-29] Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

Link: https://arxiv.org/abs/2605.22568
Authors: Sahar Abdelnabi, Chris Hicks, Konrad Rieck, Ahmad-Reza Sadeghi
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.

[AI-30] Dynamic Hypergraph Representation Learning for Multivariate Time Series without Prior Knowledge

Link: https://arxiv.org/abs/2605.22540
Authors: Marco Gregnanin, Johannes De Smedt, Giorgio Gnecco, Maurizio Parton
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hypergraphs have the capacity to capture higher-dimensional relationships among entities across various domains, making them a subject of growing interest within the research community for understanding the structure and dynamics of complex systems. However, a key challenge is the derivation of hypergraph representations from time series data in situations where the structure of the hypergraph is limited or absent. In this study, we propose a model that constructs a dynamic hypergraph representation for multivariate time series without relying on prior knowledge of the data. This is achieved by applying community detection to the time series and transforming the resulting communities, obtained through an attention mechanism, into a hypergraph using a clique-based technique. Hypergraph representations are derived from different time series datasets, and the resulting hypergraphs are then used by a Dynamic Hypergraph Attention Convolution Network (DHACN) for multivariate time series predictions. This research advances the field of hypergraph representation by introducing a novel approach that is better suited to uncover high-order relationships without prior knowledge.

[AI-31] TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Link: https://arxiv.org/abs/2605.22535
Authors: Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O’Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from “in-the-wild” terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at this https URL.

[AI-32] A Subjective Logic-based method for runtime confidence updates in safety arguments

Link: https://arxiv.org/abs/2605.22530
Authors: Benjamin Herd, Jessica Kelly, Clarissa Heinemann, João-Vitor Zacchi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 41st ACM/SIGAPP Symposium on Applied Computing (SAC 2026)

Abstract:We present a method for dynamic quantitative assurance that enhances static safety cases with continuous, runtime-driven confidence updates. The method quantifies and propagates confidence across the development lifecycle by integrating design-time evidence and windowed runtime Safety Performance Indicators (SPIs) within a single Subjective Logic (SL)-based assurance case. At runtime, SPI evidence is continuously evaluated, and targeted claims are updated using a rule that increases confidence in the absence of violations and imposes prompt penalties when violations occur. This design prioritizes safety-relevant responsiveness over exact classical Bayesian posterior updates. We demonstrate the method using a simulation-based construction zone assist function, focusing on an ML-based construction cone detection component, and show how confidence evolves as SPI evidence is observed in operation.
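
For readers unfamiliar with Subjective Logic, the sketch below shows the standard mapping from evidence counts to a binomial opinion (belief, disbelief, uncertainty) and how a sliding window of SPI observations could feed it. It is not the paper's update rule, which the abstract describes as deliberately asymmetric; the window handling, the 0/1 violation encoding, and all names here are assumptions for illustration.

```python
from dataclasses import dataclass

# Non-informative prior weight; 2 is the conventional default in Subjective Logic.
PRIOR_WEIGHT = 2.0


@dataclass(frozen=True)
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self) -> float:
        # Belief plus the share of uncertainty attributed by the base rate.
        return self.belief + self.base_rate * self.uncertainty


def opinion_from_evidence(positive: float, negative: float, base_rate: float = 0.5) -> Opinion:
    """Map counts of supporting and contradicting observations to a binomial opinion."""
    total = positive + negative + PRIOR_WEIGHT
    return Opinion(positive / total, negative / total, PRIOR_WEIGHT / total, base_rate)


def windowed_opinions(violations: list[int], window: int) -> list[Opinion]:
    """One opinion per sliding window; violations holds 1 for a violation, 0 otherwise."""
    opinions = []
    for end in range(window, len(violations) + 1):
        negative = sum(violations[end - window:end])
        opinions.append(opinion_from_evidence(window - negative, negative))
    return opinions


# Hypothetical SPI stream: two violations appear late in the run.
stream = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
for op in windowed_opinions(stream, window=8):
    print(round(op.belief, 3), round(op.disbelief, 3), round(op.uncertainty, 3))
```

Belief, disbelief, and uncertainty always sum to one, so a window with fewer observations would show up as higher uncertainty rather than as lower belief.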

[AI-33] Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets

Link: https://arxiv.org/abs/2605.22529
Authors: Ioannis J. Vourganas, Anna Lito Michala
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 35 pages, 3 figures, submitted to ACM TAISAP

Abstract:This paper investigates an unexplored yet impactful vulnerability in AI explainability used in intrusion detection systems (IDS): multicollinearity-induced instability. Despite extensive reliance on post-hoc explainability tools such as SHAP or LIME, the impact of correlated features on explanation robustness has not been evaluated. We introduce a formal theorem stating that multicollinearity inflates attribution variance. This demonstrates that explanations and feature importances are non-identifiable under multicollinearity. A suite of comprehensive experiments validates the theorem on a representative benchmark dataset, UNSW-NB15. Four widely used families of models are evaluated, including linear, tree-based, kernel, and neural, across full and pruned feature sets based on VIF and correlation thresholding. We propose the novel metric of Explainability Fragility Score and two novel methods to mitigate it with variable integration complexity. CAA-Filtering focuses on stabilising explanations by grouping attributions of trained models. SHARP is a novel training-time regularisation framework that penalises attribution instability, enabling controllable and monotonic improvement of explainability stability. The findings support stable predictive performance, using Kendall’s τ to quantify instability across bootstrapped explanations. This work has direct implications for the trustworthiness and reproducibility of XAI in security-critical contexts, and motivates incorporating multicollinearity mitigations into the IDS pipelines, providing a set of guidelines for practitioners.
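
The VIF-based pruning mentioned in the abstract can be sketched with NumPy alone. The block below computes variance inflation factors from the inverse correlation matrix and greedily drops the worst feature until every VIF falls under a cut-off. The synthetic data, the feature names, and the threshold of 10 (a common rule of thumb) are assumptions; the paper's actual thresholds and pipeline are not given in the abstract.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def prune_by_vif(X: np.ndarray, names: list[str], threshold: float = 10.0) -> list[str]:
    """Greedily drop the feature with the highest VIF until all VIFs are <= threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vif = variance_inflation_factors(X[:, keep])
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]


# Synthetic example: feature "c" is almost an exact copy of feature "a".
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.01 * rng.normal(size=500)
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]  # hypothetical feature names

vif = variance_inflation_factors(X)
kept = prune_by_vif(X, names)
print(dict(zip(names, np.round(vif, 1))), kept)
```

Only one of the two near-duplicate columns survives pruning, which is the situation where attributions split arbitrarily between correlated features.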

[AI-34] Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems

Link: https://arxiv.org/abs/2605.22513
Authors: Jiaqi Yan, Ankush Chakrabarty, Niklas Schmid, John Lygeros, Alisa Rupenyan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:In this paper, we address the problem of reference tracking for uncertain nonlinear systems. Since collecting data from the target system (i.e., the system of interest) is often challenging, our objective is to design optimal controllers using limited target system data. Meta-learning provides a promising paradigm by leveraging offline data from source systems (systems sharing structural similarities with the target system) to accelerate training and enhance control performance. Motivated by this idea, we propose a meta-learning-based control framework that tailors the implicit model-agnostic meta-learning (iMAML) algorithm to the control setting. The framework operates in two phases: an (offline) meta-training phase, where an aggregated representation is learned from source data to capture the shared system dynamics among similar systems, and an (online) meta-adaptation phase, where this representation is fine-tuned on the target system using only a few data samples and limited adaptation steps. We formulate this framework as a bi-level optimization problem and provide an efficient solution with reduced storage complexity and few approximations. The proposed framework is general, allowing various learning algorithms to be integrated. To demonstrate this flexibility, we propose two specific learning algorithms that can be incorporated into our framework based on a neural state-space model and a deep Q-network, respectively. The primary distinction between these approaches is whether explicit system identification is required. Numerical simulations and hardware experiments demonstrate that the proposed methods enhance control performance and consistently outperform baseline approaches.
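
For context, the bi-level structure that iMAML is built on can be written as follows. This is the generic formulation from the original iMAML work, not the control-specific objective of this paper; the symbols (meta-parameters \theta, task parameters \phi_i, regularisation strength \lambda, M source tasks) are chosen here for illustration.

```latex
\begin{aligned}
\theta^{\star} &= \arg\min_{\theta} \; \frac{1}{M} \sum_{i=1}^{M}
  \mathcal{L}_i^{\text{test}}\bigl(\phi_i^{\star}(\theta)\bigr)
  && \text{(outer, meta-training level)} \\
\phi_i^{\star}(\theta) &= \arg\min_{\phi} \; \mathcal{L}_i^{\text{train}}(\phi)
  + \frac{\lambda}{2}\,\lVert \phi - \theta \rVert^{2}
  && \text{(inner, adaptation level)} \\
\frac{d\phi_i^{\star}}{d\theta} &= \Bigl( I + \tfrac{1}{\lambda}\,
  \nabla_{\phi}^{2}\,\mathcal{L}_i^{\text{train}}\bigl(\phi_i^{\star}\bigr) \Bigr)^{-1}
  && \text{(implicit Jacobian)}
\end{aligned}
```

Because the Jacobian depends only on the inner solution and not on the path taken to reach it, the meta-gradient can be computed without storing the inner optimisation trajectory, which is consistent with the reduced storage complexity the abstract mentions.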

[AI-35] Towards Direct Evaluation of Harness Optimizers via Priority Ranking

Link: https://arxiv.org/abs/2605.22505
Authors: Kai Tzu-iunn Ong, Minseok Kang, Dongwook Choi, Junhee Cho, Seungju Kim, Seungwon Lim, Geunha Jang, Minwoo Oh, Bogyung Jeong, Sunghwan Kim, Taeyoon Kwon, Jinyoung Yeo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. Work in Progress

Abstract:Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents’ performance gains. This indirect end-improvement evaluation neglects optimizers’ actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers’ informed update actions or simply trial-and-error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers’ ranking performance correlates with their ability to improve agents in actual multi-step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human-verified optimization scenarios spanning across domains, designs, and time stages. Codes and data can be found at this https URL.

[AI-36] Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Link: https://arxiv.org/abs/2605.22502
Authors: Simon Dennis, Rivaan Patil, Kevin Shabahang, Hao Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages

Abstract:Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model’s system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model – creating a subterranean agent – should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

[AI-37] The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning

Link: https://arxiv.org/abs/2605.22498
Authors: Lucas Sheneman
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: Use: 21 pages, 10 figures, 10 tables. Preprint; source code available at this https URL

Abstract:Scientific machine learning often requires combining known physics with unknown parameters or correction terms learned from data. Existing approaches either ignore known structure, encode it as a soft penalty, or require hand-written PyTorch code for each equation. We present The Neural Compiler, a system that translates programs written in a first-order Scheme-like expression language into frozen, differentiable PyTorch modules. These modules match the source program to floating-point precision and provide gradients through autograd. In hybrid models, the compiled module encodes known physics exactly while learned components model the unknown remainder. We evaluate the compiler across six experiment domains: Feynman physics equations, Lotka-Volterra dynamics, a damped pendulum, a one-dimensional heat equation, three-dimensional vector mechanics, and compositional generalization. Compiled modules match hand-coded PyTorch implementations numerically for single equations, showing no accuracy loss from compilation. With only 1 to 4 trainable parameters, compiled models recover physical constants to less than 1 percent error in most cases, while standard PINN baselines with more than 8500 parameters show 7 to 93 percent error. Compiled modules also compose with zero error, while neural approximations can accumulate large errors in deep composition chains. The main value of the compiler is not improved accuracy over hand-coded equations, but systematic composability: it generates correct, differentiable modules from symbolic specifications without rewriting each equation by hand. The system supports 51 primitive operations, including vector and matrix algebra, enabling PDE discretizations and hybrid scientific models. This string-in, module-out interface also provides a natural target for large language models that translate scientific descriptions into executable differentiable modules.
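
To make the "string-in, module-out" idea concrete, here is a toy sketch that turns a Scheme-like expression string into a plain Python callable. It is not the paper's system: it produces closures rather than PyTorch modules, has no autograd, and the primitive table and function names are hypothetical stand-ins for the 51 primitives mentioned in the abstract.

```python
import math

# Hypothetical primitive table; the paper's own primitives are not listed in the abstract.
PRIMITIVES = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "sin": math.sin,
    "exp": math.exp,
}

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse a token list (consumed in place) into nested Python lists."""
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    try:
        return float(tok)
    except ValueError:
        return tok  # a symbol, looked up in the environment at call time

def compile_expr(expr):
    """Turn a parsed expression into a closure mapping an environment to a value."""
    if isinstance(expr, float):
        return lambda env: expr
    if isinstance(expr, str):
        return lambda env: env[expr]
    op = PRIMITIVES[expr[0]]
    args = [compile_expr(e) for e in expr[1:]]
    return lambda env: op(*(arg(env) for arg in args))

def compile_program(src):
    return compile_expr(parse(tokenize(src)))

# Damped oscillation: A * exp(-k * t) * sin(w * t)
f = compile_program("(* A (* (exp (- 0 (* k t))) (sin (* w t))))")
```

Calling `f({"A": 2.0, "k": 0.5, "t": 1.0, "w": 3.0})` evaluates the expression with those bindings. In a hybrid model of the kind the abstract describes, some of those bindings would be trainable parameters.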

[AI-38] Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

Link: https://arxiv.org/abs/2605.22493
Authors: Lorenzo Mazza, Massimiliano Datres, Ariel Rodriguez, Sebastian Bodenstedt, Gitta Kutyniok, Stefanie Speidel
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.

[AI-39] Implicit Regularization of Mini-Batch Training in Graph Neural Networks

Link: https://arxiv.org/abs/2605.22480
Authors: Clement Wang, Antoine Vialle, Robin Vaysse, Thomas Bonald
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mini-batch training of Graph Neural Networks (GNNs) is fundamentally different from training on i.i.d. data: sampling a subgraph alters the topology and introduces boundary effects, leading prior work to develop structure-aware samplers that preserve local connectivity and reduce embedding variance. Surprisingly, we demonstrate that the simplest possible scheme, Random Node Sampling (RNS), training on the induced subgraph of uniformly sampled nodes, matches or outperforms full-graph training on 8 of 10 datasets at a fraction of the wall-clock time and memory. To explain this, we apply backward error analysis to graph mini-batch Stochastic Gradient Descent (SGD) and show that it implicitly minimizes the sampled loss plus a regularizer proportional to the mini-batch gradient variance, a quantity directly shaped by the sampler. Although RNS discards local structure, it produces mini-batches whose expected loss is closer to the full-graph loss, and whose per-batch gradients have lower variance, yielding a better implicit objective. Our analysis reframes the choice of graph sampler as a form of implicit regularization, and identifies RNS as a strong, theoretically grounded method for scalable GNN training.
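
Random Node Sampling as described above (uniformly sample nodes, train on the induced subgraph) can be sketched in a few lines. This is an illustrative stdlib version with a hypothetical function name and a toy ring graph; a real pipeline would use a graph library and carry node features and labels along.

```python
import random

def random_node_sample(num_nodes, edges, batch_size, rng):
    """Uniformly sample nodes and return the induced subgraph.

    edges is a list of (u, v) pairs; an edge survives only if both
    endpoints were sampled.
    """
    nodes = set(rng.sample(range(num_nodes), batch_size))
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, induced

# Toy graph: a 6-node ring
edges = [(i, (i + 1) % 6) for i in range(6)]
nodes, sub_edges = random_node_sample(6, edges, batch_size=4, rng=random.Random(0))
```

Edges that cross the sample boundary are simply dropped, which is the loss of local structure the abstract refers to.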

[AI-40] BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series

Link: https://arxiv.org/abs/2605.22468
Authors: Guikang Du, Haoran Li, Xinyu Liu, Zhibo Zhang, Xiaoli Gong, Jin Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-subject generalization in biomedical time-series (BTS) refers to training on data from some subjects and testing on unseen subjects. The key challenge is to suppress subject-specific variability in BTS […]. […] existing methods implicitly suppress the variability through model building or subject adversarial learning, but rarely model it explicitly. We introduce spectral drift as a new perspective to characterize subject-specific variability. […], BTS signals under the same label often share consistent oscillatory structure, yet exhibit subject-dependent magnitude or phase shifts in specific frequency components, which we interpret as subject-specific variability. Building on this insight, we propose BioFormer. At its core is a Frequency-Band Alignment Module (FBAM) that generates band-wise modulation factors from the spectral distribution and adaptively adjusts amplitude and phase to align spectral structure, thereby mitigating […]. We further pair FBAM with Sample Conditional Layer Normalization, which infers normalization parameters from intrinsic signal statistics rather than subject identity, stabilizing cross-subject […]. […] experiments on six datasets demonstrate that BioFormer outperforms 12 baselines, yielding absolute F1-score improvements of 6%.

[AI-41] KAPPS: A knowledge-based CPPS Architecture for the Circular Factory

Link: https://arxiv.org/abs/2605.22457
Authors: Etienne Hoffmann, Jan-Felix Klein, Sören Weindel, Max Goebels, Sebastian Behrendt, Daniel Hernández, Ratan Bahadur Thapa, Jürgen Fleischer, Kai Furmans, Steffen Staab
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Submitted to Journal of Manufacturing Systems (JMS)

Abstract:While linear manufacturing relies on homogeneous materials and predefined process sequences, circular manufacturing reintroduces used products with heterogeneous and uncertain conditions. This shift demands manufacturing systems capable of handling variable product states, dynamically reconfigurable processes, and the integration of human and machine knowledge. Conventional manufacturing IT architectures, designed for stable structures and deterministic execution, are unable to meet these requirements, as they cannot adequately represent and manage the uniqueness of individual components at runtime. Following a design science methodology for developing a Cyber-Physical Production System for circular manufacturing, we derive 14 requirements from five complementary perspectives. Based on these requirements, we design KAPPS, a knowledge-based architecture that uses an ontology-grounded knowledge graph as a unifying data backbone, combined with a semantic interface layer to enable consistent data and information integration, reasoning, and communication across heterogeneous systems and services, turning the knowledge graph from an integration layer into the factory's authoritative write-time state. KAPPS incorporates modules for constraint enforcement and event-driven planning, enabling incremental adaptation of execution plans under uncertainty and human-machine knowledge exchange. The applicability of KAPPS is demonstrated through two implemented use cases: (i) anomaly detection and learning through knowledge-graph-mediated services and (ii) runtime constraint enforcement in a modular conveyor system. Subsequently, the architecture is evaluated against the 14 requirements (ed. abstract shortened)

[AI-42] Steins;Gate Drive: Semantic Safety Arbitration over Structured Futures for Latency-Decoupled LLM Planning

Link: https://arxiv.org/abs/2605.22456
Authors: Anjie Qiu, Hans D. Schotten
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures, 5 tables, submitted to IEEE Transactions on Intelligent Vehicles

Abstract:Cloud-hosted LLM driver agents provide useful semantic judgments, but their inference latency exceeds stepwise vehicle-control windows. Learned world models predict futures, but they usually keep future generation and action selection inside large coupled loops. We present SteinsGateDrive, a latency-decoupled planner-runtime architecture in which the worldline metaphor from the eponymous story names one plausible consequence of an intervention: the LLM selects counterfactual driving futures before the final control instant, and a runtime reuses the selected forecast only while safety contracts remain valid. The generator builds three world-line roles: alpha nominal ego-conditioned futures, beta interaction counterfactuals around nearby vehicles, and gamma hazard-stress futures such as braking, cut-ins, or blocked corridors. The selected branch becomes a typed StrategicForecast with horizon, validity/abort conditions, fallback, and authority. On a within-subject, matched-seed normal-highway protocol with 10 seeds and 20 steps, GPT-5.4 mini reduces effective lag from +3.07 s at 1-second horizon to -0.01 s at 4-second horizon while preserving the measured no-collision safety boundary. The architecture’s safety contribution comes from the atom-predicate runtime check, not from the drift score, which functions as a refresh-frequency knob.

[AI-43] Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

Link: https://arxiv.org/abs/2605.22454
Authors: Benjamin Poole, Andrew Quinn, Li Yang, Minwoo Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due to the performance degradation incurred by critic regularization. This actor-centric approach overlooks the potential of data rehearsal for value function approximation. Moreover, existing evaluations in CRL rarely consider multi-cyclic environments where task sequences repeat, a critical real-world scenario that exacerbates forgetting and loss of plasticity. We investigate data rehearsal for Deep Q-Networks using Q-value regularization in multi-cyclic settings and propose Qreg+NWLU, which introduces two simple modifications: (1) continuous data rehearsal that dynamically collects and updates stored Q-values throughout training, and (2) “No-Wait” regularization that applies immediately rather than after the first task. Together, these modifications yield improvements in learning efficiency, forgetting mitigation, and knowledge transfer over Qreg and conventional CRL methods within value function approximation settings.
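
One way to read "Q-value regularization with data rehearsal" is as a TD loss plus a penalty that keeps current Q-values close to values stored earlier. The sketch below shows that reading on a tabular toy problem. The exact loss used by Qreg or Qreg+NWLU is not given in the abstract, so the function names, the squared-error penalty, and the weight `lam` are all assumptions, and neither the continuous buffer updates nor the "No-Wait" schedule is modeled.

```python
def td_loss(q, transition, gamma=0.99):
    """Squared one-step TD error for a tabular Q dict keyed by (state, action)."""
    s, a, r, s_next, actions = transition
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    return (q[(s, a)] - target) ** 2

def rehearsal_penalty(q, buffer):
    """Mean squared gap between current Q-values and Q-values stored earlier."""
    return sum((q[key] - stored) ** 2 for key, stored in buffer) / len(buffer)

def regularized_loss(q, transition, buffer, lam=1.0):
    """TD loss plus a weighted rehearsal penalty (assumed form, see lead-in)."""
    return td_loss(q, transition) + lam * rehearsal_penalty(q, buffer)

# Hypothetical two-state, two-action example
q = {("s0", 0): 1.0, ("s0", 1): 0.0, ("s1", 0): 2.0, ("s1", 1): 0.5}
buffer = [(("s0", 0), 1.5), (("s1", 1), 0.5)]  # (state-action key, stored Q-value)
loss = regularized_loss(q, ("s0", 0, 1.0, "s1", [0, 1]), buffer, lam=0.5)
```

In a Deep Q-Network the same penalty would be applied to network outputs on rehearsed states rather than to table entries.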

[AI-44] S2ED: From Story to Executable Descriptions for Consistency-Aware Story Illustration ICME2026

Link: https://arxiv.org/abs/2605.22448
Authors: Sijing Yin, Jiamou Liu, Xiao Tang, Yaser Shakib, Qian Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures. Accepted by IEEE ICME 2026

Abstract:Multi-frame story illustration requires long-horizon coherence beyond single-image text-to-image generation, including narrative decomposition and persistent character identity, layout, and affect across frames. We propose Story-to-Executable Descriptions (S2ED), a training-free, model-agnostic, prompt-layer framework that converts a full story into a sequence of explicit, editable executable descriptions for more consistent rendering. S2ED coordinates three agents to segment the narrative, ground canonical character attributes, and enrich spatial and affective cues, enabling interpretable prompt-carried state propagation and local edits to repair drift without retraining the generator. Experiments on Flintstones and Shakoo Maku show that S2ED improves sequence-level consistency and character fidelity over strong prompting, large-model planning, and a reference training-based method, under both automatic metrics and human judgments. We also deploy S2ED in an end-to-end story-to-storybook system for children’s illustrated stories, with a supplementary video.

[AI-45] A Constant-Time Implementation Methodology for Activation Functions on Microcontrollers

Link: https://arxiv.org/abs/2605.22441
Authors: Andrii Tyvodar, Andreas Rechberger, Dirmanto Jap, Shivam Bhasin, Bernhard Jungk, Jakub Breier, Xiaolu Hou
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Embedded neural-network inference can leak information through timing side channels, including leakage caused by the evaluation of activation functions. This work proposes a constant-time implementation methodology for activation functions on embedded microcontrollers and validates it on ReLU, sigmoid, tanh, GELU, and Swish on an ARM Cortex-M4 platform. The proposed methodology combines branchless selection, fixed-cost Padé-based approximation, dummy arithmetic where needed, and cycle alignment to obtain timing-regular activation-function implementations. As motivation, we also evaluate a desynchronization-based countermeasure and show that it remains vulnerable to a template-based timing attack. Experimental results show that the resulting protected implementations achieve identical cycle counts for all tested inputs, including 88 cycles in the three-function setting and 108 cycles in the five-function setting. At the same time, the numerical-error analysis indicates that the approximated nonlinear functions retain high accuracy. These results suggest that the proposed methodology provides a practical basis for constructing side-channel-resistant activation functions in embedded inference.
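
Two of the ingredients named above, branchless selection and a fixed-cost Padé approximation, can be illustrated as follows. This sketch only shows the logic: Python offers no timing guarantees, so it is not a constant-time implementation, and the [3/2] approximant of tanh is an assumed choice since the abstract does not state which orders the paper uses.

```python
import math

MASK32 = 0xFFFFFFFF

def relu_branchless_i32(x):
    """ReLU on a signed 32-bit integer using a sign-derived mask, no if/else."""
    ux = x & MASK32                 # two's-complement view of x
    sign = ux >> 31                 # 1 if x is negative, else 0
    keep = (sign - 1) & MASK32      # 0xFFFFFFFF if x >= 0, 0 if x < 0
    return ux & keep                # x when non-negative, 0 otherwise

def tanh_pade(x):
    """[3/2] Pade approximant of tanh: a fixed number of arithmetic operations."""
    x2 = x * x
    return x * (15.0 + x2) / (15.0 + 6.0 * x2)
```

The approximant is accurate near zero and degrades for larger inputs (it exceeds 1 around |x| = 3), so a practical version would restrict or clamp the input range, again without branching.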

[AI-46] Characterizing the Fault Response of the Intel Neural Compute Stick 2 Under Single-Pulse Electromagnetic Fault Injection

Link: https://arxiv.org/abs/2605.22437
Authors: Štefan Kučerák, Jakub Breier, Xiaolu Hou
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision processing units and other commercial neural-network inference accelerators are increasingly deployed in safety-relevant edge applications, but their fault response under transient hardware disturbances remains poorly characterized in the open literature. For the Intel Movidius Myriad X, packaged as the Intel Neural Compute Stick 2 (NCS2), only a single feasibility study has been published. We report a systematic single-pulse electromagnetic fault injection (EMFI) campaign on the NCS2 running three ImageNet-trained convolutional neural networks (ResNet-18, ResNet-50, VGG-11) on the OpenVINO runtime. Across 1,536 spot-test trials at characterized hotspots and approximately 16,000 parameter-search trials, single pulses produce four reproducible outcome classes: no measured accuracy change, minor silent data corruption, major persistent degradation that survives across subsequent inferences until model reload, and device hangs requiring USB power-cycling; these outcomes are respectively interpreted as no-effect, SDC with possible SET-like or small persistent-state mechanisms, SEU-like persistent corruption, and SEFI-like loss of functionality. Two findings are central. First, the major-degradation class can be induced at 18-31% of trials at characterized hotspots, with post-collapse top-1 accuracy below five percent and persistence across all subsequent inferences until explicit model reload - a regime that no inference-API-level mechanism detects. Second, this regime is also inducible by pulses delivered to an idle device with the model already loaded, demonstrating that load-time integrity checks alone are insufficient. We discuss mitigation strategies graded by class, focusing on mechanisms implementable at the application level without modification to the device firmware or the OpenVINO runtime.

[AI-47] VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

Link: https://arxiv.org/abs/2605.22368
Authors: Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:As large language models (LLMs) are increasingly deployed for software engineering, constructing high-quality benchmarks is crucial for evaluating not just the functional correctness, but also the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of positive and negative test cases, leading to an overestimation of model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a novel framework driven by adversarial implementations. It consists of two stages: test-suite expansion to construct diverse and challenging test cases, and test-suite reduction to distill them into compact yet discriminative suites. While VeriScale is general, we instantiate it on Verina to construct VerinaPlus, which expands the original test suites by over $83\times$, and VerinaLite, a lightweight $14\times$ variant. Our experiments across eight state-of-the-art LLMs demonstrate that VerinaPlus exposes substantial model weaknesses hidden by the original benchmark, evidenced by sharp score drops on both SpecGen and CodeGen tasks, whereas VerinaLite maintains this discriminative power at a fraction of the evaluation cost. The enhanced benchmarks and source code are publicly available at this https URL.

[AI-48] TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting ICML2026

Link: https://arxiv.org/abs/2605.22365
Authors: Quang Duc Nguyen, Siyuan Liang, Yiming Li, Fushuo Huo, Dacheng Tao
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 44 pages, 30 figures. ICML 2026

Abstract:Time Series Forecasting (TSF) plays a critical role across many domains, yet it is vulnerable to backdoor attacks. However, backdoor defenses tailored to TSF remain underexplored, due to data entanglement and task-formulation shift challenges. To fill this gap, we conduct a systematic evaluation of thirteen representative backdoor defenses across the TSF life cycle and analyze their failure modes. Our results reveal two fundamental issues: (1) data entanglement induces channel-level signal dilution, rendering sample-filtering and trigger-synthesis defenses ineffective at localizing backdoors; and (2) task-formulation shift leads to training-loss degeneration, causing poisoned and clean windows to become indistinguishable at training stages. Based on these findings, we propose a training-time backdoor defense for TSF, termed TimeGuard. Our method adopts channel-wise pool training as the core paradigm and initializes a high-confidence pool using time-aware criteria to mitigate signal dilution. Moreover, we introduce distance-regularized loss selection to progressively expand the reliable pool during training and ease loss degeneration. Extensive experiments across multiple datasets, forecasting architectures, and TSF backdoor attacks demonstrate that TimeGuard substantially improves robustness, boosting $\mathrm{MAE}_\mathrm{P}$ by $1.96\times$ over the leading baseline, while preserving clean performance within 5% $\mathrm{MAE}_\mathrm{C}$.

[AI-49] Scaling Observation-aware Planning in Uncertain Domains

Link: https://arxiv.org/abs/2605.22364
Authors: Adrian Zvizdenco, Arthur Conrado Veiga Bosquetti, Alberto Lluch Lafuente, Christoph Matheja
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Deciding which sensing capabilities to deploy on an agent in uncertain domains is a fundamental engineering challenge, in which one balances task achievability against the high costs of hardware and processing. This problem has previously been formalized as the Optimal Observability Problem (OOP), based on the well-known Partially Observable Markov Decision Process (POMDP) model for decision-making. This work studies (sub-)symbolic techniques to scale solving of decidable fragments of the OOP, namely the Sensor Selection Problem (SSP) and the Positional Observability Problem (POP). Besides improving the original approach based on parameter synthesis, we develop a new solving method that identifies sensible observation functions via decomposition of POMDPs, improving performance by 3 and 5 orders of magnitude for instance size and runtime, respectively.

[AI-50] Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Link: https://arxiv.org/abs/2605.22337
Authors: Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures

Abstract:The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long […]. […] KV Cache eviction has become an important research direction; however, existing methods based on fixed Soft Tokens (e.g., Judge Q) rely on a static parameter set as the query to evaluate the importance of KV pairs, so they cannot adapt dynamically to different input prompts, and they cannot precisely capture complex and changing task […]. […], evicted KV pairs are discarded permanently, which causes irreversible information loss and context breaks. To address this problem, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$, and we use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, so we dynamically synthesize the most targeted k Soft Tokens from the input prompt […]. We append these Soft Tokens to the end of the input sequence to probe key information. We also introduce an attention-flow based integration mechanism, which redistributes the semantic information of removed tokens into retained tokens, and this keeps the dropped context information […]. […] on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV Cache compression.
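
The selector step described above relies on Gumbel-Softmax to obtain differentiable combination weights over a library of basis vectors. The sketch below shows only the sampling and mixing arithmetic in plain Python, with hypothetical function names and a toy library. The selector network, the orthogonality constraint, and gradient flow are outside its scope.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)        # keep u away from 0 to avoid log(0)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def mix_rows(weights, library):
    """Weighted sum of library rows (each row is one candidate meta-token vector)."""
    dim = len(library[0])
    return [sum(w * row[d] for w, row in zip(weights, library)) for d in range(dim)]

# Hypothetical 3-entry library of 2-dimensional basis vectors
library = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = gumbel_softmax([2.0, 0.0, -1.0], tau=0.5, rng=random.Random(0))
soft_token = mix_rows(weights, library)
```

Lower values of `tau` push the weights toward a one-hot selection, which is how the relaxation approximates a sparse choice.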

[AI-51] SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection

Link: https://arxiv.org/abs/2605.22331
Authors: Santiago Ospitia, John Sanabria, John Garcia-Henao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 13 pages, 5 figures. Submitted to BioCARLA 2025 Workshop

Abstract:Despite strong predictive results in the clinical machine learning literature, the translation of these models into bedside use remains limited by systems-level barriers: heterogeneous data representations, the absence of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present the SepsisAI-Orchestrator, an open-source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. Using k6 with 50-1000 concurrent virtual users, we find that replica count must be matched to the physical CPU thread count of the host: scaling from 3 to 12 replicas on a 12-thread CPU reduces p95 latency from 3.3s to 1.41s (57.3% reduction) and eliminates all request failures, while over-provisioning to 24 or 48 replicas degrades performance due to scheduler contention. To our knowledge this U-shaped scaling behavior has not been quantified previously for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at this https URL.

[AI-52] Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions

Link: https://arxiv.org/abs/2605.22321
Authors: Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian, Jialuo Chen, Jingyi Wang, Xiaofang Yang, Shiwen Cui, Changhua Meng, Xinhao Deng, Zhen Wang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 21 pages, 9 figures, 7 tables. Code and data available at this https URL

Abstract:As autonomous agents (e.g., OpenClaw) increasingly operate with deep system-level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses overwhelmingly focus on single-turn, stateless behaviors, overlooking the expanded attack surface inherent in stateful, multi-turn interactions and dynamic tool invocations. In this paper, we propose a novel, multi-dimensional evasion framework targeting LLM-based agent systems. We introduce three stealthy attack vectors: (1) Temporal evasion, which fragments malicious payloads across sequential interaction turns; (2) Spatial evasion, which conceals payloads within complex external artifacts that evade standard LLM parsing mechanisms; and (3) Semantic evasion, which obscures malicious intents beneath benign contextual noise. To systematically quantify these threats, we construct A3S-Bench, a comprehensive benchmark comprising 2,254 real-world agent execution trajectories. Evaluating a standard agent framework separately integrated with 10 mainstream LLM backbones against 20 practical threat scenarios, we demonstrate that our evasion framework elevates the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture-level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting an urgent need for defense mechanisms tailored to the unique threats.

[AI-53] Evaluation of Pipelines for Data Integration into Knowledge Graphs

Link: https://arxiv.org/abs/2605.22304
Authors: Marvin Hofer, Erhard Rahm
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract:Integrating new data into knowledge graphs (KG) typically involves different tasks that are executed within workflows or pipelines. There are many possible pipelines for a specific integration problem, but there is not yet a general approach to evaluate the overall quality and performance of such pipelines to be able to determine the best choices. We therefore propose a new benchmark, KGI-Bench, to evaluate integration pipelines that ingest different kinds of input data into an existing KG. We evaluate pipelines by analyzing their output, i.e., the updated KG, with the three complementary quality metrics coverage, correctness, and consistency. We also provide benchmark datasets (seed KG, overlapping input data of three formats, reference KG as a ground truth) for the movie domain. To demonstrate the applicability and usefulness of the proposed benchmark, we comparatively evaluate 12 pipelines and analyze their behavior across different input data formats and design choices.

[AI-54] One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Link: https://arxiv.org/abs/2605.22297
Authors: Di He,Songjun Tu,Keyu Wang,Lu Yin,Shiwei Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M-1B) demonstrate that LLR achieves up to 1.5x training speedup and outperforms baselines, notably raising average zero-shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at this https URL.
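
The layerwise assignment described above can be illustrated with a minimal sketch. It assumes a Hill estimator for the power-law exponent of each layer's eigenvalue spectrum and a linear mapping from exponent to learning-rate multiplier; both choices, along with the names `hill_alpha` and `layerwise_lrs` and the `low`/`high` bounds, are illustrative assumptions and not the authors' implementation. In HT-SR, a larger exponent indicates a weaker heavy tail, so such layers receive the larger learning rates.

```python
import math

def hill_alpha(eigenvalues, k):
    """Hill estimate of the power-law tail exponent from the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < number of eigenvalues")
    threshold = ordered[k]  # the (k+1)-th largest value anchors the tail
    log_sum = sum(math.log(x / threshold) for x in ordered[:k])
    if log_sum == 0:
        raise ValueError("tail eigenvalues are all equal to the threshold")
    return 1.0 + k / log_sum

def layerwise_lrs(alphas, base_lr, low=0.5, high=1.5):
    """Map per-layer exponents linearly to LR multipliers in [low, high].

    Assumed mapping: larger alpha (weaker heavy tail) -> larger learning rate.
    """
    a_min, a_max = min(alphas), max(alphas)
    if a_max == a_min:
        return [base_lr for _ in alphas]
    span = a_max - a_min
    return [base_lr * (low + (high - low) * (a - a_min) / span) for a in alphas]
```

In practice the eigenvalues would come from each layer's weight correlation matrix; here they are passed in directly so the sketch stays dependency-free.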

[AI-55] SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

Link: https://arxiv.org/abs/2605.22287
Authors: Yuxuan Chen,Changwei Lv,Yunduo Xiao,Zhongjing Du,Daquan Zhou,Yukun Yan,Zheni Zeng,Zhiyuan Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures, 9 tables. Preprint

Abstract:Large Language Models (LLMs) are central to the one-for-all intelligent paradigm, but they face a fundamental challenge when dealing with heterogeneous scientific data such as molecules: the inherent gap between discrete linguistic symbols and topological molecular or continuous reaction data leads to significant information loss and semantic noise in text-based reasoning. We propose SciCore-Mol, a modular framework that bridges this gap through three deeply integrated pluggable cognitive modules: a topology-aware perception module, a latent diffusion-based molecular generation module, and a reaction-aware reasoning module. Each module is coupled to the LLM backbone through learned representation interfaces, enabling richer information exchange than is possible with text-only tool feedback. Our experiments on diverse chemical tasks demonstrate that SciCore-Mol achieves strong comprehensive performance across molecular understanding, generation, reaction prediction, and general chemistry knowledge, with an 8B-parameter open-source system that is competitive with and in several dimensions surpasses proprietary large models. This work provides a systematic blueprint for equipping LLMs with scientific expertise through decoupled, pluggable, and flexibly orchestrated modules, with direct implications for drug design, chemical synthesis, and broader scientific discovery.

[AI-56] EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes

Link: https://arxiv.org/abs/2605.22286
Authors: Zhaomin Wu,Jiayi Li,Bingsheng He
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-based counseling is an important interface for AI mental-health support, where transcripts may be used to monitor depression severity and flag sessions requiring timely human review. However, robust PHQ-8 prediction across session regimes remains challenging: fine-tuning-based methods can exploit richer supervision but may generalize poorly under data scarcity, while prompt-based LLM methods are data-efficient but usually treat each transcript holistically and provide limited support for longitudinal context. We study robust depression tracking from counseling transcripts across single-session and multi-session regimes. We introduce LongCounsel, a multi-session counseling dataset with session-level PHQ-8 supervision for evaluating repeated-session tracking under partial symptom disclosure and cross-session continuity. We further propose EmoTrack, a PHQ-8 prediction framework that combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors over the resulting transcript representation. When prior sessions are available, EmoTrack can further incorporate them through compact cross-session memory. Experiments on LongCounsel and DAIC-WOZ show that EmoTrack achieves a clear gain on the real single-session benchmark, including a 13.5% relative MAE reduction over the strongest DAIC-WOZ baseline, and remains competitive with the strongest longitudinal baseline on LongCounsel.

[AI-57] Detecting Atypical Clients in Federated Learning via Representation-Level Divergence

Link: https://arxiv.org/abs/2605.22266
Authors: Cristian Pérez-Corral,Jose I. Mestre,Alberto Fernández-Hernández,Manuel F. Dolz,Enrique S. Quitana-Ortí
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated learning enables collaborative training across distributed clients with heterogeneous data, but such heterogeneity often leads to unstable updates and degraded global performance. Moreover, in practical deployments, client updates may deviate from the expected behavior not only due to benign non-i.i.d. distributions, but also due to distributional shifts or anomalous inputs, raising concerns about the reliability of the aggregation process. In this work, we propose a lightweight geometric signal to quantify the functional deviation of a client with respect to the global model. Instead of comparing model parameters or gradients, our approach measures how the local training of each client alters the activation-induced partition of the input space, evaluated on a shared probe set. This yields a permutation-invariant, interpretable metric of client–global divergence that captures differences in how data is processed by the model. We show that this signal effectively identifies clients that induce atypical functional changes, distinguishing stable yet heterogeneous clients from those whose updates significantly diverge from the global regime. As a result, the proposed metric provides a simple tool for monitoring client behavior and enabling risk-aware aggregation strategies in federated learning systems.

[AI-58] Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

Link: https://arxiv.org/abs/2605.22263
Authors: Hongbin Zhang,Chaozheng Wang,Kehai Chen,Youcheng Pan,Yang Xiang,Jinpeng Wang,Min Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:On-policy self-distillation (OPSD) is an emerging LLM post-training paradigm in which the model serves as its own teacher: conditioned on privileged information such as a reference trace or hint, the same policy provides dense token-level supervision on its own rollouts. However, recent studies show that OPSD degrades complex reasoning by suppressing predictive uncertainty, which supports exploration and hypothesis revision. Our token-level analysis shows that this failure arises from applying a uniform direction of teacher supervision across tokens with different uncertainty levels: conformity to the privileged self-teacher suppresses exploration at high entropy, while deviation from the teacher degrades step accuracy at low entropy. Accordingly, we propose Direction-Adaptive Self-Distillation (DASD), which reframes privileged self-distillation from uniform teacher imitation into entropy-routed directional supervision: high-entropy tokens are pushed away from the privileged teacher to preserve exploration, while low-entropy tokens are pulled toward the teacher to stabilize step-level execution. Across six mathematical reasoning benchmarks, DASD achieves the best macro Avg@16 over strong RLVR and self-distillation baselines. Pass@k, reasoning-health, and generalization analyses show that these average gains come from preserving exploration without sacrificing step-level execution.
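
The entropy routing described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the fixed `threshold`, the use of Shannon entropy in nats, and the function names are hypothetical, and the abstract does not specify how the routing boundary or the resulting loss terms are actually defined.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def supervision_direction(probs, threshold=1.0):
    """Return -1 to push away from the teacher, +1 to pull toward it.

    Assumption: a single fixed entropy threshold separates the two regimes.
    """
    return -1 if token_entropy(probs) > threshold else 1

# A flat distribution is uncertain, so the token keeps room to explore.
uncertain = [1 / 8] * 8
# A peaked distribution is confident, so the token follows the teacher.
confident = [0.97, 0.01, 0.01, 0.01]
directions = [supervision_direction(uncertain), supervision_direction(confident)]
```

The sign returned here would multiply a per-token distillation term in a training loss; that loss is not specified in the abstract and is left out of the sketch.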

[AI-59] What are the Right Symmetries for Formal Theorem Proving?

Link: https://arxiv.org/abs/2605.22257
Authors: Krzysztof Olejniczak,Radoslav Dimitrov,Xingyue Huang,Bernardo Cuenca Grau,Jinwoo Kim,İsmail İlkan Ceylan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial variations in problem representation: semantically equivalent statements can exhibit drastically different proof success rates, revealing a failure to respect structural symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance (i.e., invariance of success probability), requiring equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance by operating on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings of the input, showing theoretically that they recover success invariance in the sampling limit, and empirically, that they improve robustness and performance under fixed inference budgets. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical route to approximate it.
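
A minimal sketch of test-time aggregation over equivalent rewritings, under a fixed attempt budget, might look as follows. The round-robin schedule, the "succeed if any attempt succeeds" rule, and the names `solve_with_rewrites` and `toy_prover` are assumptions made for illustration; the abstract does not state the exact aggregation procedure.

```python
def solve_with_rewrites(rewrites, prover, budget):
    """Spread `budget` proof attempts round-robin over equivalent statements."""
    if not rewrites:
        raise ValueError("need at least one statement")
    for attempt in range(budget):
        statement = rewrites[attempt % len(rewrites)]
        if prover(statement):
            return True
    return False

def toy_prover(statement):
    """Hypothetical stand-in for a prover that is sensitive to surface form."""
    return statement.startswith("b")

# Only one surface form is solvable by the toy prover.
single_form = solve_with_rewrites(["a + b = b + a"], toy_prover, budget=4)
with_rewrites = solve_with_rewrites(
    ["a + b = b + a", "b + a = a + b"], toy_prover, budget=4
)
```

With the same budget, the call that includes an equivalent rewriting succeeds where the single-form call does not, which is the robustness effect the abstract attributes to aggregation.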

[AI-60] Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

Link: https://arxiv.org/abs/2605.22243
Authors: Junyu Yan,Damian Machlanski,Kurt Butler,Panagiotis Dimitrakopoulos,Ewen M Harrison,Bruce Guthrie,Sotirios A Tsaftaris
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 41 pages, 7 figures

Abstract:Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their “black-box” nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, including non-linear terms for two features, and including 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and so did calibration (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve the process of developing, and the performance of high-dimensional transparent predictive models.

[AI-61] Unlocking Proactivity in Task-Oriented Dialogue

Link: https://arxiv.org/abs/2605.22240
Authors: Hongbin Zhang,Ning Gao,Yuqin Dai,Ruiyuan Wu,Jinpeng Wang,Rena Wei Gao,Bingdong Tan,Shuzheng Gao,Zongjie Li,Chaozheng Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Proactive task-oriented dialogue (TOD), such as outbound sales, demands a persuasive agent that actively probes the user’s concerns and steers the conversation toward acceptance within a bounded number of turns. Yet post-trained LLMs are inherently conservative, and reward-shaping RL (e.g., GRPO) struggles since it only re-weights what an already passive policy samples. We show that conditioning on the user’s latent concerns unlocks proactive capability that no amount of sampling can undermine, establishing these concerns as a pivotal training-time signal. To operationalize this finding, we build the Cognitive User Simulator, which models each user as a stratified persona comprising observable external traits and hidden internal concerns. The simulator produces faithful and diverse interactions, while emitting per-turn state dynamics that track persuasion progress. We then introduce Simulator-Induced Asymmetric-View Policy Optimization, which converts the modeled concerns and the simulation state transition into complementary training objectives: (1) Asymmetric On-Policy Self-Distillation that transfers concern-aware behavior from a privileged view of the same policy into its deployable, conversation-only view; and (2) State-Transition Policy Refinement …

[AI-62] Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

Link: https://arxiv.org/abs/2605.22238
Authors: H. C. Ekne
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures. Code and tracked notes: this https URL. Public runtime artifact index: this https URL

Abstract:Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p ≈ 1.5 × 10^-5). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32-game planner bakeoff is consistent with near-equality (p ≈ 0.821), which indicates that much of the earlier provider spread came from end-to-end system behavior rather than planning alone. To study mechanism, we analyze saved planning and execution traces from the provider championship. Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. Gemini also converts more turns into deep conquest chains, even though it is not the cleanest runtime. These results show that live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, and they support evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.

[AI-63] Can Transformers Learn to Verify During Backtracking Search?

Link: https://arxiv.org/abs/2605.22221
Authors: Yin Jun Phua,Tony Ribeiro,Tuan Nguyen,Katsumi Inoue
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Backtracking search underlies classical constraint solvers, planners, and theorem provers. Recent transformer-based reasoning systems explore search trees over their own intermediate steps. A common training recipe fits an autoregressive next-token loss on offline solver traces. The model’s input at each step is a cumulative trace of all prior decisions. The optimal continue-or-backtrack predictor depends only on the current search state, since two trajectories reaching the same state admit the same viable continuations. We show that decoder-only transformers trained on cumulative traces fail this requirement in two ways: the trace can scatter state features across many positions (scattered retrieval), and the predictor can condition on the trajectory rather than the state (history entanglement). We address scattered retrieval with localization, a trace-level fix that rewrites each decision block to expose state features locally. We address history entanglement with Selective State Attention (SSA), a fixed attention mask that enforces state-based decisions structurally without modifying training data, objective, or parameters. We focus on reactive verification, after propagation has exposed a contradiction. We test SSA on 3-SAT, graph coloring, Blocks World, and backtracking parsing. On same-state pairs that differ only in prior history, SSA emits identical decisions while a cumulative-trained causal baseline does not. Our contribution is a diagnostic of transformer behavior on serialized trajectory data, paired with a structural fix. Pretrained language models that search over their own reasoning steps may face the same failure. Our analysis opens up inference-time context clearing as a candidate way to apply the same isolation without retraining.
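
One way to picture a fixed attention mask that isolates the current state is a block-restricted causal mask. The sketch below assumes every token carries a `block_id` naming the decision block it belongs to and lets a token attend only to earlier tokens of the same block; this block structure and the function name `state_block_mask` are assumptions for illustration, not the SSA mask as defined in the paper.

```python
def state_block_mask(block_ids):
    """Boolean mask: mask[i][j] is True if position i may attend to position j.

    Assumed rule: causal attention (j <= i) restricted to the same block.
    """
    n = len(block_ids)
    return [
        [j <= i and block_ids[j] == block_ids[i] for j in range(n)]
        for i in range(n)
    ]

# Two decision blocks of two tokens each.
mask = state_block_mask([0, 0, 1, 1])
# Tokens in block 1 cannot see block 0, so prior history cannot leak in.
```

Because the mask depends only on positions and block boundaries, it changes neither the training data, the objective, nor the parameters, which matches the structural nature of the fix described above.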

[AI-64] SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

Link: https://arxiv.org/abs/2605.22219
Authors: Ningyuan Li,Haiyang Shen,Mugeng Liu,Yudong Han,Zhuofan Shi,Sixiong Xie,Yun Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work in Progress. 23 pages, 7 figures, preprint

Abstract:Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit guidance for state-gated retrieval. We evaluate eight CLI-based agentic LLM systems and three commercial search-agent products. On SGR-Bench, the strongest system reaches only 66.18% item-level F1, while row-level F1 remains much lower. A manual audit of 156 analyzable failed CLI trajectories shows why: agents often reach a relevant web source, but establish the wrong site-specific retrieval state. Retrieval-scope drift (37.2%) and criterion mismatch (27.6%) dominate, whereas final answer composition accounts for only 10.3%. The dataset and single-case evaluation instructions are available at this https URL.

[AI-65] Towards a compositional semantics for quantitative confidence assessment in assurance arguments

Link: https://arxiv.org/abs/2605.22213
Authors: Benjamin Herd,Jessica Kelly,Jan Sabsch,Lydia Gauerhof
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to the 21st European Dependable Computing Conference (EDCC 2026), Canterbury, UK

Abstract:Assurance arguments provide a clear and structured way to explain why stakeholders should trust that a system satisfies certain properties, yet widely used notations, e.g., Goal Structuring Notation (GSN), typically lack an operational semantics for deriving assurance confidence. Existing approaches address structure and soundness but largely reason over truth values, not over confidence in the justification of claims. Subjective Logic (SL) offers a calculus of belief, disbelief, and uncertainty with operators for combining opinions, enabling confidence propagation under incomplete, conflicting, or subjective evidence. However, existing SL-based approaches do not provide a uniform, compositional semantics that covers all argument elements and relations to enable overall confidence assessment. We propose a confidence semantics that represents argument elements as SL opinions and maps relations between elements to SL operators modelling how confidence flows, effectively turning the argument into an analyzable confidence network. The approach provides explicit warrants, principled handling of context, preserved provenance, and compatibility with GSN, along with practical guidance using an exemplary assurance confidence assessment.

[AI-66] CLORE: Content-Level Optimization for Reasoning Efficiency

Link: https://arxiv.org/abs/2605.22211
Authors: Yuyang Wu,Qiyao Xue,Guanxing Lu,Weichen Liu,Zihan Wang,Manling Li,Olexandr Isayev
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures

Abstract:Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented–original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy–efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.

[AI-67] Temporal Coding as a Substrate for Sensorimotor Object Inference: A Spiking Reinterpretation of Thousand Brains Architecture

Link: https://arxiv.org/abs/2605.22206
Authors: Joy Bose
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 18 pages, 5 figures

Abstract:The Thousand Brains Theory (TBT) and its open-source Monty framework model object recognition through sensorimotor inference – identifying objects by actively moving a sensor across their surface and building evidence contact by contact. The current implementation encodes each contact as a dense floating-point vector. While Monty tracks inter-step displacement and accumulates evidence across contacts, it treats the feature activation pattern at each contact as an unordered set - the directional sequence in which features are encountered carries no representational weight. In TBT, the sequence of contacts carries spatial meaning: knowing that feature A was felt before feature B during a left-to-right sweep tells you something about where A and B sit on the object. Dense vectors discard this ordering. We propose replacing dense vectors with rank-order spike packets: each contact produces a brief burst of neural events where the most strongly activated neuron fires first. The time gap between successive bursts implicitly encodes sensor displacement without explicit coordinate calculations. A biologically motivated learning rule (STDP) encodes traversal direction into synaptic weights. A learnable parameter lambda adjusts reliance on earlier versus recent contacts, adapting to each object’s geometry. We derive three testable predictions and specify an implementation of four components in approximately 450 lines of NumPy. Three synthetic experiments confirm the core claims: temporal coding achieves perfect discrimination accuracy on objects with identical features in different spatial arrangements, where dense accumulation performs at chance; temporal coding maintains a 30-50 percentage point advantage across all tested noise levels; the adaptive lambda converges to distinct values, reflecting object geometric complexity. End-to-end evaluation on Monty’s YCB benchmark is left for future work.
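
The rank-order spike packet idea can be illustrated with a short sketch: a dense activation vector is replaced by the order in which neurons would fire, strongest first. The function name `rank_order_packet` and the optional cut-off `k` are hypothetical; the paper's actual encoding, timing model, and STDP rule are not reproduced here.

```python
def rank_order_packet(activations, k=None):
    """Return neuron indices ordered by firing time, strongest activation first.

    Ties keep index order because Python's sort is stable.
    """
    order = sorted(range(len(activations)), key=lambda i: -activations[i])
    return order if k is None else order[:k]

# Same two neurons active, but with swapped strengths.
contact_a = rank_order_packet([0.9, 0.5])
contact_b = rank_order_packet([0.5, 0.9])
# A dense "which features are active" set cannot tell these apart;
# the rank-order packets differ.
```

Keeping only the first `k` indices gives a brief burst per contact, which is one plausible reading of the "packet" described in the abstract.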

[AI-68] Skill Weaving: Efficient LLM Improvement via Modular Skillpacks ACL2026

Link: https://arxiv.org/abs/2605.22205
Authors: Zhuo Li,Guodong Du,Zesheng Shi,Weiyang Guo,Weijun Yao,Yuan Zhou,Jiabo Zhang,Jing Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ACL2026

Abstract:Large language models increasingly require specialization across diverse domains, yet existing approaches struggle to balance multi-domain capacities with strict memory and inference constraints. In this work, we introduce SkillWeave, a modular improvement framework that enables LLMs to specialize under fixed memory budgets. SkillWeave partitions full capabilities of a general-purpose model into skillpacks – lightweight, domain-specific delta modules – that reorganize and refine the model’s internal knowledge. For efficient deployment, SkillWeave integrates SkillZip to compress skillpacks into compact and inference-ready format, enabling strong multi-domain performance with low-latency execution. On multi-task and agentic benchmarks, a 9B SkillWeave model outperforms several baselines and even surpasses a 32B monolithic LLM, while achieving up to 4x speedup.

[AI-69] Action with Visual Primitives

Link: https://arxiv.org/abs/2605.22183
Authors: Weilong Guo,Yuchen Wang,Renping Zhou,Yunfeng Zhang,Rui Fang,Yue Meng,Wenda Xu,Yuan He,Gao Huang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures. Project page: this https URL

Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements this visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 27.61% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.

[AI-70] LLM-Metrics: Measuring Research Impact Through Large Language Model Memory

Link: https://arxiv.org/abs/2605.22176
Authors: Si Shen,Wenhua Zhao,Danhao Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures

Abstract:Citation counts remain the dominant metric for assessing research impact, yet they suffer from well-documented limitations: temporal lag, disciplinary bias, and Matthew effects. Here we propose LLM-Metrics, a research-impact assessment metric derived from the parametric memory of large language models (LLMs). The central hypothesis is that high-impact papers receive greater exposure in the academic community, that this exposure enters LLM training data in textual form, and that models consequently form stronger parametric memory of these papers. We designed four types of multiple-choice probes, covering title recognition, author recognition, method recognition, and venue recognition, and evaluated 549 computer science papers published in 2023-2024 across 17 LLMs spanning 0.5B to 72B parameters from six vendors. Of the 17 models, 15 produced positive predictions, 9 of which were significant at p less than 0.05, with an overall Spearman correlation of rho = 0.1495 and p = 0.0004 against citation counts. Three additional findings support the proposed mechanism. First, the predictive signal was stronger for 2024 papers, rho = 0.1880, whose citation counts were near zero at model-training time, reducing the plausibility of a simple reverse-causality explanation. Second, author-recognition probes showed the strongest discriminative power, consistent with an exposure-driven memory mechanism. Third, model scale and predictive power were non-monotonic: a 3B-parameter model, Llama-3.2-3B-Instruct, with rho = 0.1829, outperformed most larger models, supporting a selective-memory hypothesis in which the limited capacity of smaller models can serve as an effective information filter. LLM-Metrics offers a real-time, cross-disciplinary, citation-independent paradigm for research assessment.
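
The headline statistic in this abstract is a Spearman rank correlation between probe accuracy and citation counts. As a reminder of what that measures, here is a dependency-free sketch that ranks both series (averaging ranks over ties) and takes the Pearson correlation of the ranks. The function names and the toy inputs are illustrative only and are not data from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed series."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Because only ranks matter, the coefficient is insensitive to the heavy skew typical of citation counts, which is a common reason to prefer it over Pearson correlation in this setting.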

[AI-71] SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

Link: https://arxiv.org/abs/2605.22175
Authors: Yuxuan Sun,Yuze Zhao,Yufeng Wang,Yao Du,Zhiyuan Ma,Jinbo Wang,Mengdi Zhang,Kai Zhang,Zhenya Huang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 24 pages, 8 figures

Abstract:Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.

[AI-72] Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

Link: https://arxiv.org/abs/2605.22168
Authors: Joël Roman Ky, Salah Ghamizi, Maxime Cordy
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other. To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{\mathrm{syn}}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. Evaluating 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.
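For intuition about the "joint Harsanyi dividend between modalities", consider the two-player case where the players are the image and the text. The dividend of the pair is the value of using both inputs minus what each input explains alone, corrected for the empty baseline. The sketch below computes only this textbook quantity; it is not the paper's scalable estimator, and the accuracy numbers are invented for illustration.

```python
def joint_dividend(v_none, v_image, v_text, v_both):
    """Harsanyi dividend of the {image, text} coalition in a two-player game.

    v_none  -- model score with both modalities masked
    v_image -- score with only the image available
    v_text  -- score with only the text available
    v_both  -- score with both modalities available
    """
    return v_both - v_image - v_text + v_none


# Redundant case (hypothetical numbers): the text alone already answers
# the query, so adding the image contributes no joint effect.
redundant = joint_dividend(v_none=0.25, v_image=0.30, v_text=0.90, v_both=0.90)

# Synergistic case (hypothetical numbers): neither modality suffices
# alone, but together they do.
synergistic = joint_dividend(v_none=0.25, v_image=0.30, v_text=0.30, v_both=0.90)
```

A dividend near zero or below signals redundancy, which is the regime where, per the abstract, unimodal perturbation metrics become misleading.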

[AI-73] Adapting the Interface Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

Link: https://arxiv.org/abs/2605.22166
Authors: Tianshi Xu, Huifeng Wen, Meng Li
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model–environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed during held-out evaluation. On seven deterministic environments from $\tau$-bench, $\tau^2$-bench, and AgentBench, Life-Harness improves 116 out of 126 model–environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at GitHub.

[AI-74] One-Way Policy Optimization for Self-Evolving LLMs

Link: https://arxiv.org/abs/2605.22156
Authors: Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, Li Yuan
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip the verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a "Ratchet Effect" that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.

[AI-75] IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

Link: https://arxiv.org/abs/2605.22154
Authors: Daewon Choi, Kyunghyun Park, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Jinwoo Shin, Aram Galstyan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM)-based agents solve complex tasks by leveraging multi-step reasoning with iterative tool calls and environment interactions, which incur idle time while waiting for observations. Despite the prevalence of idle time in most agentic scenarios, existing works treat it as an unavoidable overhead or propose restricted solutions that overlook varying computational budgets across different tool calls and future observation uncertainty, thereby leading to suboptimal utilization of idle time. In this paper, we introduce IdleSpec, a scalable and generic inference approach that leverages idle-time computation to improve agent performance while minimizing latency overhead. Specifically, IdleSpec iteratively generates plan candidates during idle periods and, once observations become available, aggregates them to guide the next reasoning step. For effective plan generation under observation uncertainty, IdleSpec samples between complementary drafting strategies (i.e., progressive and recovery) from a learned distribution that is updated via posterior feedback. Our experiments demonstrate that IdleSpec significantly improves agent performance in various agentic scenarios by effectively utilizing idle time. In particular, on the GAIA and FRAMES, IdleSpec achieves 55.6% average accuracy with Gemini-2.5-Flash, surpassing the vanilla baseline without idle-time usage by 5.1%. Furthermore, for MLE-Bench, which involves substantial delay from code executions, IdleSpec achieves performance gains of up to 9.1% on the Any Medal rate, highlighting its generalizability to long-horizon tasks.

[AI-76] Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

Link: https://arxiv.org/abs/2605.22142
Authors: Taewoon Kim, Vincent François-Lavet, Michael Cochez
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.
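The per-item design described above can be pictured as one Q-function, with shared parameters, applied independently to every triple in the short-term buffer. The sketch below uses a linear Q-function and a standard one-step temporal-difference update; the feature layout, reward values, learning rate, and discount are all assumptions for illustration, not the paper's architecture or hyperparameters.

```python
GAMMA = 0.9          # discount factor (assumed)
LEARNING_RATE = 0.1  # step size (assumed)
DROP, KEEP = 0, 1    # the two per-item actions


def q_values(weights, features):
    """Linear Q-values [Q(item, DROP), Q(item, KEEP)] with shared weights."""
    return [sum(w * f for w, f in zip(weights[a], features)) for a in (DROP, KEEP)]


def td_update(weights, features, action, reward, next_features=None):
    """One-step TD update for a single item.

    next_features is the feature vector of the matched item at the next
    step, or None when the item has no match (treated as terminal).
    """
    target = reward
    if next_features is not None:
        target += GAMMA * max(q_values(weights, next_features))
    td_error = target - q_values(weights, features)[action]
    for i, f in enumerate(features):
        weights[action][i] += LEARNING_RATE * td_error * f
    return td_error


# Hypothetical two-dimensional item features: [is_query_relevant, is_stale].
weights = [[0.0, 0.0], [0.0, 0.0]]
first_error = td_update(weights, [1.0, 0.0], KEEP, reward=1.0)
second_error = td_update(weights, [0.0, 1.0], DROP, reward=0.0,
                         next_features=[1.0, 0.0])
```

Since the same weights score every item, the buffer can hold any number of triples without changing the model, which is how a per-item formulation handles variable-sized inputs.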

[AI-77] Adversarial Trust Poisoning in Vehicular Collaborative Perception

Link: https://arxiv.org/abs/2605.22122
Authors: Yutong Liu, Chenyi Wang, Ming F. Li, Qingzhao Zhang
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Collaborative perception (CP) enables connected and autonomous vehicles to share sensor data and jointly reason about their environment. To defend against adversaries that fabricate or manipulate shared data, existing systems employ cross-vehicle inconsistency detection and trust estimation, penalizing vehicles whose observations conflict with the majority. In this work, we show that these defenses themselves introduce a new attack surface. We present TrustFlip, a novel attack that weaponizes consistency-based defenses to poison the trust assigned to benign vehicles. Instead of injecting false data into the collaboration pipeline, it deploys physical adversarial objects that are genuine but induce inconsistent observations among benign vehicles. The resulting inconsistencies are misattributed by the defense to the targeted vehicle, causing its trust score to degrade and eventually leading to its downweighting or exclusion from collaboration. Consequently, the system loses reliable sensing contributors, degrading perception capability and potentially inducing safety-critical failures. We evaluate TrustFlip across multiple collaborative perception architectures and defense mechanisms. Our results show that state-of-the-art defenses can be significantly affected: the attack removes the targeted benign vehicle from collaboration in up to 87.7% of scenarios and drops Average Precision (AP) by up to 13%. As an initial mitigation, we introduce TrustReflect, a lightweight self-reflection mechanism that marks disputed regions as uncertain and excludes them from trust evaluation, reducing the attack success rate by 35-100%.
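The attack surface described above can be illustrated with a toy majority-consistency trust rule. In this sketch each vehicle reports whether it sees an object in a region, vehicles that disagree with the majority lose trust, and an optional flag skips disputed regions entirely, loosely mirroring the mitigation idea. The update rule, penalty values, and vehicle names are assumptions; real defenses and TrustReflect itself are more elaborate.

```python
def update_trust(trust, reports, penalty=0.2, reward=0.05, skip_disputed=False):
    """Return updated trust scores after one round of reports.

    trust   -- dict vehicle -> score in [0, 1]
    reports -- dict vehicle -> bool (does the vehicle see the object?)
    """
    votes = list(reports.values())
    unanimous = len(set(votes)) == 1
    if skip_disputed and not unanimous:
        # Disputed region: treat as uncertain, leave trust untouched.
        return dict(trust)
    majority = sum(votes) * 2 > len(votes)
    updated = {}
    for vehicle, saw_object in reports.items():
        delta = reward if saw_object == majority else -penalty
        updated[vehicle] = min(1.0, max(0.0, trust[vehicle] + delta))
    return updated


trust = {"target": 0.8, "peer_1": 0.8, "peer_2": 0.8}
# A genuine physical object that only the target vehicle can observe,
# for example because of its viewing angle.
reports = {"target": True, "peer_1": False, "peer_2": False}

poisoned = dict(trust)
for _ in range(3):
    poisoned = update_trust(poisoned, reports)

guarded = update_trust(trust, reports, skip_disputed=True)
```

No false data is injected in this scenario: the target's report is truthful, yet the majority rule still penalizes it, which is the misattribution the attack exploits.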

[AI-78] ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

Link: https://arxiv.org/abs/2605.22106
Authors: Yeqiu Chen, Ziyan Liu, Zhenxin Huang, Runquan Gui, Hong Wang, Lei Liu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference as a tree-structured search with branching and backtracking, but it substantially amplifies the Key–Value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to ~4x peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.
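The observation that the active branch and its ancestors matter most suggests a simple tree-aware policy, sketched below. It protects the path from the active leaf to the root, evicts the lowest-value inactive nodes until a token budget is met, and restores a path on revisit. This is a coarse node-level simplification: the paper evicts at token granularity with a learned value estimator, whereas the `value` field and all sizes here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"]
    kv_tokens: int          # KV-cache tokens held by this reasoning step
    value: float = 0.0      # hypothetical estimate of future usefulness
    resident: bool = True   # True while the KV state is in device memory


def path_to_root(node):
    """Return the node and all of its ancestors."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def evict_to_budget(nodes, active_leaf, budget_tokens):
    """Evict low-value inactive nodes until usage fits the budget."""
    protected = {id(n) for n in path_to_root(active_leaf)}
    used = sum(n.kv_tokens for n in nodes if n.resident)
    candidates = sorted(
        (n for n in nodes if n.resident and id(n) not in protected),
        key=lambda n: n.value,
    )
    for node in candidates:
        if used <= budget_tokens:
            break
        node.resident = False
        used -= node.kv_tokens
    return used


def rehydrate(node):
    """On backtracking, mark the revisited path resident again.

    A real system would recompute or reload the KV states here.
    """
    for n in path_to_root(node):
        n.resident = True


root = Node("root", None, 10)
a = Node("a", root, 20, value=0.9)
b = Node("b", root, 30, value=0.2)
a1 = Node("a1", a, 40, value=0.8)
used_after = evict_to_budget([root, a, b, a1], active_leaf=a1, budget_tokens=70)
```

Backtracking to the evicted branch would call `rehydrate(b)`, trading recomputation time for the memory saved while that branch was inactive.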

[AI-79] ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

Link: https://arxiv.org/abs/2605.22102
Authors: Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent’s belief state and contaminate later reasoning. Existing test-time scaling methods provide limited control over this process, as they often rely on agents to detect their own mistakes, select among flawed trajectories, or refine solutions only after errors have already shaped the reasoning path. We propose ExComm, a communication protocol for exploration-stage agentic test-time scaling. ExComm is motivated by the empirical observation that the majority of intermediate errors in parallel agentic reasoning produce detectable cross-agent factual conflicts. Leveraging the iterative structure of agentic workflows, ExComm periodically audits agent belief states to detect such conflicts, resolves them through a dedicated tool-based verification loop, and returns concise, targeted feedback to the involved agents. Corrections are incorporated through soft belief updates, which append verified feedback rather than overwriting existing beliefs. To keep communication from collapsing trajectory diversity, ExComm also introduces a trajectory diversification module that redirects redundant trajectories toward orthogonal strategies. Experiments on AIME 2024, AIME 2025, and GAIA with Gemini-2.5-Flash-Lite and Qwen3.5-4B show that ExComm consistently outperforms strong test-time scaling baselines, achieving average performance gains of 5.7% and 5.0% over the best-performing baselines, respectively. Further analyses demonstrate improved error recovery, favorable scaling behavior, stronger diversity than adapted communication baselines, and the best performance-cost trade-off among the evaluated methods.
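The audit-and-feedback loop above can be sketched with belief states represented as simple key-value facts. The code below finds facts on which agents disagree and appends verified feedback to the involved agents without overwriting what they already believe. The dictionary representation, the fact names, and the stand-in for verification are assumptions made for this illustration; the paper's belief states and tool-based verifier are richer.

```python
def find_conflicts(beliefs):
    """Return facts whose value differs across agents.

    beliefs -- dict agent -> dict fact_key -> value
    Result maps each conflicting fact to {value: [agents holding it]}.
    """
    seen = {}
    for agent, facts in beliefs.items():
        for key, value in facts.items():
            seen.setdefault(key, {}).setdefault(value, []).append(agent)
    return {key: by_value for key, by_value in seen.items() if len(by_value) > 1}


def soft_update(feedback_log, agent, key, verified_value):
    """Append verified feedback for an agent; existing beliefs stay intact."""
    feedback_log.setdefault(agent, []).append(f"verified: {key} = {verified_value}")


beliefs = {
    "agent_1": {"remainder_mod_7": 3, "n_is_even": True},
    "agent_2": {"remainder_mod_7": 5, "n_is_even": True},
    "agent_3": {"remainder_mod_7": 3},
}
conflicts = find_conflicts(beliefs)

# Stand-in for the verification step: suppose a tool call confirmed 3.
feedback_log = {}
for key, by_value in conflicts.items():
    for agents in by_value.values():
        for agent in agents:
            soft_update(feedback_log, agent, key, 3)
```

Appending rather than overwriting keeps each agent's reasoning trace intact, so the agent can reconcile the correction with its own context.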

[AI-80] MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Link: https://arxiv.org/abs/2605.22100
Authors: Bangbang Zhou, Hangdi Xing, Yifan Chen, Jianjun Xu, Qi Zheng, Feiyu Gao, Zhibo Yang, Shuai Bai, Ming Yan, Jieping Ye, Hongtao Xie
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.

[AI-81] Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)

Link: https://arxiv.org/abs/2605.22093
Authors: Enrico Daga, Valentina Tamma, Terry Payne
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge graphs have become the primary vehicle for data integration and are critical to the success of modern AI, but the diversity of KG modelling practices, from lightweight vocabularies to richly axiomatised ontologies, makes integration and reuse expensive and brittle. This challenge is particularly acute in neuro-symbolic AI, where bridging neural and symbolic components depends on the ability to reengineer KGs to fit new requirements; GenAI now offers unprecedented automation capability, but without a principled understanding of the KG space, such automation remains conceptually ungrounded. We introduce the ontological continuum as that missing conceptualisation, a theoretical construct whose characterisation framework is defined by two orthogonal distinctions: semantics vs pragmatics, and properties vs affordances; together these define a vocabulary to describe, compare, navigate, and transform KGs across the full range of modelling practices. The methodological stance is empirical: rather than prescribing how KGs should be modelled, the continuum aims to define a theory of the existent, derived from observation of real-world KG engineering practices and whose structure can be made formally explicit, for example, through Formal Concept Analysis (FCA). We ground the vision through a case study on provenance knowledge, showing how a single concern manifests differently across the continuum. We articulate five open research challenges and invite the community to develop the ontological continuum as a shared research agenda.

[AI-82] A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing

Link: https://arxiv.org/abs/2605.22090
Authors: Wenfeng Wu, Luping Xiang, Kun Yang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The detection of non-cooperative unmanned aerial vehicles (UAVs) presents significant challenges for Integrated Sensing and Communication (ISAC) systems due to the inherent limitations of single-modal perception and the competition for shared communication and sensing resources. To address these challenges, this paper proposes a novel Camera-Cooperative ISAC (CC-ISAC) framework that employs multimodal sensing to enable efficient UAV beam steering and tracking. The proposed framework employs cameras for coarse-grained airspace monitoring and utilizes ISAC for fine-grained, high-precision sensing, forming a complementary perception loop that enhances both sensing accuracy and resource efficiency. Within this framework, two key modules are developed: (1) a Vision-to-Echo Data Alignment (V2EDA) model that aligns visual and echo-domain features through cross-attention mechanisms, and (2) a Multimodal Fusion-Based Estimation (MMFE) model that integrates historical multimodal data with current observations for robust state estimation. Extensive evaluations conducted on the DeepSense 6G dataset demonstrate that the proposed framework achieves an average reduction of 71% in beam steering overhead and 1.69-11.15% in tracking overhead while maintaining high angular estimation accuracy. The CC-ISAC framework effectively mitigates resource contention between sensing and communication, enabling reliable UAV surveillance while freeing substantial system resources for additional communication tasks, thereby representing a practical advancement in ISAC system design.

[AI-83] Safeguarding Text-to-Image Generative Models Against Unauthorized Knowledge Distillation

Link: https://arxiv.org/abs/2605.22060
Authors: Yilan Gao, Sida Huang, Hongyuan Zhang, Xuelong Li
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Closed-weight generative services are increasingly deployed through query-based APIs, where users can obtain generated outputs while model parameters remain inaccessible. However, such deployment does not prevent model stealing: an attacker can repeatedly query the service, collect large volumes of released synthetic images, and use them as training data for a private substitute model. This query-output-driven process enables unauthorized knowledge distillation and capability replication without direct access to the original weights. To mitigate this threat, a practical defense should preserve the visual fidelity of released images, provide explicit control over perturbation magnitude, and scale efficiently to large-volume output release. We present WaveGuard, a single-pass, generator-based protection framework that safeguards released synthetic images under a user-specified perturbation budget. WaveGuard employs a frequency-aware perturbation generator to inject structured, imperceptible perturbations that maintain perceptual utility for benign viewers while reducing the usefulness of protected images as training data for unauthorized student models. Extensive experiments under WikiArt-related synthetic-output distillation settings show that WaveGuard achieves a favorable efficacy–fidelity–efficiency trade-off, with explicit imperceptibility control and substantial gains in protection efficiency.

[AI-84] Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series

Link: https://arxiv.org/abs/2605.22055
Authors: Xianhao Song, Yuang Zhang, Yuqi She, Liping Wang, Xuemin Lin
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Time Series Classification (TSC) is a long-standing research problem that has gained increasing attention in recent years with the rapid growth of large-scale temporal data. Despite substantial progress enabled by deep learning, designing TSC models that are both accurate and interpretable remains a challenging task. Many existing approaches adopt a direct feature-to-label classification paradigm: by collapsing high-dimensional temporal embeddings into class logits via a single linear projection (often after global pooling), this paradigm conflates feature extraction and decision logic into an inseparable mapping. To address these limitations, we propose PDFTime, a prototype-guided framework that reformulates time series classification as a multi-stage decision process. Instead of direct feature-to-label mapping, PDFTime leverages learned prototypes to approximate class-conditional feature distributions in the latent space, enabling progressive discrimination through classification sub-tasks of varying granularity. To our knowledge, PDFTime is the first framework to reformulate time series classification as a decoupled, multi-stage similarity-based reasoning process, breaking the long-standing paradigm of direct, black-box feature-to-label mapping. Extensive evaluations demonstrate that PDFTime achieves state-of-the-art (SOTA) performance across UEA and UCR benchmarks. Notably, it secures the top-1 accuracy on 80 out of 128 datasets in the UCR archive, significantly outperforming recent strong baselines in both consistency and generalization.

[AI-85] LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation ICML2026

Link: https://arxiv.org/abs/2605.22054
Authors: Zhuo Chen (equal contribution) (1 and 2), Xinzhe Yuan (equal contribution) (1 and 3), Jianshu Zhang (1 and 4), Jinzong Dong (1 and 5), Ruichen Zhou (6), Yingchun Niu (6), Tianhang Zhou (7), Yu Yang Fredrik Liu (8), Yuqiang Li (1), Nanyang Ye (1 and 4), Qinying Gu (1)
Affiliations: (1) Shanghai Artificial Intelligence Laboratory, Shanghai, China; (2) School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; (3) Institute for Advanced Study in Mathematics, Harbin Institute of Technology, Harbin, China; (4) School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; (5) School of Automation, Central South University, Changsha, China; (6) College of New Energy and Materials, China University of Petroleum, Beijing, China; (7) College of Carbon Neutrality Future Technology, China University of Petroleum, Beijing, China; (8) DeepVerse PTE. LTD., Singapore
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026

Abstract:The high cost and data scarcity in scientific exploration have motivated the use of large language models (LLMs) as knowledge-driven components in Bayesian optimization (BO). However, existing approaches typically embed LLMs directly into the sampling or surrogate modeling pipeline, without fully leveraging their significantly lower evaluation cost compared to real-world experiments. To address this limitation, we propose LLM-Accelerated Bayesian Optimization (LABO), a framework that combines LLM predictions with experimental observations within a single BO loop. LABO employs a gating criterion to dynamically balance the reliance on LLM predictions versus actual experiments. By leveraging inexpensive LLM evaluations to broadly explore the search space and reserving costly real experiments only for regions with high uncertainty, LABO achieves more sample-efficient optimization. We provide a theoretical analysis with a cumulative regret bound that formalizes this efficiency gain. Empirical results across diverse scientific tasks demonstrate that LABO consistently outperforms existing methods under identical experimental budgets. Our results suggest that LABO offers a practical and theoretically grounded approach for integrating LLMs into scientific discovery workflows.
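The gating idea above, cheap LLM predictions for broad exploration and real experiments only where uncertainty is high, can be sketched as a single dispatch function. Everything concrete here is an assumption for illustration: the threshold, the distance-based uncertainty proxy, the toy objective, and the stand-in `llm_predict` function. The paper's actual gating criterion and surrogate model are not specified in the abstract.

```python
def gated_evaluate(x, llm_predict, run_experiment, uncertainty, threshold=0.5):
    """Return (value, source) for a candidate x.

    Runs the costly experiment only when the uncertainty at x exceeds
    the threshold; otherwise falls back to the cheap LLM prediction.
    """
    if uncertainty(x) > threshold:
        return run_experiment(x), "experiment"
    return llm_predict(x), "llm"


# Toy setup (all hypothetical).
observed_points = [0.0, 4.0]          # inputs already measured for real


def true_objective(x):
    """Stand-in for an expensive laboratory measurement."""
    return -(x - 2.0) ** 2


def llm_predict(x):
    """Stand-in for a cheap but slightly biased LLM estimate."""
    return true_objective(x) + 0.1


def uncertainty(x):
    """Crude proxy: distance to the nearest real observation."""
    return min(abs(x - p) for p in observed_points)


candidates = [0.1, 2.0, 3.9]
results = [gated_evaluate(x, llm_predict, true_objective, uncertainty)
           for x in candidates]
experiments_run = sum(1 for _, source in results if source == "experiment")
```

Only the candidate far from any real observation triggers an experiment, so the experimental budget is spent where the cheap predictions are least trustworthy.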

[AI-86] Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

Link: https://arxiv.org/abs/2605.22047
Authors: Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu, Lu Gan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.

[AI-87] Secure and Parallel Determinant Computation for Large-Scale Matrices in Edge Environments

Link: https://arxiv.org/abs/2605.22039
Authors: Prajwal Panth
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Mathematical Software (cs.MS)
Comments: 15 pages, 7 figures, 5 tables. This paper was first made public in October 2024 and subsequently posted as v1 on TechRxiv (Dec 10, 2025): this https URL . The present arXiv submission is identical to that version (v1)

Abstract:The advent of edge computing has enabled resource-constrained clients to delegate intensive computational tasks to distributed edge servers, especially within Internet of Things (IoT) environments. Among such tasks, Matrix Determinant Computation (MDC) remains critical for applications in control systems, cryptography, and machine learning. However, the cubic complexity of traditional determinant algorithms makes them unsuitable for real-time processing in constrained edge scenarios. We propose a Secure Parallel Determinant Computation (SPDC) framework, which provides strong security guarantees, including privacy-preserving MDC, across N distributed edge servers. The framework achieves privacy through Composite Element Distortion (CED), a lightweight encryption method that combines Element-wise Obfuscation (EWO) and the Panth Rotation Theorem (PRT) to conceal both structural and numerical matrix content while preserving determinant properties. Parallel LU decomposition is used to distribute encrypted matrix blocks across an arbitrary number of untrusted edge servers, enabling efficient and scalable determinant computation. A one-way communication model further reduces coordination overhead by eliminating inter-server interactions. To ensure result integrity with minimal client burden, we further introduce two verification algorithms: Q_2, a probabilistic scalar method, and Q_3, a deterministic and low-complexity alternative. Mathematical analysis demonstrates that the proposed framework provides strong privacy and security guarantees, low computational overhead, and deployment flexibility, making it well-suited for secure, scalable, and real-time MDC in distributed edge-assisted systems.
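Two ingredients of the abstract above lend themselves to a small worked example: computing a determinant by LU-style elimination, and masking a matrix in a way that lets the client recover the true determinant afterwards. The sketch below uses exact rational arithmetic and a deliberately simple row-scaling mask, relying on the identity det(DA) = det(D) det(A) for a diagonal matrix D. This mask is only a stand-in and offers no real security; it is not the paper's CED, EWO, or PRT construction, and the function names are hypothetical.

```python
from fractions import Fraction
import random


def determinant(matrix):
    """Determinant via Gaussian elimination with row swaps (O(n^3))."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def mask_rows(matrix, rng):
    """Client side: scale each row by a secret non-zero factor."""
    scales = [Fraction(rng.randint(2, 9)) for _ in matrix]
    masked = [[s * v for v in row] for s, row in zip(scales, matrix)]
    return masked, scales


def unmask(det_masked, scales):
    """Client side: divide out the product of the secret factors."""
    product = Fraction(1)
    for s in scales:
        product *= s
    return det_masked / product


A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
masked, scales = mask_rows(A, random.Random(0))
recovered = unmask(determinant(masked), scales)  # "server" sees only masked
```

The client's work is linear in the number of rows for unmasking, while the cubic-cost elimination runs on the masked matrix, which is the division of labor that outsourcing schemes of this kind aim for.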

[AI-88] From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents

Link: https://arxiv.org/abs/2605.21996
Authors: Murong Ma, Tianyu Chen, Yun Lin, Shuai Lu, Qinglin Zhu, Yeyun Gong, Zhiyong Huang, Peng Cheng, Yan Lu, Jin Song Dong
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Supervised fine-tuning (SFT) on long teacher trajectories is the dominant way to instill investigation and reasoning in open software-engineering (SWE) agents. Since every retained response becomes an imitation target, the student inherits the final outcome and intermediate flaws, including ungrounded leaps and redundant loops. High-quality training data must be effective (each step is grounded and narrows the agent’s epistemic gap to the correct fix) and efficient (each step is information-bearing rather than redundant or looping). Existing recipes filter or relabel teacher rollouts using only a binary terminal verifier, which does not directly target these axes and provides no supervision on instances where the teacher fails. Most real issues include a developer-authored reference patch, $p^\star$, revealing the file paths, runtime behaviors, and coding conventions presupposed by the correct fix, yet standard pipelines discard it. We propose Patches-to-Trajectories (P2T), which uses $p^\star$ as privileged information during curation and formulates trajectory construction as bi-objective optimization over per-step effectiveness and trajectory length. A reverse phase distills $p^\star$ into a latent process graph, $G^\star$, of contextual facts and solution milestones. A forward phase curates trajectories from blinded teacher continuations by scoring per-step progress against $G^\star$ under a leakage-blocking groundedness check and retaining the shortest effective segments. Using only 1.8k curated SWE-Gym instances, P2T improves effectiveness and efficiency over outcome-filtered SFT and its tool-error-masking variant. On SWE-bench Verified, it raises Pass@1 by up to 10.8 points while reducing per-instance inference cost by ~15%, with consistent gains on SWE-bench Lite. Size-matched ablations and qualitative analysis further isolate trajectory quality from data scale.

[AI-89] Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs

Link: https://arxiv.org/abs/2605.21994
Authors: Yoav Kor Sade,Arvindh Arun,Rishi Puri,Steffen Staab,Maya Bechler-Speicher
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:GraphRAG conditions language models on subgraphs retrieved from knowledge graphs, encoded via message-passing GNNs. Because these encoders entangle node contributions through iterated neighborhood aggregation, there is no closed-form way to determine how much each retrieved entity influenced the encoder’s output, and therefore no way to faithfully audit what structural evidence actually reached the model. We introduce Ex-GraphRAG, which replaces the GNN encoder with a Multivariate Graph Neural Additive Network (M-GNAN), an extension of additive graph models to high-dimensional embedding spaces that yields an exact decomposition of the encoder’s output across individual nodes and feature groups, without post-hoc approximation. On STaRK-Prime, this auditable encoder matches black-box performance. Using it to audit evidence routing, we uncover a semantic-structural mismatch: the nodes that dominate the encoder’s output are structurally disconnected in the retrieved subgraph, held together by low-attribution intermediaries whose removal degrades multi-hop QA by up to 28%. This mismatch, invisible to any opaque encoder, reveals that semantic importance and structural connectivity are governed by disjoint sets of nodes, with direct implications for retrieval pruning, context construction, and failure diagnosis in graph-augmented LLMs.

[AI-90] ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

Link: https://arxiv.org/abs/2605.21993
Authors: Miaobo Hu,Shuhao Hu,BoKun Wang,Yina Sa,Xin Wang,Xiaobo Guo,Daren Zha,Jun Xiao
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.
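The shift from ordinary NDCG to a certificate-aware score can be illustrated with a small sketch. The abstract does not define CertNDCG, so the variant below is only an assumption: an item earns its gain only when its evidence certificate validates. The function names and the example gains are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k):
    """Standard NDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

def certified_ndcg_at_k(ranked_gains, certificate_ok, k):
    """Assumed variant (not the paper's definition): gain counts only if certified."""
    credited = [g if ok else 0.0 for g, ok in zip(ranked_gains, certificate_ok)]
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(credited[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking: relevance gains and whether each cited span validated.
gains = [3.0, 2.0, 0.0, 1.0]
certs = [True, False, True, True]
plain = ndcg_at_k(gains, k=3)
certified = certified_ndcg_at_k(gains, certs, k=3)
```

Under this reading, a correct ranking with an invalid certificate scores lower than the same ranking with valid evidence, which is one way to couple the decision to its evidence.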

[AI-91] Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables EMNLP2026 ACL

Link: https://arxiv.org/abs/2605.21974
Authors: Jingxuan Qi,Zhiqiang Ye,Yuxiang Feng
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: 8 pages main body, 18 pages appendices. Submitted to EMNLP 2026 via ACL Rolling Review (ARR). Corresponding author: Yuxiang Feng (yxfeng@scut. this http URL ). Code and data available at this https URL

Abstract:An extraction schema should not reduce knowledge graph fidelity. On statistical CSV, however, it can. We study country-by-year time-series matrices, a common layout on open-data portals. In this setting, serialization format and schema constraints interact super-additively. Their joint effect exceeds the sum of independent effects by up to +1.180 (2x2 factorial, 6 datasets). Bootstrap 95% CIs are strictly positive on 4/6 datasets, with strongest evidence on wide Type-II matrices. More critically, a schema applied to a mismatched format can trigger catastrophic mismatch. Fact coverage falls below the unconstrained baseline on 4/6 datasets through entity inflation or extraction refusal. We call this observed pattern format-constraint coupling. Probing and token ablation support a surface-form anchoring explanation centred on column-name references. Controlled variants across format-schema pairings, GraphRAG hosts, and LLM families show the same direction within the measured scope; one LLM family shows only partial activation. The observation also has a diagnostic consequence. Three standard retrieval modes largely mask construction quality (delta = 1pp), whereas direct graph access exposes gaps up to +47.6pp (p < 0.0001). To support fidelity-aware evaluation, we release CSVFidelity-Bench. It contains 15 datasets, 11 Type-II matrices, 4 Type-III tables, and 1,892 Gold Standard facts across 6 domains.
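The super-additivity claim can be read as a positive interaction contrast in a 2x2 factorial design: the joint effect of both factors minus the sum of the two single-factor effects. The sketch below shows that arithmetic only; the scores are hypothetical and are not taken from the paper.

```python
def interaction_contrast(y00, y10, y01, y11):
    """Joint effect minus the sum of single-factor effects in a 2x2 design.

    y00: neither factor, y10: format only, y01: schema only, y11: both.
    """
    effect_format = y10 - y00
    effect_schema = y01 - y00
    joint_effect = y11 - y00
    return joint_effect - (effect_format + effect_schema)

# Hypothetical fidelity scores, chosen only for illustration.
scores = {"y00": 0.20, "y10": 0.30, "y01": 0.25, "y11": 0.90}
delta = interaction_contrast(**scores)
is_super_additive = delta > 0  # positive contrast means the factors amplify each other
```

A contrast of zero would mean the two factors act independently; a negative value would indicate that they interfere.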

[AI-92] ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data

Link: https://arxiv.org/abs/2605.21963
Authors: Jiangyuan Wang,Xuyong Chen,Junwei He,Xu Xu,Shasha Xie,Fuman Han
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 14 pages, 2 figures, 6 tables

Abstract:Long-horizon clinical simulation – predicting how a patient’s physiology evolves over years under specified interventions – is central to chronic-disease care, yet existing electronic health record (EHR) models are predominantly discriminative, and general-purpose large language models drift under repeated interventions. We propose the ChronoMedicalWorld Model (CMWM), an action-conditioned latent world-model framework for learning patient trajectories from longitudinal care data. CMWM couples a joint-embedding state encoder with a wide action encoder that admits both structured intervention indicators and free-text communication embeddings, and trains a recurrent latent transition module under a six-term objective: next-observation supervision, next-latent prediction, SIGReg latent regularisation, and three physiology-aware shape priors (slope, continuity, large-jump penalty). A closed-loop rollout-prefix protocol matches training to deployment, so the model is optimised against the same multi-step error it exhibits at inference. As a concrete case study, we instantiate CMWM for annual estimated glomerular filtration rate (eGFR) trajectory forecasting in chronic kidney disease (CKD). On a 2,232-patient nephrology cohort, the CKD instantiation achieves a dynamic-50% history rollout test mean absolute error (MAE) of 7.384 and root-mean-square error (RMSE) of 10.256, against 7.964 and 11.069 for a tuned GPT-5.5 structured-prompting baseline (-7.28% MAE, -7.35% RMSE), with the gain dominated by the dialogue portion of patient–health-coach communication. The framework is not CKD-specific: its architecture, loss design, and training protocol apply to any chronic condition that can be cast as periodic clinical state interleaved with structured and conversational interventions.

[AI-93] CausalGuard: Conformal Inference under Graph Uncertainty

Link: https://arxiv.org/abs/2605.21928
Authors: Vikash Singh,Weicong Chen,Debargha Ganguly,Yanyan Zhang,Nengbo Wang,Sreehari Sankar,Mohsen Hariri,Alexander Nemecek,Chaoda Song,Shouren Wang,Biyao Zhang,Van Yang,Erman Ayday,Jing Ma,Vipin Chaudhary
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Notes:

Abstract:Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
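The distribution-free, finite-sample coverage mentioned here is the kind of guarantee that standard split conformal prediction provides. The sketch below shows plain split conformal with absolute-residual scores, which is a substitute for illustration and not CausalGuard's structure-weighted composite score. The calibration residuals are simulated and all names are hypothetical.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction."""
    return (point_prediction - q, point_prediction + q)

random.seed(0)
# Simulated calibration residuals |y - y_hat|; a real pipeline would use held-out data.
calibration_scores = [abs(random.gauss(0.0, 1.0)) for _ in range(199)]
q = conformal_quantile(calibration_scores, alpha=0.1)
lower, upper = prediction_interval(2.0, q)
```

The `(n + 1)` correction is what turns an empirical quantile into a finite-sample guarantee under exchangeability; interval width then depends entirely on how well the score function fits the data.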

[AI-94] Engineering Hybrid Physics-Informed Neural Networks for Next-Generation Electricity Systems: A State-of-the-Art Review

Link: https://arxiv.org/abs/2605.21903
Authors: Joseph Nyangon
Institutions: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Notes: 59 pages, 6 Figures

Abstract:The integration of machine learning with domain-specific physics is transforming the design, monitoring, and control of electricity systems, where data scarcity, limited interpretability, and the need to enforce physical laws constrain purely data-driven models. Physics-informed machine learning (PIML) addresses these limitations by embedding governing equations directly into the learning process, yielding accurate, efficient, and scalable solutions for Industry 4.0 applications. This article reviews hybrid PIML architectures for electricity systems, including physics-informed neural networks (PINNs), Deep Operator Networks (DeepONets), Fourier Neural Operators, Extreme Learning Machine-enhanced PINNs, graph-based PINNs (PIGNNs), and domain-decomposition PINNs. Each approach is examined through case studies spanning field analysis, fault detection, digital twins, surrogate modeling, and control optimization. The review shows that embedding Maxwell’s equations and other first-principles constraints substantially improves predictive accuracy under sparse and noisy data, reduces simulation time by orders of magnitude relative to finite element methods, and enhances generalization across operating regimes. Hybrid frameworks consistently outperform purely data-driven baselines on parameter sensitivity, dynamic behavior, and robustness, while supporting real-time digital-twin calibration and uncertainty quantification. Persistent challenges include training instability for stiff multi-scale problems, computational cost of high-fidelity models, and the absence of standardized benchmarks. The findings demonstrate that PIML enables a paradigm shift from black-box data-driven methods to transparent, physics-informed strategies, positioning the field for sustained innovation in resilient and intelligent electricity systems.

[AI-95] EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

Link: https://arxiv.org/abs/2605.21862
Authors: Chushan Zhang,Ruihan Lu,Jinguang Tong,Xuesong Li,Yikai Wang,Hongdong Li
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:

Abstract:Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

[AI-96] The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

Link: https://arxiv.org/abs/2605.21856
Authors: Yifan Lan,Yuanpu Cao,Hanyu Wang,Lu Lin,Jinghui Chen
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model’s generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model’s intrinsic problem-solving capabilities, ZCP compares the model’s zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at this https URL.
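The comparison between the original benchmark and its perturbed counterpart can be sketched as a gap in zero-CoT accuracy. The abstract does not define Contamination Confidence, so the two-proportion z statistic below is a stand-in chosen for illustration, not the paper's metric. The correctness flags are hypothetical.

```python
import math

def accuracy(flags):
    """Share of correct answers in a list of 0/1 correctness flags."""
    return sum(flags) / len(flags)

def zero_cot_gap(correct_original, correct_perturbed):
    """Return the accuracy gap and a pooled two-proportion z statistic."""
    n1, n2 = len(correct_original), len(correct_perturbed)
    p1, p2 = accuracy(correct_original), accuracy(correct_perturbed)
    pooled = (sum(correct_original) + sum(correct_perturbed)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se if se > 0 else 0.0
    return p1 - p2, z

# Hypothetical results with the chain of thought truncated:
# 80/100 correct on the original items, 50/100 on the perturbed ones.
original = [1] * 80 + [0] * 20
perturbed = [1] * 50 + [0] * 50
gap, z = zero_cot_gap(original, perturbed)
```

A large positive gap suggests the model answers the original items from memory rather than by solving them, since the perturbed items require the same reasoning.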

[AI-97] OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

Link: https://arxiv.org/abs/2605.21851
Authors: Yu Li,Rui Miao,Tian Lan,Zhengling Qi
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model’s belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a running estimate of the success probability at every position, together with a token-level advantage that requires no learned value network and no additional rollouts. A first-order analysis factorizes the advantage into the per-token discrimination signal used by distillation methods modulated by a state weight that concentrates credit on genuinely pivotal tokens, with a directional variance-reduction guarantee. The framework admits two estimators differing only in which model scores the evidence: a self-oracle that reuses the student and recovers the on-policy distillation reward as a strict special case, and a teacher-oracle that delegates scoring to a stronger frozen model. On two base LLMs across seven mathematics, science, and code reasoning benchmarks, OPPO improves over GRPO, DAPO, and SDPO by up to +6.0 points on AMC’23 and +5.2 points on AIME’24, with gains that widen monotonically with response length.
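The idea of accumulating a per-token signal into a running success estimate follows from Bayes' rule in odds form: posterior odds equal prior odds times the likelihood ratio, so log-odds add along the trajectory. The sketch below illustrates only that recursion. The abstract does not give OPPO's exact estimator, so the advantage shown here (the change in estimated success probability) is an assumption, and the log-ratio values are hypothetical.

```python
import math

def sigmoid(x):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def running_success_probability(prior, log_likelihood_ratios):
    """Accumulate per-token log likelihood ratios into a running posterior.

    Returns one value per prefix, starting with the prior before any token.
    """
    log_odds = math.log(prior / (1.0 - prior))
    values = [prior]
    for llr in log_likelihood_ratios:
        log_odds += llr  # Bayesian update in log-odds space
        values.append(sigmoid(log_odds))
    return values

# Hypothetical per-token signals: log p(token | success) - log p(token | failure).
llrs = [0.0, 1.2, -0.3, 0.0, 2.0]
values = running_success_probability(prior=0.5, log_likelihood_ratios=llrs)
# Assumed token-level credit: how much each token moved the success estimate.
advantages = [after - before for before, after in zip(values, values[1:])]
```

Tokens with a zero log ratio receive zero credit in this sketch, while tokens that shift the evidence receive credit in proportion to how far they move the estimate, which matches the intuition of concentrating credit on pivotal tokens.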

[AI-98] FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

Link: https://arxiv.org/abs/2605.21832
Authors: Xinhang Yuan,Zexi Huang,Anjia Cao,Xudong Lu,Zikai Wang,Penghao Zhou,Chang Liu,Wentao Guo,Qinglei Wang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state and ID-centric ranking models fail to generalize. We present FLUID, the first framework to fully retire the candidate-side item ID from a production-scale livestreaming ranker. FLUID couples a cross-domain multimodal encoder, jointly trained on short videos and livestreams to produce discrete hierarchical codes (LUCID), with a late-fusion, ID-free design that injects slice-level and room-level LUCID as independent tokens, stabilized by a staged warmup under online incremental training. Deployed on our industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally, FLUID delivers significant online gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.

[AI-99] Implicit Safety Alignment from Crowd Preferences ICML2026

Link: https://arxiv.org/abs/2605.21822
Authors: Qian Lin,Daniel S. Brown
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Accepted to ICML 2026. Conference paper

Abstract:Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination-optimizing a preference-learned reward model together with downstream task rewards-has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.

[AI-100] What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

Link: https://arxiv.org/abs/2605.21778
Authors: Meryl Ye,Lujain Ibrahim,Jessica Y. Bo,Myra Cheng,Ida Mattsson,Daniel Vennemeyer,Robert Kraut,Steve Rathje
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:AI sycophancy has become a prominent concern in large language model (LLM) research. Yet the term lacks a consistent definition and has been applied to behaviors ranging from agreeing with a user’s false claim to excessively praising the user to withholding corrective feedback. When researchers, companies, and policymakers use the same term to describe different behaviors, evaluation results become difficult to compare, mitigation strategies fail to transfer, and systems that are resistant to one form of sycophancy continue exhibiting other forms. To address this, we make two contributions. First, we reviewed 70 papers on AI sycophancy to develop a taxonomy of how the behavior has been defined and measured. The taxonomy distinguishes (1) whether a model is sycophantic toward a user’s positions and beliefs, or toward the user’s broader personal traits and emotions, and (2) whether this occurs through explicit, direct language or more implicit, subtle behaviors such as framing, omission, or tone. Mapping existing literature to our taxonomy reveals that current research has focused on overt forms of sycophancy toward users’ beliefs, leaving more subtle and person-directed behaviors relatively understudied. Second, we surveyed 106 experts in AI sycophancy and related fields to examine whether researchers agree on which model behaviors are sycophantic. While experts are nearly unanimous in believing that sycophancy is a significant problem in current AI systems (94.3% agree), they disagree substantially on which specific behaviors qualify. Together, these findings demonstrate that AI sycophancy is a broad family of behaviors with different measurement challenges, intervention requirements, and governance implications. Our taxonomy provides a shared vocabulary for understanding and addressing these behaviors.

[AI-101] A Causal Argumentation Method for Explainability of Machine Learning Models

Link: https://arxiv.org/abs/2605.21758
Authors: Henry Salgado,Meagan R. Kendall,Martine Ceberio
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: To be published in The 4th World Conference on eXplainable Artificial Intelligence

Abstract:Explainable AI (XAI) methods identify which features are relevant to a model’s predictions but often fail to clarify why certain decisions are made. In this work, we present a novel method that integrates causality with argument-based reasoning to explain why models may be making predictions. Our approach first identifies causal relationships among variables using causal discovery methods and then translates these into a Bipolar Argumentation Framework (BAF) to represent supportive and opposing interactions among features. By using semi-stable semantics, we find extensions of features that explain why certain outcomes may have been chosen. We demonstrate our method on two benchmark datasets and compare its results against standard post-hoc explainability approaches.

[AI-102] PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation

Link: https://arxiv.org/abs/2605.21752
Authors: Blake Gella,Wei Wu,Yuhao Yin,Zexi Huang,Zikai Wang,Emily Liu,Junlin Zhang,Wentao Guo,Qinglei Wang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Recommender systems trained on user interaction data are susceptible to behavioral intensity imbalance–a systematic distortion arising from heterogeneous engagement patterns across users. This imbalance skews feedback signals such that observed interactions no longer faithfully reflect true preferences, causing models to disproportionately amplify signals from highly active users while underrepresenting others, which ultimately degrades recommendation quality and robustness at scale. To address this issue, we propose a nonparametric contrastive percentile approximation framework, PEARL, that models relative preference signals instead of absolute engagement magnitudes. Building upon relative advantage debiasing, PEARL leverages real contrastive interaction samples to approximate percentile relationships directly, without relying on auxiliary distribution estimation models. We provide theoretical justification demonstrating that such pairwise comparisons yield unbiased estimates of percentile-based preference signals. For broader applicability, we introduce a prediction-based bootstrapping mechanism for percentile smoothing to handle sparse and discrete feedback, alongside a generalized value-weighted formulation and a co-training strategy to enhance both modeling flexibility and representation learning. Extensive offline experiments demonstrate that PEARL effectively mitigates behavioral bias and consistently improves recommendation performance across multiple ranking targets. Deployed in a production livestream platform with a combined user base of billions, online A/B testing confirms substantial real-world gains: +2.10% Watch Duration, +0.80% Consumption Amount, +1.49% Interaction Rate, and -6.91% Report Rate.
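The core idea of modeling relative rather than absolute engagement can be illustrated with a pairwise percentile estimate: the share of a user's other interactions that a given interaction beats. The sketch below is a generic empirical-CDF estimator and is not PEARL's training objective. The mid-rank treatment of ties and the simulated watch times are assumptions.

```python
import random

def pairwise_percentile(value, reference_samples):
    """Estimate the percentile of `value` as the share of reference samples it beats.

    Ties count as half a win (mid-rank convention, assumed here for discrete feedback).
    """
    wins = 0.0
    for ref in reference_samples:
        if value > ref:
            wins += 1.0
        elif value == ref:
            wins += 0.5
    return wins / len(reference_samples)

random.seed(0)
# Simulated watch times in seconds: a heavy user (mean 600) and a light user (mean 60).
heavy_user = [random.expovariate(1 / 600.0) for _ in range(1000)]
light_user = [random.expovariate(1 / 60.0) for _ in range(1000)]
# The same raw watch time carries a different relative meaning for each user.
p_heavy = pairwise_percentile(120.0, heavy_user)
p_light = pairwise_percentile(120.0, light_user)
```

Each single comparison is an unbiased estimate of the probability that a random reference interaction falls below the value, so averaging comparisons estimates the percentile without fitting a separate distribution model.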

[AI-103] Who Uses AI? Platforms Workforce and AI Exposure

Link: https://arxiv.org/abs/2605.21743
Authors: Michelle Yin,Burhan Ogut
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Notes:

Abstract:A growing literature uses artificial intelligence platform conversation logs to measure occupation exposure. We show that these scores partly measure platform user base rather than the workforce. Holding outcome, sample, controls, and estimator fixed while varying only the platform input changes the post-ChatGPT employment coefficient by a factor of 1.9, and within-vendor consumer-versus-enterprise channels produce estimates that disagree in sign. Reweighting to Bureau of Labor Statistics workforce shares attenuates estimates by 42 to 93 percent. We formalize the non-classical measurement error and derive probability limits and partial-identification bounds for employment elasticities. The bias understates substitution more than augmentation.

[AI-104] SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

Link: https://arxiv.org/abs/2605.21740
Authors: Kevin Han,Renfei Zhang,Kathy Wei,Hamed Mahdavi,Niloofar Mireshghallah,Amir Farimani
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require having strong chemical and biological reasoning and 3D intuition, understanding specialized tool use, and displaying planning expertise over a limited number of oracle calls. We benchmark 7 frontier open and closed source LLMs and find even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at this http URL.

[AI-105] AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

Link: https://arxiv.org/abs/2605.21739
Authors: Kate M. Lubrano,Faisal Sayed,Ankita Rathod,Akshansh,Craver Corbyn Thomas-Smith,Mark E. Whiting,Karina Nguyen
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others’ emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles in everyday life. Existing EI benchmarks rely on synthetic prompts, single-turn cases, or third-party annotation. These approaches do not directly measure how models infer and respond to a participant’s emotional state over the course of a real conversation. We introduce AttuneBench, a benchmark grounded in 200 genuine multi-turn human-model conversations in which participants conversed with anonymized LLMs and provided turn-by-turn annotations of their emotional state, the model’s behavior, and their preferred responses. Across 11 evaluated models, we find that model rankings on emotion recognition, behavioral classification, preference prediction, and judged response quality are largely independent, indicating that emotionally intelligent behavior decomposes into separable capabilities. Preference alignment and response-quality judgments are substantially more model-discriminating than emotion-label accuracy. These results indicate that emotionally intelligent behavior requires predicting what kind of response a specific user wants in context, a distinction that aggregate scoring can obscure and that single-turn or synthetic formats cannot directly capture across turns. AttuneBench provides a framework for assessing each of these capabilities and for diagnosing model-specific strengths and failure modes in emotionally salient conversation.

[AI-106] BP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

Link: https://arxiv.org/abs/2605.21724
Authors: Anton Lyubinin
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Hyper-Connections (HC) improve residual networks by introducing learnable mixing across multiple residual streams, but unconstrained mixing leads to training instability. Manifold-Constrained Hyper-Connections (mHC) address this by enforcing approximate double stochasticity via Sinkhorn normalization, while mHC-lite ensures exact constraints through convex combinations of permutation matrices at the cost of factorial complexity. KromHC reduces this cost using Kronecker-product parameterizations, but restricts the mixing matrices to a structured submanifold of the Birkhoff polytope. We propose Transportation Birkhoff Polytope (TBP) parameterizations and their Recursive variants (RTBP), which construct exactly doubly stochastic mixing matrices with (n-1)^2 degrees of freedom. Our approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope. Empirical results on language model pre-training demonstrate competitive performance with improved stability and scalability.
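The figure of (n-1)^2 degrees of freedom matches the dimension of the Birkhoff polytope, which can be checked by counting constraints. The short derivation below is a standard argument and is not taken from the paper; the symbol M for the mixing matrix is introduced here for illustration.

```latex
% M is an n x n mixing matrix with n^2 free entries.
% Double stochasticity imposes 2n linear equalities:
\sum_{j=1}^{n} M_{ij} = 1 \quad (i = 1,\dots,n), \qquad
\sum_{i=1}^{n} M_{ij} = 1 \quad (j = 1,\dots,n).
% One equality is redundant, because both families sum to the same total:
\sum_{i=1}^{n}\sum_{j=1}^{n} M_{ij} = n .
% Hence only 2n - 1 constraints are independent, and the dimension is
n^2 - (2n - 1) = (n-1)^2 .
% The inequalities M_{ij} \ge 0 bound the set but do not lower its dimension.
```

For n = 4 residual streams this gives 9 free parameters out of 16 entries, so a parameterization with exactly (n-1)^2 degrees of freedom has room to reach every point of the polytope.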

[AI-107] Latent-space Attacks for Refusal Evasion in Language Models

Link: https://arxiv.org/abs/2605.21706
Authors: Giorgio Piras,Raffaele Mura,Fabio Brau,Maura Pintor,Luca Oneto,Fabio Roli,Battista Biggio
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model’s residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work’s difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.

[AI-108] PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents

Link: https://arxiv.org/abs/2605.21694
Authors: Sidnei Barbieri, Ágney Lopes Roth Ferraz, Lourenço Alves Pereira Júnior
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes:

Abstract:Connecting large language models (LLMs) to defensive enforcement requires more than asking a model whether an attack is happening. A defender must decide which model outputs may change the system state, which outputs must be rejected, and how failures should be recorded. We present PocketAgents, a manifest-driven library of autonomous defense agents. Each agent is installed as three data files: a manifest, a prompt, and a runtime context. The shared runtime gives the agent bounded telemetry access and accepts only typed reports whose requested action appears in the manifest. We implemented PocketAgents on top of a cyber arena (Perry), a cyber-deception testbed, and evaluated two agents, Command and Control and Exfiltration, in 18 closed-loop trials of a DarkSide-inspired attack on a small enterprise topology. Thirteen trials produced validated network-block actions and contained the attack; four failed schema validation; one produced a valid no-action decision. The experiments show that a typed boundary makes LLM-driven defense measurable, extensible, and attributable.
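The "typed boundary" described in the abstract can be pictured as a validator that sits between the model output and the enforcement layer. The following is a minimal sketch under assumed names: the `Report` fields, the manifest layout, and the action names `block_ip` and `no_action` are hypothetical and are not taken from the PocketAgents library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Typed report an agent must emit (field names are hypothetical)."""
    agent: str
    action: str
    target: str


# Hypothetical manifest: the only actions this agent may request.
MANIFEST = {"agent": "exfiltration", "allowed_actions": {"block_ip", "no_action"}}

REQUIRED_FIELDS = {"agent": str, "action": str, "target": str}


def validate(raw, manifest):
    """Return (Report, None) when accepted, otherwise (None, reason)."""
    if not isinstance(raw, dict):
        return None, "schema: report is not an object"
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(raw.get(name), expected):
            return None, f"schema: field '{name}' missing or not {expected.__name__}"
    if set(raw) - set(REQUIRED_FIELDS):
        return None, "schema: unexpected fields"
    if raw["agent"] != manifest["agent"]:
        return None, "manifest: report comes from a different agent"
    if raw["action"] not in manifest["allowed_actions"]:
        return None, "manifest: requested action is not allowed"
    return Report(**raw), None
```

Returning a rejection reason instead of raising keeps every failure recordable, which matches the abstract's emphasis on making outcomes measurable and attributable.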

[AI-109] Investigating Concept Alignment Using Implausible Category Members

Link: https://arxiv.org/abs/2605.21683
Authors: Sunayana Rane, Brenden M. Lake, Thomas L. Griffiths
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Developing AI systems with a human-like understanding of everyday concepts is a key step towards developing safe, reliable systems whose behavior makes sense to humans. When probing concept understanding, asking questions about plausible category members (e.g., “Is a car a vehicle?”) is likely to recall patterns in the model’s vast training data. We pursue an alternative strategy, characterizing the boundaries of conceptual categories by asking about implausible category members (e.g., “Is an olive a vehicle?”) to probe the kind of concept-level knowledge we take for granted in fellow humans. We characterize concept boundaries for a set of fundamental concepts by studying AI systems’ assignments of objects to superordinate categories from a classic psychological study by Rosch and Mervis, as well as their assignments of the same objects to mismatched superordinate categories. We compare these assignments to those made by human participants on the full range of within-category and cross-category assignment tasks. Our results reveal a range of concepts for which models differ in meaningful and surprising ways from humans, including treating “words” as belonging to categories like “vehicles” and “clothing,” identifying several “vegetable” category members as “fruit,” and assigning exemplars from non-weapon categories to the “weapons” category. We also demonstrate how these instances of concept misalignment translate into problematic downstream behavior with implications for AI safety.

[AI-110] AOP-Wiki EMOD 3.0: Data Model Expansions and Content Evaluation Framework for Using Agentic AI to Improve Integration between AOPs and New Approach Methodologies (NAMs)

Link: https://arxiv.org/abs/2605.21645
Authors: Virginia K. Hench, J. Harry Caufield, Sierra A.T. Moxon, Jason M. O’Brien, Stephen W. Edwards
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Notes: 7 Figures and 3 Supplemental Figures

Abstract:Adverse Outcome Pathways (AOPs) are logic models that causally link biological mechanisms that can be measured in a lab to adverse outcomes relevant to chemical regulatory endpoints. AOPs contextualize new approach methodologies (NAMs), in vitro and in silico methods used as alternatives to animal testing, and the sequential events in an AOP serve as multi-scale models spanning biological scales. The AOP-Wiki serves as the global repository for AOPs. While the AOP-Wiki has played a central role in AOP expansion over the past decade, constraints within the current data model and application infrastructure limit the AOP-Wiki from supporting continued AOP growth and evolution. Yet, the transformative power of agentic AI has re-invigorated AOP-Wiki data modernization efforts at a time when core AOP principles can be harnessed to inform use of AI for aggregating and structuring AOP-relevant information. Seizing upon this momentum, we present AOP-Wiki EMOD 3.0, the third in a series of evidence model prototypes, which concretely demonstrates data model expansions and our vision for how the AOP-Wiki might be transformed to better serve regulatory science and emergent use of AOPs in biomedical and One Health contexts. We aim to lay a foundation to support computationally-generated AOPs and quantitative AOPs (qAOPs) by focussing on solutions for AOP-Wiki internal quality improvement, evidence structuring to enhance AOP FAIRness and AI-readiness, and improved integration between the AOP framework and NAMs to better serve next generation risk assessment.

[AI-111] MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

Link: https://arxiv.org/abs/2605.21630
Authors: Haiyang Shen, Taian Guo, Xuanzhong Chen, Mugeng Liu, Weichen Bi, Wenchun Jing, Sixiong Xie, Zhuofan Shi, Yudong Han, Chongyang Pan, Siqi Zhong, Jinsheng Huang, Ming Zhang, Yun Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Work in Progress. Comments: 27 pages, 4 figures, preprint

Abstract:Although LLMs have made substantial progress in reasoning, systematically producing frontier-level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem difficulty, which can result in narrow diversity and unstable difficulty control. In this work, we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge-reasoning transformations, which we term thought modes. Building on this perspective, we propose MindLoom, a framework for synthesizing frontier-level reasoning data through compositional thought mode engineering. Given a collection of hard problems with verified solutions, MindLoom first decomposes those solutions into thought mode chains that reveal each problem’s construction logic. It then trains a retrieval model that matches problem states to compatible thought modes, providing guidance on which reasoning challenges to introduce during synthesis. New problems are composed by iteratively applying retrieved thought modes to seed questions, with distribution-aligned sampling to encourage diverse reasoning coverage. Finally, a rollout-based judging stage labels generated questions by difficulty and supplies judged-correct responses for supervised fine-tuning. We evaluate MindLoom on nine benchmarks covering five STEM disciplines and four mathematical reasoning tasks across multiple model families and sizes. Models fine-tuned on MindLoom-generated data achieve favorable performance over base models, distillation, and external-data baselines across the reported benchmarks. Ablation studies indicate the contribution of each component, and further analysis suggests that MindLoom covers a broad range of reasoning patterns while maintaining useful difficulty control. We have open-sourced our implementation at this https URL.

[AI-112] The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

Link: https://arxiv.org/abs/2605.21623
Authors: Itamar Trainin, Renana Keydar, Amit Pinchevski
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Researchers in Holocaust studies have often distinguished between two styles of oral survivor testimony: the USC Shoah Foundation’s interviews tend to follow a structured, interviewer-guided format, whereas the Yale Fortunoff Video Archive generally favors a more free-form, open-ended style. This distinction has influenced both scholarly research and the development of later archives. In this study, we critically examine that claim by conducting a large-scale computational analysis of more than 1,600 testimonies from both collections. Leveraging discourse segmentation, topic modeling, and large language model (LLM) based analysis, we quantify the “structuredness” level of testimonies through topic coherence, interviewer-survivor dynamics, and the distribution of question types. Our results generally corroborate the structural differences identified in earlier research, while also revealing significant overlaps between the collections, both within individual interviews and across common narrative patterns. This complicates the simple “structured vs. free-form” dichotomy often applied to these oral histories. Beyond revisiting a foundational claim in Holocaust studies, our work provides a scalable, replicable framework for comparative corpus analysis. As a proof of concept, it suggests broader applications for digital oral history, narrative analysis, and the design of citizen-science annotation platforms.

[AI-113] TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

Link: https://arxiv.org/abs/2605.21622
Authors: Isabella A. Stewart, Hongrui Chen, Faez Ahmed
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Accepted for publication in the Proceedings of the ASME 2026 International Design Engineering Technical Conferences (IDETC2026)

Abstract:Topology optimization can generate efficient structures, but designers often must manually translate qualitative intent, such as desired visual style, product experience, or manufacturability, into solver settings that are not directly tied to those preferences. We present TO-Agents, a multi-agent AI framework that connects natural-language design intent with iterative topology optimization. The framework converts a human-provided problem description into validated solver inputs, runs a topology optimization solver, renders the resulting 3D topology, and uses multi-view vision-language reasoning with an independent judge agent to critique each result and revise solver parameters. We evaluate the framework on two long-horizon design tasks: a cantilever beam benchmark and a phone-stand product design. In both tasks, the designer specifies an aesthetic preference for hierarchically branched structures inspired by natural tree morphologies, and the system performs four revision cycles across ten independent replicates. TO-Agents produces at least one preference-aligned design in 60% of trials for each case study, corresponding to up to 6x more successful trials than an ablated pipeline without visual or historical feedback. Judge scores and human evaluations show that the pipeline can identify effective parameter levers, recover from poor revisions, and expand design exploration. A manufacturing agent further post-processes top-ranked designs for additive manufacturing, enabling end-to-end intent-to-prototype design. We also identify failure modes, including overshooting, selective memory, misplaced tools, and incorrect parameter reasoning. These results suggest that agentic topology optimization can shift designers from low-level parameter tuning toward higher-level specification of form and function, while highlighting safeguards needed for reliable autonomous engineering design.

[AI-114] When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

Link: https://arxiv.org/abs/2605.21606
Authors: Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Pre-print. Code is available at this https URL

Abstract:On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
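The core change the abstract describes, weighting the per-token distillation loss by position instead of uniformly, can be sketched as follows. This is an illustration under stated assumptions, not the paper's objective: the linear ramp in `position_weights`, its endpoints, and the clipping value are hypothetical choices, and the abstract only says the weight is increasing and the forward KL is clipped.

```python
import math


def forward_kl(p_teacher, q_student, eps=1e-12):
    """KL(teacher || student) for one token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(p_teacher, q_student)
        if p > 0.0
    )


def position_weights(length, w_min=0.5, w_max=1.5):
    """Hypothetical linearly increasing weights over token positions."""
    if length == 1:
        return [1.0]
    step = (w_max - w_min) / (length - 1)
    return [w_min + step * t for t in range(length)]


def position_weighted_loss(teacher_dists, student_dists, clip=10.0):
    """Weighted mean of clipped per-token forward KL along one rollout."""
    weights = position_weights(len(teacher_dists))
    kls = [
        min(forward_kl(p, q), clip)  # clipping value is an assumption
        for p, q in zip(teacher_dists, student_dists)
    ]
    return sum(w * k for w, k in zip(weights, kls)) / sum(weights)
```

With this schedule, the same teacher-student disagreement contributes more to the loss late in the sequence than early, which is the direction the diagnostic in the abstract motivates.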

[AI-115] Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Link: https://arxiv.org/abs/2605.21602
Authors: Dylan Feng, Pragya Srivastava, Cassidy Laidlaw
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes:

Abstract:Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem.
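The idea of pairing a guard model with an OOD detector can be illustrated with a Mahalanobis-distance check on embeddings. The sketch below makes several assumptions that are not in the abstract: it uses a diagonal covariance estimate to stay dependency-free, combines the two signals with a simple OR rule, and uses hypothetical thresholds and function names. The paper's actual combination rule and feature space are not described here.

```python
import math


def fit_diagonal_gaussian(embeddings, jitter=1e-6):
    """Per-dimension mean and variance of the training embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    var = [
        sum((e[j] - mean[j]) ** 2 for e in embeddings) / n + jitter
        for j in range(dim)
    ]
    return mean, var


def mahalanobis(x, mean, var):
    """Mahalanobis distance under the diagonal-covariance simplification."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))


def monitor_flags(guard_score, embedding, mean, var,
                  guard_threshold=0.5, ood_threshold=3.0):
    """Flag when the guard says unsafe OR the input looks out-of-distribution."""
    is_unsafe = guard_score >= guard_threshold
    is_ood = mahalanobis(embedding, mean, var) >= ood_threshold
    return is_unsafe or is_ood
```

The OR rule trades precision for recall: an input the guard confidently scores as safe is still escalated if it lies far from anything seen in training.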

[AI-116] RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

Link: https://arxiv.org/abs/2605.21545
Authors: Lukas Weidener, Marko Brkić, Mihailo Jovanović, Emre Ulgac, Aakaash Meduri
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: 34 pages, 4 figures, 12 tables (10 in main text, 2 in supplementary). Code and data: this https URL

Abstract:Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use), enabling tier-conditioned comparisons robust to subdomain confounding. A 15-prompt should-refuse positive-control module establishes per-model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann-Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic’s API stack predicting refusal at OR = 21.03 (95% CI: 14.58-30.34 prompt-clustered; 5.70-77.55 under model-clustered GEE). This effect is best read as access-path-level rather than model-weight-level: 99.8% of Anthropic’s strict refusals carry the same safety_policy adjudicated reason code, consistent with a small set of canonical refusal templates rather than case-by-case model reasoning. Strict refusal rate misranks safety calibration: Grok 4.20 achieves the highest tier discrimination (Youden’s J = 0.787) while ranking only seventh by overall refusal rate, and Claude Opus 4.7’s J dropped 65% from prior versions with no improvement in dual-use detection. Nine of 18 frontier models exhibit a hedge-but-help partial-compliance pattern at dual-use tier that binary refusal metrics cannot detect.
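Why a raw refusal rate can misrank models is easy to see with Youden's J, defined as sensitivity plus specificity minus one. The sketch below assumes, as a simplification not confirmed by the abstract, that a "positive" is a dual-use prompt and a "negative" is a benign prompt; the counts are invented for illustration and are not results from the paper.

```python
def youdens_j(true_pos, false_neg, true_neg, false_pos):
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity + specificity - 1.0


def refusal_rate(true_pos, false_neg, true_neg, false_pos):
    """Share of all prompts that were refused, regardless of tier."""
    total = true_pos + false_neg + true_neg + false_pos
    return (true_pos + false_pos) / total


# Hypothetical counts over 47 dual-use and 47 benign prompts.
# Model A refuses everything; model B refuses selectively.
model_a = dict(true_pos=47, false_neg=0, true_neg=0, false_pos=47)
model_b = dict(true_pos=40, false_neg=7, true_neg=45, false_pos=2)

j_a, rate_a = youdens_j(**model_a), refusal_rate(**model_a)
j_b, rate_b = youdens_j(**model_b), refusal_rate(**model_b)
```

In this toy case model A has the higher refusal rate but a J of zero, because it cannot tell the tiers apart, while model B refuses less often overall and discriminates far better.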

[AI-117] Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs

Link: https://arxiv.org/abs/2605.21541
Authors: Leitao Yuan, Qinghua Mao, Daizong Liu, Kun Wang, Wenjie Wang, Yan Teng, Jing Shao, Dongrui Liu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Abstract:Multimodal large language models (MLLMs) remain vulnerable to transfer-based targeted attacks, where perturbations optimized on open-source surrogate encoders can generalize to closed-source MLLMs. A key challenge for improving adversarial transferability is to effectively capture the intrinsic visual focus shared across different models, such that perturbations align with transferable semantic cues rather than surrogate-specific behaviors. However, existing methods suffer from spatial-domain feature redundancy and surrogate-specific gradient signals, thereby hindering cross-model transferability. In this paper, we propose FRA-Attack, which addresses both challenges from a unified frequency-domain regularization perspective. For feature alignment, a high-pass DCT objective on patch features suppresses redundant global structures and concentrates the loss on the high-frequency band that carries the MLLMs’ intrinsic visual focus. For gradient optimization, we introduce Frequency-domain Gradient Regularization (FGR), a model-agnostic low-pass regularizer that modulates the surrogate gradient using only the geometric frequency coordinate, i.e., no surrogate-derived statistic is involved, so that FGR is model-agnostic by construction, removing surrogate-specific high-frequency artifacts while preserving transferable low-frequency directions. Together, the two components form a unified frequency-domain treatment of transferability. Extensive experiments on 15 flagship MLLMs across 7 vendors show that FRA-Attack achieves superior cross-model transferability, particularly with state-of-the-art performance on GPT-5.4, Claude-Opus-4.6 and Gemini-3-flash.

[AI-118] A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

Link: https://arxiv.org/abs/2605.21528
Authors: Rui Huang, Lican Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Accurate and reproducible disease risk prediction remains challenging due to heterogeneous features, limited samples, and severe class imbalance. This study introduces yvsoucom-iterkit, a deterministic and log-driven automated machine learning framework that formulates pipeline optimization as a fully reproducible, configuration-level system. Each pipeline is encoded as a traceable log entity, enabling analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured and partially redundant search space, where performance is governed by a small subset of interacting components. Random Forest importance analysis identifies augmentation (0.454), model choice (0.198), and imbalance handling (0.101) as key drivers on Pima, while imbalance handling dominates Stroke (0.406). Component similarity analysis shows strong redundancy, with feature selection variants (biMax-biMean) exhibiting low RMS distance (0.0252), mixup closely matching no augmentation (0.0279), and TomekLinks aligning with no imbalance handling (0.0325), whereas Gaussian noise shows greater divergence from no augmentation (0.10). The framework achieves strong and stable performance using ensemble models (Weighted-F1 0.89, Macro-F1 0.88 on Pima; Weighted-F1 0.94 on Stroke), while Macro-F1 remains lower on Stroke (0.67) due to class imbalance. Cross-seed analysis reveals a performance-robustness trade-off, with ensembles showing lower variability (0.023-0.026) than SVM. These results indicate that effective AutoML optimization can focus on a reduced set of high-impact components.

[AI-119] Harnesses for Inference-Time Alignment over Execution Trajectories

Link: https://arxiv.org/abs/2605.21516
Authors: Boyuan Wang, Bochao Li, Minghan Wang, Yuxin Tao, Fang Kong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This decomposition allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.

[AI-120] Predicting Performance of Symbolic and Prompt Programs with Examples

Link: https://arxiv.org/abs/2605.21515
Authors: Chengqi Zheng, Keya Hu, Shuzhi Liu, Tao Wu, Kevin Ellis, Yewen Pu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:LLM prompting is widely used for naturally stated tasks, yet it is unreliable: it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g. Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable whose success probability is the program’s unknown performance. In this model, the predicted performance depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) is all or nothing, while prompt programs have a diffuse prior with many nearly-correct programs. This difference explains why a few passing tests can certify symbolic programs but not prompt programs. Building on this insight, we develop RAP (Retrieved Approximate Prior), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance. We show RAP achieves solid performance.
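The coin-flip model in the abstract amounts to a Bayesian update over the unknown pass probability. The sketch below uses a discrete prior for simplicity; the two priors are stylized stand-ins for the "all or nothing" and "diffuse" shapes the abstract describes, not the empirical priors compiled in the paper, and the function name is hypothetical.

```python
def posterior_mean(prior, passes, fails):
    """Posterior mean of the pass probability under a Bernoulli likelihood.

    prior maps a candidate pass probability theta to its prior mass.
    """
    posterior = {
        theta: mass * theta ** passes * (1.0 - theta) ** fails
        for theta, mass in prior.items()
    }
    evidence = sum(posterior.values())
    if evidence == 0.0:
        raise ValueError("observations have zero probability under this prior")
    return sum(theta * mass for theta, mass in posterior.items()) / evidence


# Stylized priors (assumptions for illustration only).
all_or_nothing = {0.0: 0.5, 1.0: 0.5}                # symbolic-program shape
diffuse = {i / 10: 1.0 / 11.0 for i in range(11)}    # prompt-program shape

# Same evidence for both: three passing tests, no failures.
symbolic_estimate = posterior_mean(all_or_nothing, passes=3, fails=0)
prompt_estimate = posterior_mean(diffuse, passes=3, fails=0)
```

Under the all-or-nothing prior a single passing test already rules out the broken hypothesis, whereas under the diffuse prior the same three passes leave substantial mass on nearly-correct programs, which is the asymmetry the abstract points to.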

[AI-121] Autonomous LLM Agents CTFs: A Second Look

Link: https://arxiv.org/abs/2605.21497
Authors: Youness Bouchari, Matteo Boffa, Marco Mellia, Idilio Drago, Thanh Minh Bui, Dario Rossi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes: Accepted at DeMeSSAI Workshop @ IEEE EuroSP 2026

Abstract:Large Language Model (LLM) agents are increasingly proposed to automate offensive security tasks, with recent studies reporting near human-level success rates in Capture-the-Flag (CTF) challenges. We revisit these results, providing a second look at these claims. We engineer different agent architectures of increasing complexity and modularity on 30 web-based CTF challenges spanning 14 vulnerability classes. We instantiate these agents with multiple LLM backbones, and compare them with claude-code, a general-purpose agent that automatically determines its internal architecture. Our evaluation yields three main findings. First, claude-code achieves performance comparable to the engineered architectures (19/30 solved tasks), suggesting that general-purpose agents are strong baselines for offensive security tasks. Second, both our architectures and claude-code struggle in the same challenge categories, revealing persistent barriers that keep current agents below human-level capability. Third, by leveraging our manually designed architectures we can systematically measure the impact of additional components, finding that structured orchestration of specialized roles outperforms monolithic designs, improving run-to-run consistency and reducing execution costs.

[AI-122] The Attribution Impossibility: No Feature Ranking Is Faithful Stable and Complete Under Collinearity

Link: https://arxiv.org/abs/2605.21492
Authors: Drake Caraker, Bryan Arnold, David Rhoads
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
Notes: 66 pages, 12 figures, 305 Lean 4 theorems. Code at this https URL

Abstract:No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine-verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist – faithful-complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) – and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1-rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto-optimal among unbiased aggregations, achieving the Cramer-Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics – a Z-test workflow and single-model screening tool – and has direct consequences for fairness auditing: SHAP-based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) – to our knowledge, the first formally verified impossibility in explainable AI.

[AI-123] High-speed Networking for Giga-Scale AI Factories

Link: https://arxiv.org/abs/2605.21187
Authors: Sajy Khashab, Albert Gran Alcoz, Alon Gal, Jacky Romano, Rani Abboud, Yonatan Piasetzky, Lior Maman, Amit Nishry, Barak Gafni, Omer Shabtai, Matty Kadosh, Dror Goldenberg, Gilad Shainer, Mark Silberstein
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes:

Abstract:As distributed model training scales to span hundreds of thousands of GPUs, scale-out networks face unprecedented performance and efficiency demands. NVIDIA Spectrum-X Ethernet has been designed from the ground up to achieve predictable and stable network performance with high utilization and low latency. This paper presents the Spectrum-X multiplane architecture, which replaces hierarchical depth with topological parallelism, and introduces hardware-accelerated load balancing in NICs and switches as the key architectural approach to provide fast reaction to highly dynamic network conditions at the microsecond timescales that AI training workloads demand. We describe the motivation, design principles, evaluation methodology and performance on state-of-the-art benchmarks, as well as the lessons we learned from deploying and debugging Spectrum-X networks in large-scale systems. Our evaluation highlights production-grade AI infrastructure performance across three core dimensions: 98% of the theoretical line rate with low jitter-free latency; strong cross-tenant isolation for concurrent workloads; robust, capacity-proportional bisection bandwidth and 7% latency increase for 10% fabric link failures; and rapid reaction to host and fabric link flaps during LLM training workloads.

[AI-124] Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

Link: https://arxiv.org/abs/2605.22795
Authors: Krishnakumar Balasubramanian
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Notes:

Abstract:We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based drifting velocity by a kernel density estimator (KDE)-gradient velocity, namely the difference of the kernel-smoothed data score and the kernel-smoothed model score. This velocity is a gradient field, addressing the non-conservatism issue identified for general displacement-based drifting fields. We prove continuous-time finite-particle convergence bounds for the conservative method on $\mathbb{R}^d$: a joint-entropy identity yields bounds for the empirical Stein drift, the smoothed Fisher discrepancy of the KDE, and the squared center velocity. The main finite-particle correction is a reciprocal-KDE self-interaction term, and we give deterministic and high-probability local-occupancy conditions under which this term is controlled. We keep the quadrature constants explicit and track their possible bandwidth dependence: the root residual-velocity rate $N^{-1/(d+4)}$ holds under an additional $h$-uniform quadrature regularity condition, while a more general growth condition yields the optimized root rate $N^{-(2-\beta)/(2(d+4-\beta))}$, where $0 \le \beta < 2$. We also analyze the non-conservative drifting method with Laplace kernel, corresponding to the original displacement-based velocity proposed in \cite{deng2026drifting}. For this method, a sharp companion kernel decomposes the velocity into a positive scalar preconditioning of a sharp-score mismatch plus a Laplace scale-mismatch residual, producing an analogous finite-particle rate with an unavoidable residual term. Finally, we explain how the continuous-time residual-velocity bounds translate into one-step generation guarantees through the explicit drift size $\eta$.
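The KDE-gradient velocity described in the abstract, the kernel-smoothed data score minus the kernel-smoothed model score, can be written down directly once a kernel is fixed. The sketch below assumes a Gaussian kernel with bandwidth `h`, which the abstract does not state for the conservative method, and the function names are hypothetical. For a Gaussian KDE over samples $x_i$, the score at $x$ is $\sum_i w_i (x_i - x) / h^2$ with normalized kernel weights $w_i$.

```python
import math


def kde_score(x, samples, h):
    """Gradient of the log Gaussian-KDE density at point x."""
    # Log-weights with the max subtracted, for numerical stability.
    log_w = [
        -sum((xj - sj) ** 2 for xj, sj in zip(x, s)) / (2.0 * h * h)
        for s in samples
    ]
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    z = sum(w)
    return [
        sum(wk * (s[j] - x[j]) for wk, s in zip(w, samples)) / (z * h * h)
        for j in range(len(x))
    ]


def conservative_velocity(x, data_samples, model_samples, h):
    """Smoothed data score minus smoothed model score at x."""
    data_score = kde_score(x, data_samples, h)
    model_score = kde_score(x, model_samples, h)
    return [d - m for d, m in zip(data_score, model_score)]
```

Because each term is the gradient of a log-density, their difference is the gradient of the log-ratio of the two smoothed densities, which is why a field of this form is conservative.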

[AI-125] Incentive-Aligned Vehicle-to-Vehicle Energy Trading via Nash-Integrated Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2605.22363
Authors: Yujin Lin, Yue Yang, Hao Wang
Affiliations: Unknown
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: The 24th IEEE International Conference on Industrial Informatics, 2026

Abstract: Vehicle-to-vehicle (V2V) energy trading enables decentralized peer-to-peer energy exchange among electric vehicles (EVs), reducing grid dependency while monetizing surplus capacity. However, coordinating self-interested EV agents with diverse charging needs and uncertain arrival-departure schedules remains challenging. Existing approaches either require centralized optimization with computational limitations or lack fairness guarantees. This paper integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient, yielding Nash-MADDPG, for incentive-aligned V2V energy trading. Nash bargaining determines efficient bilateral pricing, while Nash-guided price proximity rewards align agent learning toward bargaining-optimal strategies. Evaluation over 30-day continuous operation demonstrates improvements of 61.6% in social welfare and 62.9% in trading volume over Double Auction, while achieving superior fairness, such as a 40.1% improvement in Jain’s index. Testing across 6-100 agents over a 30-day horizon with continuous vehicle turnover confirms scalability across population size and empirically stable pricing near the Nash Bargaining benchmark.
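
The abstract reports fairness through Jain's index. As a reminder of what that metric measures, here is a minimal sketch of the standard formula J(x) = (sum x_i)^2 / (n * sum x_i^2). The function name and the sample utilities are illustrative and are not taken from the paper.

```python
def jains_index(values):
    """Jain's fairness index: 1/n when one agent gets everything, 1.0 when all are equal."""
    n = len(values)
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero value")
    return (total * total) / (n * sum_sq)


# Hypothetical per-agent utilities.
equal = jains_index([5.0, 5.0, 5.0, 5.0])
skewed = jains_index([10.0, 0.0, 0.0, 0.0])
```

With four agents, the equal split scores 1.0 and the fully skewed split scores 1/4.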

[AI-126] Atom-level Protein Representation Learning Improves Protein Structure Prediction

Link: https://arxiv.org/abs/2605.22133
Authors: Taewon Kim, Hyosoon Jang, Hyunjin Seo, Seonghwan Seo, Hyeongwoo Kim, Wonho Zhung, Mingyeong Shin, Wooyoun Kim, Sungsoo Ahn
Affiliations: Unknown
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in generative modeling show that pretrained representations can improve generation as conditioning features or alignment targets. Motivated by this, we study protein representations for predicting structures beyond conventional function annotation. We propose TriProRep, a structure-aware pretraining method that jointly models three aligned residue-level views: amino-acid identity, backbone geometry, and local full-atom geometry, discretely encoded via VQ-VAE tokenizers. By pretraining to recover original tokens from generator-corrupted views, TriProRep learns to distinguish plausible but incorrect cross-view augmentations from the original protein. We further introduce RepSP, a benchmark for evaluating protein representations in structure-predictive settings. RepSP tests three uses of representations: homodimer co-folding from apo-chain representations, residue-level prediction of homodimer-derived interaction properties, and representation-aligned monomer structure prediction. Across these tasks, TriProRep improves over sequence-only and prior structure-aware representation models, while maintaining competitive performance on conventional benchmarks.

[AI-127] Thermodynamic Irreversibility of Training Algorithms

Link: https://arxiv.org/abs/2605.21933
Authors: Liu Ziyin, Yuanjie Ren, Adam Levine, Isaac Chuang
Affiliations: Unknown
Subjects: Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint

Abstract: The training algorithms for AI systems all introduce far-from-equilibrium dynamical processes, and understanding the irreversibility of these algorithms is a fundamental step towards understanding the learning dynamics of modern AI systems. In this work, we establish a general framework for defining and analyzing the irreversibility of training algorithms. We show that four different ways to characterize the irreversibility of dynamical processes are equivalent to leading order in the step size $\eta$: numerical backward error $\phi_{\rm DE}$, time-renormalized correction $\phi_{\rm TR}$, microscopic time reversal asymmetry $\phi_{\rm TA}$, and the (regularized) stochastic-thermodynamic entropy production $\phi_{\rm ST}$. The irreversibility gives rise to a time-reversal-symmetry-breaking emergent force that generically breaks non-isometric continuous reparametrization symmetries, preserves orthogonal symmetries, and leads to a universal preference for those learning trajectories that minimize the entropy production rate.

[AI-128] Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging

Link: https://arxiv.org/abs/2605.21789
Authors: Aaron Wang, Zihan Zhao, Alan Xia, Chang Sun, Abhijith Gandrakota, Jennifer Ngadiuba, Richard Cavanaugh, Javier Duarte
Affiliations: Unknown
Subjects: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI)
Comments:

Abstract: Real-time jet tagging is critical for identifying short-lived particle decays in the high-throughput detectors of the Large Hadron Collider, where real-time trigger systems responsible for deciding which collision events to store impose strict latency and accuracy constraints. While transformer architectures achieve the highest jet tagging accuracy when compute is unconstrained, their quadratic self-attention cost makes inference prohibitive within trigger budgets. Existing efficient variants reduce the computational cost, but degrade classification performance. To address this limitation, we introduce the Patch Hierarchical Attention Transformer (PHAT-JeT), which combines two mechanisms: a physics-inspired geometric message-passing module that encodes local detector-plane structure, and a hierarchical patch-based attention scheme that computes exact attention within small particle groups while preserving global context through lightweight patch-token communication. Within a restricted budget, PHAT-JeT achieves state-of-the-art accuracy and background rejection among all resource-constrained jet tagging models on four benchmarks (hls4ml, JetClass, Top Tagging, and Quark–Gluon). Our code is available at this https URL.

[AI-129] Support-aware offline policy selection for advertising marketplaces

Link: https://arxiv.org/abs/2605.21736
Authors: Prashant Shekhar, Caroline Howard
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Logged advertising auctions make offline reserve-price evaluation attractive but risky. Replay tables can identify policies with large apparent yield gains, yet they can also hide weak threshold support, multiple-comparison effects, subgroup harm, and bidder-response uncertainty. Existing replay and off-policy evaluation methods estimate or rank policy values, but they do not directly answer the operational question of whether the available evidence is strong enough to justify validation. This paper develops a support-aware offline decision framework for reserve-policy selection. Rather than outputting a single point-estimate winner, the framework converts logged evidence into a conservative decision object consisting of certified policies, statistically dominated alternatives, and unresolved candidates requiring further validation. The main theoretical result gives a unified finite-catalog guarantee showing that, under simultaneous uncertainty control and conservative support gates, the framework preserves the best gate-passing policy while eliminating only policies with certified regret. Supporting results characterize support-localized replay generalization, establish information-theoretic threshold-resolution limits, and quantify when heterogeneous bidder response can overturn localized replay rankings. Experiments on iPinYou real-time-bidding logs show that the leading reserve rule achieves a 47.66% replay lift in season two, a 40.71% simultaneous lower-bound lift, and a 43.87% frozen out-of-time replay lift in season three. The framework reduces a 19-policy catalog to a two-policy validation shortlist while certifying non-harm across 44 advertiser, exchange, and region segments. The results support the central claim that offline reserve-policy evaluation should produce certified validation decisions rather than point-estimate rankings alone.

[AI-130] Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

Link: https://arxiv.org/abs/2605.21557
Authors: Jongchan Park
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. We challenge this view by observing that non-stationarity is not a fixed property of RL, but evolves throughout training: early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), which dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale batch size inversely to policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS seamlessly reconciles early-stage plasticity with late-stage stable convergence. Strikingly, contrary to conventional wisdom, our results reveal that the combination of larger networks and larger batch sizes achieves the best performance, a scaling behavior previously thought to be unattainable in RL, now unlocked through adaptive batch control.
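
The idea of scaling batch size inversely to policy volatility can be sketched as follows. The abstract does not give the exact definition of Behavioral Divergence or the scaling rule, so both functions below are assumptions made for illustration: divergence is taken as the fraction of probe states whose greedy action changed between two consecutive updates, and the batch size is min_batch divided by that fraction, clamped to a fixed range.

```python
def behavioral_divergence(prev_actions, new_actions):
    """Assumed definition: fraction of probe states whose greedy action changed."""
    if len(prev_actions) != len(new_actions) or not prev_actions:
        raise ValueError("action lists must be non-empty and of equal length")
    changed = sum(1 for a, b in zip(prev_actions, new_actions) if a != b)
    return changed / len(prev_actions)


def adaptive_batch_size(divergence, min_batch=256, max_batch=8192, eps=1e-3):
    """Hypothetical rule: batch size grows as the policy stabilizes."""
    raw = min_batch / max(divergence, eps)
    return int(min(max_batch, max(min_batch, raw)))


# Early training: half of the greedy actions flip between updates.
early_div = behavioral_divergence([0, 1, 2, 3], [1, 0, 2, 3])
early_batch = adaptive_batch_size(early_div)

# Late training: no greedy action changes, so the batch hits the upper clamp.
late_batch = adaptive_batch_size(0.0)
```

The clamp keeps the batch within a range the hardware can serve; the specific bounds here are placeholders.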

[AI-131] Local Covariate Selection for Average Causal Effect Estimation without Pretreatment and Causal Sufficiency Assumptions

Link: https://arxiv.org/abs/2605.21548
Authors: Zeyu Liu, Zheng Li, Feng Xie, Yan Zeng, Hao Zhang, Kun Zhang
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: We study the problem of selecting covariates for unbiased estimation of the total causal effect. Existing approaches typically rely on global causal structure learning over all variables, or on strong assumptions such as causal sufficiency, where observed variables share no latent confounders, or the pretreatment assumption, which limits covariates to those unaffected by the treatment or outcome. These requirements are often unrealistic in practice, and global learning becomes computationally prohibitive in high-dimensional settings. To address these challenges, we propose a novel local learning method for covariate selection in nonparametric causal effect estimation that avoids both the pretreatment and causal sufficiency assumptions. We first characterize a local boundary that contains at least one valid adjustment set whenever one exists for identifying the causal effect, and then develop local identification procedures to efficiently search within this boundary. We prove that the proposed method is sound and complete. Experiments on multiple synthetic datasets and two real-world datasets show that our approach achieves accurate causal effect estimation while substantially improving computational efficiency.

[AI-132] Protein Thoughts: Interpretable Reasoning with Tree of Thoughts and Embedding-Space Flow Matching for Protein-Protein Interaction Discovery

Link: https://arxiv.org/abs/2605.21522
Authors: Kingsley Yeon, Xuefeng Liu, Promit Ghosal
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: Protein-protein interactions (PPIs) govern nearly all cellular processes, yet computational methods for identifying binding partners typically produce ranked predictions without mechanistic justification. This creates a fundamental barrier to adoption because biologists cannot assess whether predictions reflect genuine biochemical insight or spurious correlations. We present Protein Thoughts, a framework that reformulates PPI discovery as an interpretable search problem with explicit reasoning. The system decomposes binding evidence into four biologically meaningful signals: sequence similarity reflecting evolutionary relationships, structural complementarity capturing geometric fit, interface balance, and chemical compatibility encoding residue-level interactions. Rather than collapsing these signals into an opaque score, we preserve their individual contributions through a transparent value function that enables both ranking and auditing. To navigate large candidate spaces efficiently, we introduce hypothesis-guided entropy-regularized Tree-of-Thoughts search. A fine-tuned language model generates search directives from embedding-derived features, classifying candidates as high-priority, exploratory, or skippable. These directives condition a Boltzmann policy that balances exploitation with entropy-driven exploration, while hypothesis-aware pruning prevents premature abandonment of promising candidates. For candidates exhibiting score disagreement, hypothesis-conditioned embedding-space flow matching transports protein embeddings toward the binder manifold. On the SHS148k benchmark, Protein Thoughts achieves mean best-binder rank of 11.2 versus 47.7 for an entropic tree search baseline, a 76% improvement, and for binding prediction the trained value function achieves $91.08 \pm 0.19$ Micro-F1, outperforming existing PPI methods on the same dataset.
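
A Boltzmann policy of the kind mentioned above assigns each candidate a probability proportional to exp(value / temperature). The sketch below shows only that generic mechanism; the candidate values and temperatures are hypothetical, and the paper's directive conditioning and pruning are not modeled.

```python
import math


def boltzmann_policy(values, temperature=1.0):
    """Softmax over candidate values; higher temperature means more exploration."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    m = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical value-function scores for three candidate binders.
scores = [1.0, 2.0, 3.0]
p_explore = boltzmann_policy(scores, temperature=10.0)  # close to uniform
p_exploit = boltzmann_policy(scores, temperature=0.1)   # concentrates on the best
```

Lowering the temperature shifts probability mass toward the highest-scoring candidate, which is the exploitation end of the trade-off.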

[AI-133] Visibility nowcasting in South Korea: a machine learning approach to class imbalance and distribution shift

Link: https://arxiv.org/abs/2605.21507
Authors: Bong Gyun Shin, Chan Sik Lee, Hyesun Suh
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Published in Theoretical and Applied Climatology

Abstract: Atmospheric visibility is a critical variable for transportation safety and air quality management; however, accurate prediction remains challenging due to the complex interactions between meteorological conditions and air pollutants, as well as the rarity of low-visibility events. This study introduces a machine learning framework to nowcast visibility in six major South Korean cities. To handle the imbalance in the 2018-2020 training data, we applied the Synthetic Minority Over-sampling Technique with Nominal and Continuous (SMOTENC) and Conditional Tabular Generative Adversarial Network (CTGAN). An ensemble approach combining machine learning and deep learning models was then used and evaluated on a 2021 test dataset. The results revealed a marked decline in predictive performance in the test set compared to the cross-validation phase. This degradation was attributed to a distributional shift between training and testing periods, which was quantitatively confirmed by measuring the Wasserstein distance of the most influential feature identified by SHAP analysis. In general, this study presents a methodology that aims to simultaneously address the dual challenges of data imbalance and temporal distributional shifts, and emphasizes the necessity of accounting for evolving external environmental factors when implementing nowcasting models on time-series data.
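
Measuring distribution shift on a single feature with the Wasserstein distance can be sketched in a few lines. For two 1-D samples of equal size, the first-order Wasserstein distance reduces to the mean absolute gap between sorted values. The feature values below are made up for illustration; for unequal sample sizes, scipy.stats.wasserstein_distance is a common choice.

```python
def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1-D samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical values of one feature in the training and test periods.
train_feature = [10.0, 12.0, 11.0, 13.0]
test_same = [13.0, 11.0, 10.0, 12.0]               # same distribution, reordered
test_shifted = [v + 2.5 for v in train_feature]    # uniformly shifted by 2.5

no_shift = wasserstein_1d(train_feature, test_same)
shift = wasserstein_1d(train_feature, test_shifted)
```

A pure location shift of 2.5 units yields a distance of 2.5, while a reordering of the same values yields 0.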

[AI-134] Multivariate Financial Forecasting using the Chronos Time Series Foundation Models

Link: https://arxiv.org/abs/2605.21504
Authors: Sanjiv R Das, Taranag Goyal, Mohini Yadav
Affiliations: Unknown
Subjects: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 tables, 3 figures

Abstract:Using Chronos-2, an open-source time-series foundation model, we evaluate pretrained time-series models for economic and financial forecasting with an emphasis on whether multivariate (MV) inputs improve accuracy relative to univariate (UV) baselines. The study covers two panels – the Magnificent-7 equities and U.S. Treasury interest rates – as well as a combined panel, using rolling monthly evaluations from 2000–2025. We vary input window lengths and forecast horizons and report RMSE and MAPE. Across datasets, MV forecasts consistently outperform UV forecasts, with especially strong gains for interest rates and meaningful improvements for equities. Series-level comparisons show MV improvements in every case, and error dispersion is generally lower under MV inputs. We also provide parameter-heatmap and time-series visualizations. However, mixing time series across equity and interest rate markets reduces forecast accuracy, indicating that adding noisy context degrades model performance. Overall, the results indicate that foundation models can leverage cross-series information to improve forecast accuracy in finance, and that the benefits are strongest when related series are modeled jointly under disciplined rolling protocols. Other than using an open-source foundation model, this paper also showcases how AI may be used for financial research.
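
The two error metrics named in the abstract, RMSE and MAPE, follow their standard definitions. The sketch below uses made-up actual and forecast values and does not reproduce the paper's rolling evaluation protocol.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical two-step forecast.
actual_values = [100.0, 200.0]
forecast_values = [110.0, 180.0]

rmse_value = rmse(actual_values, forecast_values)
mape_value = mape(actual_values, forecast_values)
```

RMSE is in the units of the series and penalizes large misses more heavily, whereas MAPE is scale-free, which matters when comparing equity prices with interest rates.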

[AI-135] Graph neural network explanations reveal a topological signature of disease-associated hubs in biological networks

Link: https://arxiv.org/abs/2605.21502
Authors: Kyle Higgins, Ivan Laponogov, Dennis Veselkov, Kirill Veselkov
Affiliations: Unknown
Subjects: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages (excluding supplement), 7 figures, 7 supplementary tables

Abstract:Graph neural networks (GNNs) are increasingly used to model biological systems, yet the reliability of post-hoc explanation methods for recovering meaningful molecular mechanisms remains unclear. Here, we systematically evaluate four widely used approaches: Saliency Attribution (SA), Integrated Gradients (IG), GNNExplainer, and Layer-wise Relevance Propagation (LRP) for identifying disease-relevant structure in breast cancer RNA-seq data projected onto a protein-protein interaction network. Using synthetic benchmarks with known ground-truth motifs, we show that explanation methods recover distinct signal organizations: SA performs best for sparse single-node drivers, whereas IG and LRP preferentially recover distributed pathway-like and cascade-like signals. In TCGA BRCA data, we identify a consistent topological signature of disease-associated hubs in which attribution peaks in the immediate 1-hop neighborhood and decays across successive network shells, a pattern most pronounced for IG and LRP and associated with strong enrichment of known cancer hubs. We further observe a trade-off between local hub enrichment and global gene ranking performance, with IG optimizing local enrichment and SA achieving superior global discrimination. Motivated by these complementary behaviors, we introduce a framework combining a shell-based hub score with consensus ranking across explainers. Consensus scores improve prioritization of canonical cancer genes (TP53, BRCA1, ESR1, MYC), reduce dependence on node degree, and, especially when tuned, outperform individual methods. Pathway enrichment further reveals improved recovery of biologically coherent cancer programs, including ERBB2, RTK, MAPK, immune, and cytokine signaling. Together, these results demonstrate that topology-aware integration of graph explanations can improve biological interpretability and biologically relevant molecular recovery.

[AI-136] Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution

Link: https://arxiv.org/abs/2605.20348
Authors: Christos Spyridon Koulouris, Carlo Campajola
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
Comments:

Abstract: In this paper, we investigate whether deep reinforcement-learning agents interacting in a shared optimal-execution environment can sustain supra-competitive outcomes, in the sense of achieving lower implementation shortfalls than the relevant game-theoretical competitive benchmark. We study a two-agent Almgren-Chriss liquidation game and examine how learned behavior depends on intra-episode environment feedback, the ability to interpret the mid-price, and the agent’s knowledge of the past. We first use ex-ante schedule-learning agents to remove intra-episode feedback and isolate what can arise when agents commit to complete liquidation trajectories before execution begins. We then allow agents to condition on the evolving state using a variety of DDQN architectures. We find that, when agents are given access to intra-episode history, especially recent prices and own past actions, supra-competitive outcomes become substantially more frequent and more persistent. These findings indicate that supra-competitive behavior in this execution game is driven not by multi-agent learning or by current price observation alone, but by feedback, memory, and state-contingent interaction along the realized execution path.

Machine Learning

[LG-0] Integrable Elasticity via Neural Demand Potentials

Link: https://arxiv.org/abs/2605.22820
Authors: Carlos Heredia, Daniel Roncel
Subjects: Machine Learning (cs.LG)
Comments: 44 pages, 7 figures

Abstract:We propose the Integrable Context-Dependent Demand Network (ICDN), a demand-first neural model for multiproduct retail demand. The model learns log-demand as a smooth, context-conditioned function of log-prices, allowing elasticities to be derived exactly from the learned demand surface. On the Dominick’s beer dataset, ICDN improves out-of-sample generalization over a directed log-log benchmark and yields more stable, economically plausible elasticity estimates, especially for weakly identified cross-price effects.

[LG-1] Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

Link: https://arxiv.org/abs/2605.22814
Authors: Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson, Andrea Tagliasacchi, Angjoo Kanazawa
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent’s predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Please see video results at this https URL.

[LG-2] FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

Link: https://arxiv.org/abs/2605.22779
Authors: Huanchi Wang, Zihang Huang, Yifang Tian, Kristina Dzeparoska, Hans-Arno Jacobsen, Alberto Leon-Garcia
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures

Abstract: Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity, but remains challenging: a single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once offline. We annotate at most K labeled lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100, reducing annotation effort by 76x, and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.

[LG-3] Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

Link: https://arxiv.org/abs/2605.22765
Authors: Samson Gourevitch, Yazid Janati, Dario Shariatian, Umut Simsekli, Eric Moulines, Eric P. Xing, Alain Durmus
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: preprint

Abstract:Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at this https URL.

[LG-4] Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

Link: https://arxiv.org/abs/2605.22756
Authors: Christian Janos Lebeda, David Erb, Tudor Cebere, Aurélien Bellet
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments:

Abstract: Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,\delta)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.

[LG-5] Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

Link: https://arxiv.org/abs/2605.22746
Authors: Berk Hayta, Hannah Laus, Simon Mittermaier, Felix Krahmer
Subjects: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Comments:

Abstract:Real-world sensor-based learning systems require uncertainty estimation that is both reliable and computationally efficient. Evidential Deep Learning (EDL) provides single-pass uncertainty estimation by modeling the class probabilities via Dirichlet distributions, where the Dirichlet parameters are predicted by a learned neural network mapping. However, this approach can lead to computational challenges, as Dirichlet expected objectives are more complex than standard supervised learning losses, complicating their analysis and implementation. We address this issue by approximating the objective of the first-order empirical risk minimization problem induced by EDL with a plug-in loss evaluated at the Dirichlet mean and show that, under mild assumptions, the approximation error decays with growing evidence for a broad class of loss functions, including mean-squared error and cross-entropy loss. As a special case, our analysis provides justification for the use of softmax in the context of uncertainty estimation, since under a particular evidence-to-Dirichlet mapping, our framework includes the standard softmax classifier. We validate the proposed simplified objectives on the Google Speech Commands dataset and show that they achieve predictive accuracy and selective prediction performance comparable to classical EDL, while being simpler to implement using standard deep learning losses and training pipelines. To the best of our knowledge, this empirical analysis is the first to obtain coverage-accuracy trade-offs for speech recognition tasks through EDL.

[LG-6] SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

Link: https://arxiv.org/abs/2605.22743
Authors: Javad Parsa, Enis Simsar, Amir Joudaki, Thomas Hofmann, André M. H. Teixeira
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Parameter-efficient fine-tuning enables fast personalization of text-to-image diffusion models, but composing multiple custom concepts remains challenging due to representation interference. Existing modular methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limit expressiveness and concept fidelity. To address this trade-off, we propose Sequential regularized LoRA (SeqLoRA), a constrained continual learning framework that jointly optimizes both LoRA factors via bilevel optimization. Theoretically, we establish strong convergence guarantees for our algorithm and model the residual layer activations as a matrix sub-Gaussian process to derive high-probability bounds on catastrophic forgetting. We further prove that learning the LoRA basis from data minimizes residual interference energy more effectively than frozen-basis methods. Experiments on multi-concept image generation demonstrate that SeqLoRA improves identity preservation and scalability across up to 101 concepts, while avoiding costly fusion and reducing attribute interference in composed generations.

[LG-7] Ternary Decision Trees with Locally-Adaptive Uncertainty Zones

Link: https://arxiv.org/abs/2605.22740
Authors: William Smits
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 4 figures, 5 appendix sections. Submitted to Data Mining and Knowledge Discovery (DAMI)

Abstract:Decision trees partition the feature space using hard binary thresholds, assigning identical confidence to instances far from a decision boundary and to those directly on it. We introduce ternary decision trees, which augment each split node with an uncertainty zone of half-width delta centered on the optimal threshold. Instances in this zone receive predictions formed by weighted blending of both child subtrees and are flagged as boundary-uncertain, signaling that downstream applications may treat these predictions differently. Crucially, delta is computed locally at each node from statistics already available during standard CART split finding, requiring no external noise specification. We propose and evaluate five delta-estimation methods: quality-plateau (plateau width of the split criterion curve), class-overlap (empirical class-distribution overlap), gain-ratio (split quality relative to split entropy), node-bootstrap (threshold variance under node-level resampling), and margin (SVM-inspired distance to the nearest cross-class training example). Evaluated across 72 OpenML-CC18 datasets with 5-fold cross-validation, all five methods with probabilistic routing significantly outperform standard CART on decided accuracy (Wilcoxon signed-rank, p < 0.001). The margin method achieves the best efficiency (0.104 accuracy gain per unit of boundary-uncertain flagging rate), wins on 42 of 72 datasets, and requires zero additional hyperparameters. Analysis on three Breiman synthetic benchmarks reveals that margin is self-calibrating on clean data while node-bootstrap and quality-plateau best track theoretical irreducible error. Experiments on four medical and financial datasets demonstrate practical value: on mammography, node-bootstrap achieves +0.71% decided accuracy by flagging 10.8% of screening cases as boundary-uncertain.
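
The routing rule described above can be illustrated with a single split node. The following is a minimal sketch rather than the paper's implementation: the abstract does not specify the blending weights, so the linear ramp across the zone is an assumption, and the function name `ternary_stump_predict` and its parameters are hypothetical.

```python
def ternary_stump_predict(x, threshold, delta, left_proba, right_proba):
    """Return (class_probabilities, boundary_uncertain) for one feature value x."""
    if delta <= 0 or abs(x - threshold) > delta:
        # Outside the uncertainty zone: ordinary hard CART routing.
        proba = left_proba if x <= threshold else right_proba
        return list(proba), False
    # Inside the zone: the weight of the right child rises linearly from 0 to 1
    # across [threshold - delta, threshold + delta] (assumed weighting scheme).
    w_right = (x - (threshold - delta)) / (2.0 * delta)
    proba = [(1.0 - w_right) * l + w_right * r
             for l, r in zip(left_proba, right_proba)]
    return proba, True


if __name__ == "__main__":
    left, right = [0.9, 0.1], [0.2, 0.8]
    for x in (0.0, 0.9, 1.0, 1.1, 2.0):
        print(x, ternary_stump_predict(x, threshold=1.0, delta=0.2,
                                       left_proba=left, right_proba=right))
```

A downstream consumer could use the returned flag to abstain or defer on boundary-uncertain instances, which is how "decided accuracy" would then be computed on the remaining cases.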

[LG-8] Multiple Neural Operators Achieve Near-Optimal Rates for Multi-Task Learning

Link: https://arxiv.org/abs/2605.22724
Authors: Adrien Weihs, Hayden Schaeffer
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Abstract:We study the approximation and statistical complexity of learning collections of operators in a shared multi-task setting, with a focus on the Multiple Neural Operators (MNO) architecture. For broad classes of Lipschitz multiple operator maps, we derive near-optimal upper bounds for approximation and statistical generalization. On the lower-bound side, we establish a curse of parametric complexity and prove corresponding minimax rates. Together, these results show that shared representations across tasks do not increase the overall cost: multi-task operator learning follows the same scaling laws as single operator learning. We also compare MNO with a multi-task extension of DeepONet based on concatenated task inputs and show that, from a worst-case approximation-complexity perspective, both architectures satisfy essentially the same asymptotic rates.

[LG-9] Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

Link: https://arxiv.org/abs/2605.22719
Authors: Mahdi Nasermoghadasi
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 7 figures

Abstract:We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen’s d| > 0.8). The strongest single correlate of failure – feature 17,491, d=+2.93, Neuronpedia label ‘cryptographic keys’ – is essentially silent except when the prompt’s transferred object is ‘the keys,’ on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact p = 8.79 x 10^-33). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream across all token positions of the 45 keys prompts does not restore accuracy (6.7% -> 4.4%); the feature is a correlate, not a sufficient cause at this layer. (ii) A representation baseline: a logistic regression on the raw 768-dimensional residual stream reaches 5-fold ROC AUC = 0.929, matching the top-100 SAE features (0.927); the SAE basis adds interpretability, not predictive power. (iii) A seed-robustness check: across five random seeds the keys-subset failure rate stays in 75.0–93.3% (the behavioural effect is real), but feature 17,491 is the top-|d| feature in only 1 of 5 runs. The methodological contribution is therefore the audit pipeline (cheap, model-agnostic, surfaces named correlates) rather than any single feature found through it. We release the code, the 300-prompt corpus, the 300x24,576 activation matrix, the ablation and baseline scripts, and the figures. The full pipeline runs on a laptop (Apple M3 Max, no discrete GPU).
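
The two statistics that drive the feature ranking above, Cohen's d and a Holm-corrected threshold, are simple to compute. The sketch below shows textbook versions of both (pooled-standard-deviation Cohen's d and the Holm step-down procedure); it is not taken from the released code, and the abstract does not say which per-feature test produced the p-values, so those are treated as given inputs. The function names are hypothetical.

```python
import math


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    var_a = sum((v - mean_a) ** 2 for v in group_a) / (na - 1)
    var_b = sum((v - mean_b) ** 2 for v in group_b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject


if __name__ == "__main__":
    failed = [2.0, 4.0, 6.0]      # toy activations on failed trials
    succeeded = [1.0, 3.0, 5.0]   # toy activations on successful trials
    print(cohens_d(failed, succeeded))
    print(holm_reject([0.01, 0.04, 0.03]))
```

With 24,576 features, the smallest p-value would have to fall below alpha / 24,576 to survive the first Holm step, which is why only a small fraction of features clear the threshold.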

[LG-10] Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

Link: https://arxiv.org/abs/2605.22703
Authors: Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding, Guoyin Wang, Li Yuan, Jingren Zhou
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.

[LG-11] Posterior Collapse as Automatic Spectral Pruning

Link: https://arxiv.org/abs/2605.22691
Authors: Johannes Hirn
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)

Abstract:We show that posterior collapse in $\beta$-VAEs implements automatic spectral pruning. A latent mode collapses if its contribution to reconstruction is below the cutoff set by $\beta$. Equilibrium solutions with different $\beta$ thus reveal a cascade of collapses as latent modes decouple from least to most useful. We derive this as a consequence of the loss via a Landau stability analysis. We define a latent-rescaling-invariant order parameter that ranks active latent modes and whose collapse thresholds identify which effective variables to inspect first. In the linear Gaussian case, the collapse spectrum, utility spectrum, and normalized PCA spectrum coincide, and each collapse follows a mean-field law. We test these predictions on the WorldClim dataset.

[LG-12] ChronoVAE-HOPE: Beyond Attention – A Next-Generation VAE Foundation Model for Specialized Time Series Classification

Link: https://arxiv.org/abs/2605.22684
Authors: José Alberto Rodríguez, Luis Balderas, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez
Categories: Machine Learning (cs.LG)

Abstract:Time Series Foundation Models (TSFMs) have become a new component of the state-of-the-art in general time series forecasting. However, adapting them to specialized classification tasks remains constrained by two interconnected challenges: the quadratic cost of standard attention mechanisms and the inability to disentangle the structural components underlying time series variability. This technical report introduces ChronoVAE-HOPE, a next-generation TSFM that reconciles massive generalization with structured latent representation for time series classification. The core of the proposal is a Variational Autoencoder (VAE) framework built upon the HOPE Block, which replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. A key architectural novelty is the disentangled latent space, which factorizes representations into independent trend and seasonal components via dedicated encoder heads and separate decoder pathways. ChronoVAE-HOPE undergoes self-supervised pre-training on the Monash archive, combining a Masked Time Series Modeling (MTSM) auxiliary objective with a disentangled VAE reconstruction loss. The pre-trained encoder is subsequently frozen and used to generate fixed-length embeddings for downstream classification on the UCR benchmark datasets. Empirical results demonstrate strong performance across diverse temporal domains, particularly in settings characterized by strict causal structure. ChronoVAE-HOPE establishes a robust and interpretable framework for the adaptation of foundation models to time series classification through structured generative representations.

[LG-13] The Secretary Problem with a Stochastic Precursor

Link: https://arxiv.org/abs/2605.22653
Authors: Franziska Eberle, Alexander Lindermayr
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:In learning-augmented online algorithms, predictions are usually valued for what they say: a value estimate, a solution, or an algorithmic recommendation. This paper shows that predictions can also be valuable solely due to their arrival time. We study the fundamental secretary problem augmented with a stochastic precursor: a content-free signal that is guaranteed to arrive no later than the best item, but is otherwise stochastically timed. The signal does not carry any additional information; nevertheless, its timing alone changes the structure of optimal stopping. We characterize optimal policies in the random-order and adversarial-order models. In random order, a single uniformly timed precursor already gives success probability at least $\frac{1}{2}$, improving on the classic $\frac{1}{e}$ benchmark. With increasingly late precursors, the success probability approaches 1. In adversarial order, for which traditional models do not admit strong guarantees, sufficiently concentrated precursors recover constant success guarantees. Our results show that such novel forms of asynchronous temporal information are a distinct and powerful form of advice in online decision making and may also be effective for other problems.
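
For context on the 1/e benchmark that the precursor improves upon, the sketch below simulates only the classical random-order secretary rule: skip roughly the first n/e candidates, then accept the first one better than everything seen so far. It does not implement the paper's precursor-aware policies, whose details are not given in the abstract. The function name and the default values of `n`, `trials`, and `seed` are arbitrary choices made for this illustration.

```python
import math
import random


def classic_secretary_success_rate(n=50, trials=20000, seed=0):
    """Monte Carlo estimate of the success probability of the classical 1/e rule."""
    rng = random.Random(seed)
    cutoff = int(n / math.e)  # length of the observe-only phase
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))  # larger value = better candidate
        rng.shuffle(ranks)      # uniformly random arrival order
        best_seen = max(ranks[:cutoff]) if cutoff > 0 else -1
        chosen = None
        for value in ranks[cutoff:]:
            if value > best_seen:
                chosen = value  # first candidate beating the observation phase
                break
        wins += chosen == n - 1  # success only if the overall best was picked
    return wins / trials


if __name__ == "__main__":
    print(classic_secretary_success_rate())
```

The estimate should land near 1/e (about 0.37); the abstract's result is that a single uniformly timed precursor lifts the achievable success probability to at least 1/2.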

[LG-14] Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

Link: https://arxiv.org/abs/2605.22644
Authors: Igor Ignashin, Anna Radovskaya, Andrew Semenov, Egor Lopatin, Stanislav Potapov, Aleksandr Kovalenko, Andrey Veprikov, Aleksandr Shestakov, Andrey Leonidov, Aleksandr Beznosikov
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a sqrt(eta) noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker–Planck equation that differs from the standard Langevin form at order eta^2. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.

[LG-15] A note on convergence of Wasserstein policy optimization

Link: https://arxiv.org/abs/2605.22622
Authors: David Šiška, Yufei Zhang
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Wasserstein Policy Optimization (WPO) is a recently proposed reinforcement learning algorithm that leverages Wasserstein gradient flows to optimize stochastic policies in continuous action spaces. Despite its empirical success, the theoretical convergence properties of WPO in environments with continuous state and action spaces have yet to be fully established. In this note, we argue that WPO within the framework of entropy-regularised Markov Decision Processes converges linearly. This is done by leveraging recent advances in mean-field analysis for convergence of gradient flows using log-Sobolev inequalities. Assuming the existence of a sufficiently regular solution to the gradient flow equation, we demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality. Ultimately, these properties allow us to argue that the value function should converge linearly to the global optimum.

[LG-16] UNAD+: An Explainable Hybrid Framework for Unknown Network Attack Detection

Link: https://arxiv.org/abs/2605.22621
Authors: Saif Alzubi, Frederic Stahl
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:The detection of previously unseen network attacks remains a major challenge for intrusion detection systems. Although supervised learning methods often perform well on known attack classes, they are limited when new attack types are not represented in the training data. Unsupervised methods are more suitable for detecting zero-day attacks, as they do not require labelled attack samples, but they often suffer from high false positive rates, which limits their real-world usefulness. This paper presents UNAD+, an enhanced framework for unknown network attack detection derived from the previously proposed Unknown Network Attack Detector (UNAD). UNAD+ combines a benign-only unsupervised ensemble with Weighted Majority Voting (WMV), a supervised refinement stage trained on pseudo-labelled detections, and a post hoc explainability layer that provides both local and global explanations. The framework was evaluated on the CICIDS2017 and NSL-KDD benchmark datasets. The results show that UNAD+ improves on the original UNAD framework, achieving F1-scores above 98% across the benchmark datasets while significantly reducing false positives and enhancing transparency and deployment suitability through integrated explainability.

[LG-17] Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery

Link: https://arxiv.org/abs/2605.22613
Authors: Halil Alperen Gozeten, Xuechen Zhang, Emrullah Ildiz, Ege Onur Taga, Tara Javidi, Samet Oymak
Categories: Machine Learning (cs.LG)

Abstract:Recent LLM-guided evolutionary search methods have shown that iterative program mutation can discover strong algorithms, but they typically optimize each task independently, even when related tasks share reusable structure. We introduce Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery, and propose EMO-STA (Shared-Then-Adapt), a two-stage framework that first evolves a shared archive of executable programs across a task family and then adapts selected shared candidates to each target task. Within EMO-STA, we explore multiple adaptation strategies, including warm-starting from the shared archive, adapting the best average shared program, and adapting the shared program that performs best on each target task. Across eight task families spanning continuous optimization, geometric construction, modeling, and algorithmic optimization, EMO-STA improves over matched-compute single-task evolution in most settings, with STA Best-Local providing the strongest in-distribution adaptation and STA Best-Shared yielding robust transfer to unseen tasks. Compute-allocation experiments show that allocating a substantial fraction of the family-level budget to shared evolution is consistently beneficial, with roughly balanced shared and adaptation budgets often being optimal. Beyond compute efficiency, we show that shared evolution can mitigate overfitting in low-evidence settings (e.g. few training data), including ARC tasks and time-series feature engineering, by favoring programs that generalize across all tasks rather than exploiting task-specific brittle artifacts.

[LG-18] Benchmarking Machine Learning Architectures for Antimicrobial Stewardship in Pediatric ICUs

Link: https://arxiv.org/abs/2605.22611
Authors: Niklas Raehse, Luregn J. Schlapbach, Daphné Chopard
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 6 figures, code: this https URL

Abstract:Antimicrobial stewardship (AMS) is critical in pediatric intensive care units (PICUs), where diagnostic uncertainty often drives broad-spectrum antibiotic use, increasing antimicrobial resistance and potential long-term harms. Machine learning offers a promising approach for identifying patient-level opportunities for stewardship interventions from electronic health record data, yet prior work has focused largely on adult populations and static tabular representations. We present a systematic benchmarking study of AMS intervention prediction in the PICU across a public dataset and a private institutional cohort. We define four clinically relevant proxy targets for reducing antibiotic exposure: intravenous-to-oral switching, de-escalation, discontinuation, and short-course therapy. Under a unified evaluation framework, we compare tabular, sequence-based, and graph-based temporal models at multiple temporal resolutions. We find that predictive performance is driven primarily by target prevalence and dataset characteristics rather than model complexity. Sequence models improve the precision-recall trade-off over tabular approaches at coarse (24-hour) resolution, while finer temporal modeling provides limited additional benefit. However, these gains come at the cost of poorer calibration, with simpler tabular models yielding more reliable probability estimates. Multi-task learning produces only marginal improvements, suggesting limited shared structure across stewardship targets. Our findings highlight the importance of target design, temporal representation, and calibration in clinical machine learning, and provide practical guidance for developing reliable decision support systems for pediatric AMS.

[LG-19] Factored Diffusion Policies: Compositionally Generalized Robot Control with a Single Score Network

Link: https://arxiv.org/abs/2605.22596
Authors: Sayan Mitra, Ege Yuceel, Noah Giles, Abhishek Pai
Categories: Machine Learning (cs.LG)

Abstract:Robotic tasks are typically specified by a tuple of factors, such as the object to be grasped, the obstacles to be avoided, the color of the target, and so on. Collecting expert demonstrations for every combination of factor values grows combinatorially. We present factored diffusion policies: a single shared diffusion network trained with per-factor null-token dropout, whose score decomposes additively across factors at inference. Under approximate conditional independence between factors given the action-observation pair, this composition approximates the true joint score with a bounded uniform error, reducing the training-task budget from a product of factor cardinalities to a sum. A trajectory-tube certificate chains this score-level bound through the reverse-time sampling ODE and a contracting tracking controller into a closed-loop state-trajectory tube whose radius factors into an ODE-sensitivity constant and a per-factor score-error budget. Unlike compositional-diffusion methods for control that combine separately trained networks, we use one shared network. Drone racing experiments confirm both the generalization bound and the certificate. On state-based multi-gate racing, the factored policy passes 90% of held-out gates – matching an oracle – while a K-network composition baseline collapses to 3%; on vision-based single-gate traversal, it transfers zero-shot to an unseen venue with +11.7pp success-rate gain and 2.4X crash-rate reduction.
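
The additive score decomposition can be sketched with a toy example. The composition rule used below, the unconditional score plus one correction per factor, is the usual classifier-free-guidance-style form and is assumed here rather than taken from the paper. `compose_scores`, `toy_score`, and the use of `None` as the null token are all hypothetical names and conventions for this illustration.

```python
def compose_scores(score_fn, action, factors, null=None):
    """Additive composition: s(a | null) + sum_i [s(a | only factor i) - s(a | null)]."""
    all_null = [null] * len(factors)
    base = score_fn(action, all_null)
    total = list(base)
    for i, value in enumerate(factors):
        cond = list(all_null)
        cond[i] = value  # condition on factor i only; others stay at the null token
        single = score_fn(action, cond)
        total = [t + (s - b) for t, s, b in zip(total, single, base)]
    return total


def toy_score(action, cond):
    """Toy 1-D score: N(0, 1) prior plus one unit-variance pull per active factor.

    The factors are conditionally independent given the action by construction,
    so the composed score should match the jointly conditioned score exactly.
    """
    a = action[0]
    score = -a
    for target in cond:
        if target is not None:
            score += target - a
    return [score]


if __name__ == "__main__":
    action, factors = [0.5], [1.0, -2.0]
    print(compose_scores(toy_score, action, factors))  # composed from single-factor calls
    print(toy_score(action, factors))                  # jointly conditioned reference
```

Because a single network is queried with different null-token patterns, the number of forward passes grows with the number of factors, while training only needs data covering each factor rather than every combination.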

[LG-20] Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?

Link: https://arxiv.org/abs/2605.22593
Authors: Pedro C. Vieira, Pedro Ribeiro, Viacheslav Borovitskiy
Categories: Machine Learning (cs.LG)

Abstract:While deep ensembles are widely considered to be the default method for uncertainty quantification in deep learning, their effectiveness for graph-structured data is often simply assumed based on successes in domains like computer vision. We investigate standard deep ensembles specifically for message-passing graph neural networks. Benchmarking across seven datasets representing varied tasks and complexities, we reveal that ensembles provide surprisingly little improvement over a single model. Instead, the observed marginal gains stem primarily from stabilizing optimization noise in point predictions rather than yielding meaningfully better uncertainty estimates. Through an aleatoric-epistemic decomposition, we identify epistemic collapse: independently trained networks consistently converge to overly similar predictions. Because disagreement is the fundamental mechanism through which ensembles capture epistemic uncertainty, this lack of diversity neutralizes their key advantage. Analyzing this phenomenon further, we suggest this collapse is driven by functional rather than weight-space convexity, where distinct parameter solutions induce almost identical behavior. Our results suggest that deep ensemble success does not seamlessly transfer to graph machine learning.
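
The link between member disagreement and epistemic uncertainty can be made concrete with one common entropy-based decomposition: total predictive entropy equals the average member entropy (aleatoric) plus the mutual information between prediction and ensemble member (epistemic). Whether the paper uses exactly this decomposition is an assumption, and the function names below are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose_uncertainty(member_probs):
    """member_probs: per-member class-probability vectors for a single input."""
    m = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / m for c in range(k)]
    total = entropy(mean)                                  # predictive entropy
    aleatoric = sum(entropy(p) for p in member_probs) / m  # expected member entropy
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic


if __name__ == "__main__":
    collapsed = [[0.7, 0.3]] * 5            # members agree: no epistemic signal
    diverse = [[1.0, 0.0], [0.0, 1.0]]      # members disagree completely
    print(decompose_uncertainty(collapsed))
    print(decompose_uncertainty(diverse))
```

When all members output nearly the same distribution, the epistemic term is close to zero regardless of ensemble size, which is the "epistemic collapse" behaviour the abstract describes.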

[LG-21] GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving ICML2026

Link: https://arxiv.org/abs/2605.22566
Authors: Ao Li, Shangpeng Yang, Fahao Chen, Tianheng Xu, Peng Li, Zhou Su
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026

Abstract:Large Language Model (LLM)-based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow-assisted agent serving systems typically rely on predefined templates and shallow matching mechanisms, which limit their ability to capture deep semantic relationships and generalize to previously unseen tasks. To address these limitations, we propose a new workflow management paradigm that represents workflows using a unified graph, termed wGraph, where each node corresponds to an atomic operation. wGraph serves as a shared substrate from which task-specific workflows are dynamically instantiated. Building on wGraph primitives, we introduce GraphFlow, a system that efficiently integrates workflows into agent serving through two key designs. First, adaptive workflow generation dynamically constructs workflows from wGraph based on task semantics and constraint requirements. Second, workflow state management exploits wGraph structure to efficiently manage Key-Value (KV) caches, reducing redundant computation during agent serving. Extensive experiments across five benchmark datasets show that GraphFlow consistently outperforms state-of-the-art methods, yielding an average performance improvement of approximately 4.95 percentage points, while achieving an approximately 4$\times$ reduction in memory footprint.

[LG-22] Regret-Based (ε, δ)-optimal Stopping Criteria for Bayesian Optimization

Link: https://arxiv.org/abs/2605.22561
Authors: Haowei Wang, Jingyi Wang, Qiyu Wei
Categories: Machine Learning (cs.LG)
Comments: 21 pages

Abstract:Bayesian optimization (BO) is a widely used iterative black-box optimization method that utilizes Gaussian process (GP) surrogate models. In practice, BO is typically terminated after a fixed evaluation budget is exhausted, which can incur unnecessary cost and provides no optimality guarantee on solution quality. Recent research in developing a practical stopping criterion has made empirical progress, yet a theoretically sound stopping criterion remains a work in progress. In this work, we present provably tighter instantaneous regret bounds for GP upper confidence bound (GP-UCB) at any given iteration. Then, we propose stopping criteria for GP-UCB based on this tighter bound that ensure an $\epsilon$-optimal solution with high probability $1-\delta$ upon termination. Numerical experiments are performed to validate and demonstrate the effectiveness and efficiency of our stopping criteria.

[LG-23] Neural Flow Operators can Approximate any Operator: Abstract Frameworks and Universal Approximations

Link: https://arxiv.org/abs/2605.22557
Authors: Shuang Chen, Juncai He, Xue-Cheng Tai
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:We introduce an abstract neural flow framework for neural networks and neural operators. The framework contains two continuous-depth models, namely neural flows with composition and separation structures, and covers both finite-dimensional function approximation and infinite-dimensional operator approximation. We prove well-posedness and universal approximation properties for the corresponding neural flows, including, to the best of our knowledge, the first universal approximation result for flow-based models between infinite-dimensional spaces. We also obtain universal approximation results for convolutional neural flow models. Through suitable time discretizations, the composition structure recovers ResNet-type architectures, while the separation structure, via a splitting-based discretization, yields plain architectures. This gives a unified flow-based route to both residual and plain architectures for neural networks and neural operators with fully connected or convolutional linear layers.

[LG-24] ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation

Link: https://arxiv.org/abs/2605.22556
Authors: Haoan Feng, Xin Xu, Leila De Floriani
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Abstract:Digital elevation models (DEMs) underpin terrain analysis in Geographic Information Systems (GIS), but in their common raster form, they rely on interpolation for off-grid sampling and finite-difference operators for derivative-based analysis. Implicit neural representations (INRs) offer a continuous alternative, but prior terrain INRs lack explicit frequency control, neglect the gradient structure of terrain, and remain too large and costly to train for practical deployment. We present ImplicitTerrainV2, which advances terrain INRs toward a compact, efficient neural terrain data format by combining a spectral control mechanism with wavelet-guided spatial adaptivity, derivative-aware supervision, and post-training model compression. At its core, a wavelet complexity field (WCF) derives spatially-adaptive frequency masks from analytically computed wavelet coefficients, localizing high-frequency capacity to complex terrain regions. The same field guides complexity-aware adaptive sampling that concentrates training in high-complexity regions, while gradient matching applies extra supervision to enforce the smooth manifold structure of terrain DEMs for improved derivative fidelity. Post-training mixed-precision quantization and entropy coding reduce storage to 1.23 bpp with a 0.28 dB PSNR drop. On 50 Swiss terrain tiles, ImplicitTerrainV2 reaches 66.25 dB end-to-end PSNR, improving over the prior work by 5.70 dB while using 3.2x fewer parameters and training in 55 s per tile on a single GPU. Our compressed neural format is competitive with several established DEM codecs in rate-distortion performance, while additionally supporting off-grid point queries, closed-form derivative evaluation, and resolution-independent reconstruction, which may benefit many downstream GIS applications.

[LG-25] F-TIS: Harnessing Diverse Models in Collaborative GRPO ICML2026

Link: https://arxiv.org/abs/2605.22537
Authors: Nikolay Blagoev, Oğuzhan Ersoy, Wendelin Boehmer, Lydia Yiyu Chen
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)

Abstract:Reinforcement learning methods such as GRPO have seen great popularity in LLM post-training. In GRPO, models produce completions to a set of prompts, which are rewarded, and the policy is updated towards the relatively high-reward completions. Due to the auto-regressive nature of models, the generation phase of this style of training can be extremely time-consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models in training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with varying compute resources and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models, that is, different models collaborating on the same tasks. However, this leads to highly off-policy samples being presented during training, and prior work has shown that off-policy samples can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve the local model’s learning. Our framework allows various models to collaborate in the same RL training run while being communication-efficient. We extensively evaluate F-TIS in various heterogeneous setups and show that it exhibits final model convergence identical to purely on-policy training. Furthermore, in some setups we observe better generalization on out-of-distribution tasks than with on-policy training, increasing the model’s performance by up to 12%.

[LG-26] Relational Linear Properties in Language Models: An Empirical Investigation

Link: https://arxiv.org/abs/2605.22532
Authors: Giovanni Valer, Luigi Gresele, Marco Bronzini, Emanuele Marconato
Categories: Machine Learning (cs.LG)

Abstract:Linear properties are ubiquitous in the representations of language models; however, testing them experimentally remains a challenging task. This work focuses on relational linearity: the hypothesis that, for a fixed relation (e.g., “plays”), the unembedding of an object (e.g., “trumpet”) can be predicted from the embedding of its subject (e.g., “Miles Davis”) by a linear map. We present an experimental method to test the formulation of relational linearity by Marconato et al. (2025). Specifically, we introduce a probing method, based on Kullback-Leibler divergence, to evaluate this property and examine its variation across layers and paraphrased relational queries. The method is also more efficient than previous work; for example, it avoids the crude Jacobian approximations used in Linear Relational Embeddings by Hernandez et al. (2024). Our findings across four datasets show that relational linearity varies across models, exhibits layer-wise patterns consistent with prior observations about linguistic information in model representations, and is affected differently by changes in how the relation is phrased.

[LG-27] Disentanglement Beyond Generative Models with Riemannian ICA

Link: https://arxiv.org/abs/2605.22531
Authors: Edmond Cunningham
Categories: Machine Learning (cs.LG)
Comments:

Abstract:There is a gap between the theoretical foundations of disentanglement and the practice of modern representation learning. Existing theoretical frameworks, particularly Independent Component Analysis (ICA) and its nonlinear variants, assume a generative model with statistically independent latent variables underlying the data so that disentanglement amounts to identifying the latents that could have generated the data. This generative framework is interpretable and theoretically justified, but its strong assumptions make it difficult to apply to modern representation learning. Modern pretrained encoders often learn features that exhibit disentangled properties without making generative assumptions, yet there is no general theory for interpreting these features as independent factors of variation. We take a step toward such a theory by introducing Riemannian ICA (RICA), which replaces ICA’s global generative model with local geometric structure. RICA is founded on the observation that in ICA, the factors of variation underlying a data point can be understood through radial curves emanating from the point that map to axis-aligned lines in the latent space. We formalize this perspective using Riemannian geometry and introduce our theory in a way that is consistent with the existing generative approach. Our main contribution is the disentanglement tensor, which encodes a second-order notion of disentanglement that we call pointwise disentanglement. This tensor depends on the Hessian of the data log likelihood as well as the Ricci curvature induced by the model. In a controlled source recovery setting with known ground-truth sources, RICA recovers sources across several manifolds, while the success of ICA baselines depends on the coordinates used to represent the observations. Our work provides a theoretical basis for studying local disentanglement without assuming a global generative model.

[LG-28] Generative Modeling by Value-Driven Transport

Link: https://arxiv.org/abs/2605.22507
Authors: Pablo Moreno-Muñoz, Adrian Müller, Gergely Neu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the optimal value function of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated value-driven transport (VDT) policies, which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.

[LG-29] EnCAgg: Enhanced Clustering Aggregation for Robust Federated Learning against Dynamic Model Poisoning

Link: https://arxiv.org/abs/2605.22506
Authors: Tianyun Zhang, Zhen Yang, Haozhao Wang, Ru Zhang, Yongfeng Huang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Federated learning faces increasing threats from model poisoning attacks, which undermine its use for privacy-preserving learning. Existing defense methods typically rely on fixed thresholds or perform clustering with a fixed number of clusters to distinguish malicious gradients from benign ones. However, these methods are difficult to adapt to the dynamic poisoning strategies of malicious clients, and often result in the loss of benign gradients due to the heterogeneity of clients’ local datasets. To address these problems, we propose a novel robust aggregation method that leverages a small number of known benign clients as references, enabling accurate identification and filtering of malicious gradients while retaining as many benign gradients as possible, even when the number of malicious clients is unknown and variable. First, we introduce a density-based low-dimensional gradient clustering method, which projects gradients onto the two most divergent dimensions and applies density-based clustering to identify malicious gradients while retaining clustered benign gradients and potentially benign outliers. Second, we design a clustering-enhancing low-dimensional gradient generator model, which learns to generate pseudo-gradients aligned with the boundary of the benign cluster. These pseudo-gradients act as bridges to connect sparse benign gradient outliers. Third, we introduce low-dimensional gradient re-clustering that clusters the generated pseudo-gradients together with real gradients to recover benign gradients misclassified as noise points, enabling more benign gradients to participate in aggregation. Extensive experiments on the MNIST, CIFAR-10, and MIND datasets demonstrate that our method exhibits superior fidelity and robustness under dynamic poisoning scenarios.

[LG-30] The Signal in the Noise: OOD Detection Through Goodness-of-Fit Testing in Factorised Latent Spaces

Link: https://arxiv.org/abs/2605.22496
Authors: Philipp Bomatter, Jack Geary, Henry Gouk
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep generative models offer a natural foundation for out-of-distribution (OOD) detection, yet prior work has shown that their assigned likelihoods are notoriously unreliable indicators for in- vs out-of-distribution data. In this paper, we address this problem by leveraging the diffeomorphic and mass-preserving properties of continuous normalising flows. Our analysis shows that OOD samples are mapped to noise samples that are highly atypical under the noise prior in ways not captured by the likelihood. Based on this observation, we propose a new method – Signal in the Noise (SITN) – for OOD detection on the single-sample level. SITN requires no access to OOD data, incurs minimal computational overhead, and provides strict control of false positive rates. Comprehensive evaluations through standard benchmarks and synthetic perturbations highlight the method’s effectiveness and the absence of the complexity bias inherent to likelihood-based methods.

[LG-31] Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

Link: https://arxiv.org/abs/2605.22488
Authors: Ishita Darade, Sushrut Thorat
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:Structured prompts require integrating components according to task-relevant relations. How a network implements this integration is often hard to judge in language or vision, where those relations are rarely specified precisely enough to define a candidate internal algorithm. Arithmetic offers a cleaner setting. We study a Transformer trained on base-digit extraction: given $N$, $B$, and $D$, it must report the coefficient of $B^D$ in the base-$B$ expansion of $N$. The closed-form solution, $\lfloor N/B^D \rfloor \bmod B$, provides explicit candidate algorithmic intermediates. Across three seeds, the model reaches 99.83% exact-answer accuracy on held-out number-base intersections, establishing reliable task competence. Linear probes decode the intermediates, making staged arithmetic computation plausible. Causal tests then separate representation from use: within the localized route from the stream with $D$ as input to the output positions, behavior depends on early $D$-selective communication, independent of $N$ and $B$. Relatedly, a sparse circuit search finds mostly separate $N$, $B$, and $D$ routes that combine late rather than the staged route suggested by the probes. Thus, the model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.
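
To make the task in this abstract concrete, the closed-form solution $\lfloor N/B^D \rfloor \bmod B$ can be written as a short function. This is only an illustrative sketch of the arithmetic target, not the paper's model or code; the function name `base_digit` and the input checks are assumptions introduced here.

```python
def base_digit(n: int, b: int, d: int) -> int:
    """Coefficient of b**d in the base-b expansion of n (hypothetical helper)."""
    if n < 0 or b < 2 or d < 0:
        raise ValueError("expects n >= 0, b >= 2, d >= 0")
    shifted = n // (b ** d)  # first candidate intermediate: floor(N / B^D)
    return shifted % b       # second step: reduce modulo B


# 2026 = 2*10^3 + 0*10^2 + 2*10^1 + 6*10^0, so d = 0..3 picks out 6, 2, 0, 2.
digits = [base_digit(2026, 10, d) for d in range(4)]
```

The two lines inside the function correspond to the kind of staged intermediates that the abstract says are decodable by probes, even though the identified causal route does not carry them to the output.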

[LG-32] When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks

Link: https://arxiv.org/abs/2605.22481
Authors: Donald Flynn, Hadas Yaron Goldhirsh, Jonathan P. Keating, Inbar Seroussi
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Backdoor poisoning attacks behave counter-intuitively in high dimensions: stronger training triggers can help the defender. We study regularised generalised linear models on Gaussian-mixture data in the proportional regime ($p/n \to \kappa$), varying the training trigger strength $\alpha$ against a fixed test trigger. Three phenomena emerge: (i) clean test accuracy increases with $\alpha$; (ii) attack success peaks at a finite $\alpha$ and then declines; and (iii) the most damaging trigger direction is the minimum eigenvector of the data covariance. We prove all three results in closed form for the squared loss, and extend (i) and (ii) to general convex GLM losses via a Gaussian-proxy fixed-point system. We identify a finite-sample noise floor proportional to $\kappa$ as the mechanism behind (i), invisible to classical $n \gg p$ analysis. Experiments on CIFAR-10 and Gaussian surrogates match the theory closely; ResNet-18 experiments show the same phenomena beyond the convex setting.

[LG-33] Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning

Link: https://arxiv.org/abs/2605.22472
Authors: Julian Gutheil (1), Simon Hitzginger (1), Robert Legenstein (1) ((1) Graz University of Technology)
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Winner-take-all (WTA) networks constitute a central circuit motif in cortical networks of the brain. In addition, WTA-like activations are abundant in modern deep learning models in the form of the softmax activation, for example in the attention layers of transformers. While their role in the extraction of latent factors has been studied for relatively simple generative models, their role in the context of highly non-linearly entangled latent factors has remained elusive. In this article, we show that, under certain well-defined conditions, a WTA bottleneck within a deep neural network can enforce the extraction of categorical latent factors of the data in a multi-task learning setup. In particular, we prove that the representation that emerges in the WTA bottleneck is highly symbolic, where a single neuron or a population of neurons encodes the presence of a single abstract feature such as a specific object, color, or position. We furthermore show empirically on two datasets that this also holds for architectures and setups that do not fully comply with the assumptions of our theorem, and we demonstrate the advantages of the acquired symbolic representation for generalization. Our proposed model provides insights into the generalization capabilities of deep neural networks with WTA-like components and may serve as an interface between symbolic and subsymbolic AI systems.

[LG-34] Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

Link: https://arxiv.org/abs/2605.22471
Authors: Maya Bechler-Speicher, Gilad Yehudai, Gil Harari, Clayton Sanford, Amir Globerson, Joan Bruna
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Transformers have become a central architecture for graph learning, but their application to graphs requires first choosing a tokenization: a graph-to-token map that determines which structural information is exposed at the input. In this work, we show that this choice is a fundamental component of transformer expressivity. We examine three tokenizations that serve as building blocks for many existing graph tokenizations: spectral, random-walk, and adjacency tokenizations. We prove that different tokenizations induce distinct depth regimes: the same graph computation may be realizable by a shallow transformer under one tokenization, while requiring substantially larger depth under another. For example, we prove that random-walk tokenization is lossy for any walk length, making it impossible in general to recover the graph from it, and that while spectral tokenization is lossless, it is ill-conditioned for local tasks. We further show that although both random-walk and spectral tokenizations are derived from adjacency information, it is impossible for a limited-depth transformer to convert between tokenization families in general. In particular, we establish lower bounds and impossibility results showing that unfavorable tokenizations may preclude the efficient recovery of more suitable structural representations. Finally, we complement our theory with controlled experiments on synthetic and real-world tasks, validating the predicted separations and showing that different tasks favor different structural views and that combining complementary tokenizations allows the transformer to leverage distinct signals from each representation.

[LG-35] AMUSE: Anytime Muon with Stable Gradient Evaluation

Link: https://arxiv.org/abs/2605.22432
Authors: Jueun Kim, Baekrok Shin, Jihun Yun, Beomhan Baek, Minhak Song, Chulhee Yun
Categories: Machine Learning (cs.LG)
Comments: 41 pages, 25 figures

Abstract:Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon’s strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon’s orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon’s rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.

[LG-36] Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

Link: https://arxiv.org/abs/2605.22416
Authors: An Xuan Nguyen
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: 11 pages, 8 figures, 6 tables. Code and reproducibility artifacts at this https URL

Abstract:Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out-of-Memory events drop 7.6% and request throughput improves 1.83x to 13.3x across synthetic workloads and 2.36x on ShareGPT. All gains hold under paired-bootstrap 95% confidence intervals. A phase-time breakdown reveals two distinct mechanisms: shorter OOM recovery on capacity-pressured workloads, and faster allocation calls on KV-heavy workloads. Implementation is pure Python; Triton integration is future work.

[LG-37] Minimum Description Length based Granular-Ball Tree Regularization for Spectral Clustering

Link: https://arxiv.org/abs/2605.22410
Authors: Zeqiang Xian, Caihui Liu, Yong Zhang, Wenjing Qiu
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 5 figures, 6 tables

Abstract:Spectral clustering largely depends on the affinity graph, yet constructing a graph that preserves reliable local connectivity while adapting to heterogeneous data structures remains challenging. Existing granular-ball-based spectral clustering methods usually reduce graph complexity by using coarse-grained representatives. However, the learned local regions are often treated as graph nodes or anchors, and their structural information is not sufficiently used to regularize the original sample-level graph. To address this issue, this paper proposes a Minimum Description Length based Granular-Ball Tree-Regularized Spectral Clustering method, termed MDL-GBTRSC. The proposed method constructs a granular-ball tree through local MDL model selection, with reciprocal neighborhood continuity used to discourage splits that break reliable local connections. The stable leaf balls obtained from the tree provide coding-scale information for regularizing the sample-level affinity graph. In addition, a shared-neighbor bridge code is introduced to adjust weak local bridge relations without requiring an additional user-specified threshold. In this way, MDL-GBTRSC connects interpretable local representation learning with affinity graph construction in a unified spectral clustering framework. Experiments on real and synthetic datasets show that MDL-GBTRSC achieves the best average ARI and NMI under the adopted fixed-configuration protocol compared with classical spectral clustering baselines and representative granular-ball, micro-cluster, and anchor-based methods.

[LG-38] Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings Across Human fMRI and Macaque Electrophysiology

Link: https://arxiv.org/abs/2605.22401
Authors: Nils Leutenegger
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: 9 pages, 6 figures

Abstract:Does the relationship between learning rules and brain alignment generalize across species? We extend our prior finding that untrained CNNs match backpropagation at human V1 by testing the same five learning rules against macaque electrophysiology. The rules are backpropagation (BP), feedback alignment (FA), predictive coding (PC), spike-timing-dependent plasticity (STDP), and an untrained random-weights baseline. The macaque data come from two datasets: MajajHong2015 (V4/IT, 3,200 stimulus presentations, 88/168 neurons) and FreemanZiemba2013 (V1/V2, 135 stimuli, 102/103 neurons). Using RSA with identical model weights from our human study, we find: (1) all models achieve higher alignment with macaque early visual cortex (rho = 0.15-0.30 at V1/V2) than with human fMRI (rho = 0.01-0.08), consistent with the higher signal-to-noise ratio of electrophysiology; (2) STDP and PC produce the highest macaque V1/V2 alignment (rho ~ 0.30 and 0.28), consistent with their leading position among trained rules in human V1; (3) at IT, learning rule rankings show no detectable correlation across species (Kendall’s tau = 0.00, p = 1.00), though this null result is expected given that n = 5 provides power only at tau = +/-1.0, and is further confounded by stimulus set differences; (4) a pretrained ResNet-50 (ImageNet) achieves rho = 0.25 at macaque IT, substantially above all custom CNN conditions (rho = 0.07-0.14), suggesting IT alignment is limited by model capacity and training data rather than by the learning rule. Noise ceilings, multi-seed variability (5 seeds), and a stimulus-control analysis are reported. These results demonstrate that early visual alignment is robust across species, while higher-area alignment is modulated by model capacity and stimulus domain.
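
For readers unfamiliar with RSA (representational similarity analysis), the comparison behind the rho values above can be sketched in a few lines. This is a generic, minimal version that assumes a correlation-distance RDM and a Spearman comparison of the upper triangles; the abstract does not state the paper's exact distance measure or noise-ceiling procedure, and all function names and toy data here are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def rdm_upper(responses):
    """Upper triangle of the RDM; responses[i] is the response vector for stimulus i.
    Dissimilarity is assumed to be 1 - Pearson r."""
    pairs = combinations(range(len(responses)), 2)
    return [1.0 - pearson(responses[i], responses[j]) for i, j in pairs]


def ranks(values):
    """Average ranks (1-based), so tied values share a rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def rsa_score(model_responses, brain_responses):
    """Spearman correlation between the two RDM upper triangles."""
    a, b = rdm_upper(model_responses), rdm_upper(brain_responses)
    return pearson(ranks(a), ranks(b))


# Toy data: 4 stimuli x 4 units; the "brain" is an affine copy of the model.
model = [[1, 2, 3, 5], [2, 1, 4, 3], [5, 3, 1, 2], [1, 5, 2, 4]]
brain = [[2 * v + 1 for v in row] for row in model]
score = rsa_score(model, brain)
```

Because the comparison happens at the level of dissimilarity structure, the two systems need not share units or dimensionality, which is what allows the same model weights to be scored against both fMRI voxels and recorded neurons.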

[LG-39] A Posterior-Predictive Variance Decomposition for Epistemic and Aleatoric Uncertainty in Wind Power Forecasting

Link: https://arxiv.org/abs/2605.22390
Authors: Yinsong Chen, Samson S. Yu, Kashem M. Muttaqi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Accurate wind power forecasting requires reliable uncertainty quantification, yet most existing methods report a single predictive uncertainty that conflates epistemic and aleatoric sources. This paper applies the law of total variance to the joint setting of heteroscedastic neural network regression and Bayesian posterior approximation, deriving an explicit decomposition of total uncertainty (TU) into aleatoric (AU) and epistemic (EU) components. The resulting estimators are compatible with standard posterior-approximation methods and with $\beta$-NLL training to regulate the mean–variance learning trade-off. A wind power–specific evaluation framework is proposed to validate disentanglement without access to ground-truth uncertainty labels, comprising three modules: controlled synthetic experiments to verify responses to heteroscedastic noise and distribution shift; data-property–driven validation on a real-world wind turbine SCADA dataset; and dataset-size scaling experiments to examine the predicted asymptotic behavior of EU. Across synthetic and real-world experiments, the decomposed AU and EU components respond in theoretically consistent directions to noise structure, distributional shift, and training-scale variation, supporting the theoretical consistency and operational utility of the proposed decomposition and evaluation protocol.
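
The law of total variance mentioned in this abstract gives TU = AU + EU, where AU is the posterior average of the predicted noise variance and EU is the variance of the predicted mean across posterior samples. The sketch below is a generic Monte Carlo estimate under that textbook identity; it assumes each posterior sample (for example an ensemble member or a dropout pass) returns a mean and a variance for the same input, and the function name is hypothetical rather than taken from the paper.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split total predictive variance for one input.

    means[k], variances[k]: predictive mean and variance from the k-th
    posterior sample (assumed inputs, e.g. ensemble members).
    Returns (total, aleatoric, epistemic).
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one (mean, variance) pair per posterior sample")
    aleatoric = fmean(variances)  # E_theta[ Var(y | x, theta) ]
    epistemic = pvariance(means)  # Var_theta( E[y | x, theta] )
    return aleatoric + epistemic, aleatoric, epistemic


# Two posterior samples that disagree on the mean and on the noise level.
tu, au, eu = decompose_uncertainty(means=[1.0, 3.0], variances=[0.5, 1.5])
```

In this toy case the two samples disagree by 2 units in the mean, so the epistemic term is as large as the average predicted noise; with more training data one would expect the means to agree more closely and EU to shrink, which is the asymptotic behavior the abstract says is examined.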

[LG-40] Hybrid Kolmogorov-Arnold Network and XGBoost Framework for Week-Ahead Price Forecasting in Australia's National Electricity Market

Link: https://arxiv.org/abs/2605.22387
Authors: Houxuan Zhou, Sriram Prasad, Chenghao Huang, Jiajie Feng, Hao Wang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: The 24th IEEE International Conference on Industrial Informatics, 2026

Abstract:Accurate electricity price forecasting (EPF) is essential for market participants to support operational planning and risk management, yet remains challenging due to strong volatility, nonlinear dynamics, and frequent extreme price spikes. These challenges are particularly pronounced in the Australian National Electricity Market (NEM), where high renewable penetration further increases uncertainty. This paper investigates week-ahead electricity price forecasting and proposes a hybrid KAN+XGBoost framework that integrates Kolmogorov-Arnold Networks (KAN) with tree-based learning. The proposed approach combines the global nonlinear representation capability of KAN with the local robustness of XGBoost to capture both long-term dependencies and short-term price fluctuations. Experiments are conducted on real-world NEM data using an expanding window evaluation strategy. The results demonstrate that the proposed model outperforms benchmark methods, including SARIMAX, Long Short-Term Memory (LSTM), standalone KAN, and XGBoost, reducing MAE by approximately 12% compared to XGBoost and by over 50% compared to a naive baseline. The results suggest that hybrid learning strategies provide an effective and robust solution for electricity price forecasting in highly dynamic electricity markets.

[LG-41] Efficient Higher-order Subgraph Attribution via Message Passing ICML2022

Link: https://arxiv.org/abs/2605.22385
Authors: Ping Xiong, Thomas Schnake, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima
Categories: Machine Learning (cs.LG)
Comments: Published in ICML 2022

Abstract:Explaining graph neural networks (GNNs) has become increasingly important. Higher-order interpretation schemes, such as GNN-LRP (layer-wise relevance propagation for GNNs), have emerged as powerful tools for unraveling how different features interact, thereby contributing to explaining GNNs. GNN-LRP gives a relevance attribution of walks between nodes at each layer, and the subgraph attribution is expressed as a sum over exponentially many such walks. In this work, we demonstrate that such exponential complexity can be avoided. In particular, we propose novel algorithms that make it possible to attribute subgraphs with GNN-LRP in linear time (w.r.t. the network depth). Our algorithms are derived via message passing techniques that make use of the distributive property, thereby directly computing quantities for higher-order explanations. We further adapt our efficient algorithms to compute a generalization of subgraph attributions that also takes into account the neighboring graph features. Experimental results show the significant acceleration of the proposed algorithms and demonstrate the high usefulness and scalability of our novel generalized subgraph attribution method.

[LG-42] Towards Explainability of SLMs by investigating Token Level Activation

Link: https://arxiv.org/abs/2605.22377
Authors: Sayantani Ghosh, Rajashik Datta, Amit Kumar Das, Amlan Chakrabarti
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Transformer-based language models such as BERT, with 110M+ parameters, have revolutionized natural language understanding, yet their internal mechanisms remain largely opaque to researchers and practitioners. Traditional attention-based interpretability methods often emphasize structurally important but semantically weak tokens such as punctuation marks rather than meaningful semantic relationships. This work introduces a lightweight and model-agnostic framework for quantifying token-level representational importance using hidden-state activation strengths at Layer 8 of BERT. The proposed Activation Flow Network (AFN) framework computes Token Activation Strength using the L2 norm of Layer-8 hidden representations, enabling direct ranking of semantically salient tokens. The study further introduces a threshold-based activation bucket formulation that partitions tokens into HIGH-activation and LOW-activation groups using an empirical upper-quartile activation boundary. Experimental observations demonstrate that semantically meaningful content words consistently occupy the HIGH-activation bucket and dominate representational activation shifts, while structurally supportive tokens contribute comparatively less. The results suggest that Layer 8 acts as a critical semantic consolidation zone balancing structural and semantic information processing. By revealing how activation magnitudes concentrate around semantically informative tokens, this work provides an interpretable and computationally efficient alternative to attention-centric analysis, contributing toward transforming BERT from a “black box” into a more transparent “glass box” model for natural language understanding.
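
The two steps named in this abstract, an L2-norm activation strength per token and an upper-quartile HIGH/LOW split, can be illustrated as follows. This is a minimal sketch on made-up two-dimensional vectors, not real Layer-8 BERT hidden states; the nearest-rank quantile convention, the strict `>` comparison, and the function names are assumptions, since the abstract does not specify them.

```python
import math


def token_activation_strength(hidden_states):
    """L2 norm of each token's hidden vector."""
    return [math.sqrt(sum(x * x for x in vec)) for vec in hidden_states]


def bucket_tokens(tokens, strengths, quantile=0.75):
    """Split tokens into HIGH / LOW around a nearest-rank quantile threshold."""
    ordered = sorted(strengths)
    idx = max(0, math.ceil(quantile * len(ordered)) - 1)
    threshold = ordered[idx]
    high = [t for t, s in zip(tokens, strengths) if s > threshold]
    low = [t for t, s in zip(tokens, strengths) if s <= threshold]
    return high, low, threshold


# Toy stand-ins for per-token hidden states (real ones would be 768-dimensional).
tokens = ["the", "cat", ",", "sat"]
hidden = [[0.3, 0.4], [3.0, 4.0], [0.6, 0.8], [6.0, 8.0]]

strengths = token_activation_strength(hidden)
high, low, threshold = bucket_tokens(tokens, strengths)
```

With four tokens, the strict comparison against the 75th-percentile value leaves only the single strongest token in the HIGH bucket; a different quantile convention would move the boundary, so the split should be read as illustrative.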

[LG-43] Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

Link: https://arxiv.org/abs/2605.22376
Authors: Wei Liu, Ting Long
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Cross-domain offline reinforcement learning (CDRL) aims to improve policy learning in a target domain by leveraging data collected from a source domain. Existing works typically assess the transferability of source-domain data by measuring its similarity to target-domain transitions, and implicitly perform transition-level selection. Transitions that are considered similar are assigned higher weights or rewards, while dissimilar ones are down-weighted. However, transition-level similarity does not necessarily imply consistency in long-term returns. Even visually or dynamically similar transitions may lead to significantly different outcomes in the target domain, which can mislead policy learning and degrade performance. To address this issue, we revisit the fundamental objective of policy learning. Since policy optimization ultimately relies on Bellman targets to evaluate the quality of decisions, we propose to assess the transferability of source-domain transitions based on their alignment with target-domain Bellman targets, rather than superficial transition similarity. Based on this insight, we propose a method termed Target-Aligned Bellman Backup (TABB), which selectively leverages source-domain data by measuring their contribution to accurate Bellman target estimation in the target domain. We evaluate TABB across a broad range of cross-domain offline RL settings with highly limited target-domain data. Experimental results show that TABB consistently achieves strong performance.

[LG-44] ASAP: Attention Sink Anchored Pruning

Link: https://arxiv.org/abs/2605.22372
Authors: Jaehyuk Lee, Hanyoung Kim, Yanggee Kim, Donghun Lee
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Vision Transformers (ViTs) face severe computational bottlenecks due to the quadratic complexity of self-attention at high resolutions. Existing token reduction methods rely on local metrics - such as single-layer attention scores - that are inherently vulnerable to the attention sink phenomenon, where uninformative tokens are paradoxically preserved over salient foreground objects. We propose ASAP (Attention Sink Anchored Pruning), a training-free framework that recasts this sink as a feature. Modeling ViT information flow as a Lazy Random Walk, ASAP identifies the sink as a dominant accumulator of probability mass. By computing the diffusion distance to the sink within the cumulative transition matrix, ASAP partitions tokens via Radial Diffusion Clustering and compresses background redundancy through Transition Weight Pooling in a single shot. Extensive experiments across image, video, and vision-language tasks demonstrate ASAP outperforms state-of-the-art methods, accelerating throughput by up to 48% while maintaining - or even exceeding - baseline accuracy.

[LG-45] Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation ICML2026

Link: https://arxiv.org/abs/2605.22350
Authors: Fabian Morelli, Stephan Eckstein
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026

Abstract:Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuron-level similarity between different networks, where partial fusion then only aggregates weights of neurons which are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons cannot just be deleted, but also linearly combined. Finally, we show that generalized pruning applied to a single network yields similar benefits as partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity. Our code is available at this https URL.

[LG-46] A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

Link: https://arxiv.org/abs/2605.22341
Authors: Marcel Kühn,Yoon Thelge,Bernd Rosenow
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Comments: 20 pages, 7 figures

Abstract:Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $\Delta$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $\Delta$. As a function of the training time $\alpha$, the late-time solution yields an $\alpha^{-1/3}$ power law not only for the test loss but also for the generalization error $\epsilon_g$, i.e., one minus test accuracy. This is much slower than the $\alpha^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards an $\epsilon_g \sim \alpha^{-1/2}$ power law. Simulations support the predicted order parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.

[LG-47] From Snapshots to Trajectories: Learning Single-Cell Gene Expression Dynamics via Conditional Flow Matching

Link: https://arxiv.org/abs/2605.22340
Authors: Siyu Pu,Qingqing Long,Xiaohan Huang,Haotian Chen,Jiajia Wang,Meng Xiao,Xiao Luo,Hengshu Zhu,Yuanchun Zhou,Xuezhi Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Single-cell RNA sequencing (scRNA-seq) provides high-dimensional profiles of cellular states, enabling data-driven modeling of cellular dynamics over time. In practice, time-resolved scRNA-seq is collected at only a few discrete time points as unpaired snapshot populations, leaving substantial temporal gaps. This motivates trajectory inference at unmeasured time points. Existing methods mainly follow two directions: optimal-transport (OT) alignment provides distribution-level matching between observed snapshots, while continuous-time generative models support forecasting via learned dynamics. However, two challenges remain: (i) unpaired snapshots render local transitions between adjacent time points ambiguous, leading to unstable supervision; and (ii) long-horizon prediction relies on repeated integration, where small modeling errors compound and cause distribution drift. To address these challenges, we propose single-cell Flow Matching (scFM), a latent generative framework based on coupling-conditioned flow matching. First, we compute entropically regularized OT couplings between adjacent snapshots and use them to construct soft, weighted flow-matching targets for learning time-dependent velocity fields. Second, we learn bidirectional velocity fields and leverage their consistency to refine couplings and improve temporal coherence under sparse supervision. Third, we introduce distribution-level alignment and latent dynamic regularization to anchor long rollouts and mitigate drift. Experiments on real-world time-series scRNA-seq datasets show that scFM consistently improves distributional prediction performance for both temporal interpolation and extrapolation. Moreover, scFM yields more accurate trajectory reconstruction and temporally coherent visualizations where intermediate time points are absent, indicating a more faithful recovery of underlying temporal gene expression dynamics.
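
The first step described in the abstract, entropic OT couplings turned into weighted flow-matching targets, can be sketched with textbook Sinkhorn iterations and straight-line interpolation paths. This is a minimal illustration rather than the scFM implementation: the squared-distance cost, the regularization strength `eps`, and the function names are assumptions.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, iters=500):
    """Entropically regularized OT plan between histograms a and b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def flow_matching_targets(x0, x1, plan, t):
    """Weighted (point, velocity, weight) triples on straight-line paths."""
    targets = []
    for i, p in enumerate(x0):
        for j, q in enumerate(x1):
            x_t = [(1 - t) * pi + t * qi for pi, qi in zip(p, q)]
            vel = [qi - pi for pi, qi in zip(p, q)]
            targets.append((x_t, vel, plan[i][j]))  # weight = coupling mass
    return targets

# Two cells per snapshot in a 1-D latent space (made-up values).
x0 = [[0.0], [1.0]]
x1 = [[0.1], [1.2]]
cost = [[(p[0] - q[0]) ** 2 for q in x1] for p in x0]
plan = sinkhorn(cost, [0.5, 0.5], [0.5, 0.5])
targets = flow_matching_targets(x0, x1, plan, t=0.5)
```

A velocity network would then be regressed onto `vel` at `x_t` with each squared error weighted by the coupling mass, so ambiguous pairings contribute softly instead of through a hard assignment.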

[LG-48] Physics-Informed Generative Solver: Bridging Data-Driven Priors and Conservation Laws for Stable Spatiotemporal Field Reconstruction

Link: https://arxiv.org/abs/2605.22338
Authors: Ziyuan Zhu,Keyu Hu,Zhifei Chen,Yuhao Shi,Ming Bao,Jing Zhao,Gang Wang,Haitan Xu,Jiadong Li,Qijun Zhao,Xiaodong Li,Minghui Lu,Yanfeng Chen
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Reconstructing continuous physical fields from sparse measurements is a central inverse problem, but data-driven generative models can produce states that violate governing dynamics. We introduce a physics-informed generative solver that separates stable prior learning from inference-time enforcement of conservation laws. Martingale-Regularized Score Matching regularizes score pretraining with a Score Fokker-Planck constraint, yielding a dynamically stable prior. Physics-Informed Implicit Score Sampling then guides denoising trajectories by gradients of physical residuals, projecting samples toward admissible manifolds without retraining. In acoustics, the method co-generates pressure and particle velocity from sparse sensors, enabling dense virtual arrays that suppress spatial aliasing. The same framework generalizes to real-world ERA5 meteorological fields under extreme sparsity. Together, this work establishes a rigorous and generalizable paradigm for solving high-dimensional inverse problems, bridging the gap between generative artificial intelligence and first-principles science.

[LG-49] Learning Causal Orderings for In-Context Tabular Prediction

Link: https://arxiv.org/abs/2605.22335
Authors: Sascha Xu,Sarah Mameche,Jilles Vreeken
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:In-context learning for tabular data sets strong predictive standards in observational settings; it however primarily relies on correlational structure, which becomes unreliable under distribution shift or intervention. While established methods to discover causal structure exist, they are often focused on structure identifiability and decoupled from the predictive architectures that could benefit from them. To bridge these perspectives, we study how to simultaneously infer and enforce causal structure in the form of topological variable orderings into tabular prediction. Unlike standard architectures, our model TabOrder uses causal order-constrained attention, basing predictions only on features that precede a target under a learned causal order. Similar to causal discovery methods, TabOrder learns the optimal variable ordering in an unsupervised manner through a likelihood-based objective. We justify this choice under standard functional model classes and also study how sample missingness, a common challenge in tabular data, interacts with causal direction identification. Empirically, we confirm that TabOrder recovers accurate variable orderings while addressing prediction and imputation tasks, as well as gives insight into real-world biological data under intervention.

[LG-50] Riemannian geometry meets fMRI: the advantages of modeling correlation manifolds and eigenvector subspaces

Link: https://arxiv.org/abs/2605.22334
Authors: Mario Severino,Manuela Moretto,Robert A. McCutcheon,Mattia Veronese
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Correlation matrices are fundamental summaries of functional brain networks, yet standard analyses often treat entries independently, ignoring the curved geometry of correlation space. Existing geometric methods frequently lack closed-form operations or depend on arbitrary region ordering, limiting scalability. We introduce a scalable geometric framework with two components: (i) the Off-log metric, a smooth transformation mapping correlation matrices to symmetric zero-diagonal matrices. This enables closed-form expressions for distances, Frechet means, and linear models, allowing standard statistical modeling without complex manifold optimization. (ii) Grassmannian subspace discrimination, which compares subjects via principal-angle distances between eigenvector subspaces, resolving inherent sign and basis ambiguities. Both components integrate into standard machine-learning workflows for inference, regression, and classification. Validated across two clinical cohorts (Parkinson’s and psychosis) and three ageing fMRI datasets, the Off-log metric increased sensitivity in permutation tests and matched or exceeded Riemannian and Euclidean baselines in classification. Brain-age prediction performance was comparable, with Riemannian metrics excelling in two of three cohorts. The Grassmannian method consistently outperformed Euclidean baselines, highlighting disease-relevant networks. Overall, geometry-aware representations improve sensitivity and predictive performance while remaining straightforward to deploy at scale.

[LG-51] Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks ICML2026

Link: https://arxiv.org/abs/2605.22305
Authors: Stefan Huber,Hannes Unger,Georg Schäfer,Jakob Rehrl
Subjects: Machine Learning (cs.LG)
Comments: ICML 2026 Spotlight Paper

Abstract:We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simple, yet modern RL agents display a large gap to optimality. Motivated by the analysis of the optimal control, we introduce Chebyshev policies as a universal (i.e. dense) class of RL policies from first principles. They can be trained as drop-in replacements of neural nets, reducing the regret by a factor of 4.18, while requiring 277 times fewer parameters, fostering sample efficiency, explainability and realtime capability. Chebyshev policies are evaluated on further RL tasks, including a real-world nonlinear motion control testbed. They consistently improve performance over neural nets with PPO, ARS and REINFORCE. Our results demonstrate how Chebyshev policies offer a compelling and lightweight alternative or addition to neural nets for low-dimensional control tasks.
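
To make the idea of a polynomial policy concrete, the sketch below evaluates a tensor-product Chebyshev expansion over a two-dimensional state such as Mountain Car's position and velocity. It is an assumption-laden illustration, not the parameterization from the paper: the `tanh` squashing, the requirement that inputs are pre-scaled to [-1, 1], and the names `chebyshev_basis` and `chebyshev_policy` are all hypothetical.

```python
import math

def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{n+1} = 2x T_n - T_{n-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]

def chebyshev_policy(coeffs, position, velocity):
    """Action in [-1, 1] from a tensor-product Chebyshev expansion.

    coeffs[i][j] multiplies T_i(position) * T_j(velocity); both inputs are
    assumed to be pre-scaled to [-1, 1].
    """
    degree = len(coeffs) - 1
    tp = chebyshev_basis(position, degree)
    tv = chebyshev_basis(velocity, degree)
    raw = sum(coeffs[i][j] * tp[i] * tv[j]
              for i in range(degree + 1) for j in range(degree + 1))
    return math.tanh(raw)  # squash to the action range (assumption)
```

A degree-`n` policy of this form has `(n + 1) ** 2` trainable coefficients, which gives a feel for how such a policy can be far smaller than a neural network while remaining a dense function class.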

[LG-52] Long-term Fairness with Selective Labels

Link: https://arxiv.org/abs/2605.22291
Authors: Giovani Valdrighi,Isabel Valera,Marcos Medeiros Raimundo
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Long-term fairness algorithms aim to satisfy fairness beyond static and short-term notions by accounting for the dynamics between decision-making policies and population behavior. Most previous approaches evaluate performance and fairness measures from observable features and a label, which is assumed to be fully observed. However, in scenarios such as hiring or lending, the labels (e.g., ability to repay the loan) are selective labels as they are only revealed based on positive decisions (e.g., when a loan is granted). In this paper, we study long-term fairness in the selective labels setting and analytically show that naive solutions do not guarantee fairness. To address this gap, we then introduce a novel framework that leverages both the observed data and a label predictor model to estimate the true fairness measure value by decomposing it into the observed fairness and bias from label predictions. This allows us to derive sufficient conditions to satisfy true fairness from observable quantities by using the confidence in the predictor model. Finally, we rely on our theoretical results to propose a novel reinforcement learning algorithm for effective long-term fair decision-making with selective labels. In semisynthetic environments, the proposed algorithm reached comparable fairness and performance to an agent with oracle access to the true labels.

[LG-53] Adaptive Measurement Allocation for Learning Kernelized SVMs Under Noisy Observations

Link: https://arxiv.org/abs/2605.22275
Authors: Artur Miroszewski
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 9 figures

Abstract:Kernel methods are typically formulated under the assumption of exact, noise-free access to the Gram matrix. However, in emerging settings such as quantum machine learning, each kernel entry must be inferred from noisy observations, and its accuracy depends on how a limited measurement budget is allocated. Despite this, existing approaches overwhelmingly rely on uniform allocation, which equalizes estimator variance but ignores the highly non-uniform dependence of kernelized classifiers on the Gram matrix. In this work, we introduce an adaptive measurement-allocation strategy for learning kernelized Support Vector Machines (SVMs) from noisy Bernoulli observations. Our approach combines two complementary principles: (i) geometric sensitivity, capturing how perturbations of individual kernel entries affect the classifier margin, and (ii) active-set instability, quantifying the probability of discrete changes in support-vector membership induced by measurement noise. These signals define a task-aware allocation scheme that concentrates measurements on the most decision-critical regions of the kernel matrix. We provide a theoretical analysis showing that the benefit of adaptive allocation is governed by the heterogeneity of the induced kernel importance structure, leading to distinct regimes in which adaptive or uniform strategies are preferable. Empirical evaluations on synthetic datasets demonstrate that adaptive allocation significantly improves support-vector recovery, margin estimation, and decision-function accuracy under fixed measurement budgets. A dual-coefficient stability criterion further enables early stopping, achieving near-optimal performance while using only a fraction of the measurement cost. Additional experiments on quantum kernels derived from real-world data reveal a regime-dependent behavior aligned with known phenomena such as kernel concentration. Together…
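
The contrast between uniform and task-aware allocation can be illustrated with a small budget-splitting routine. This sketch assumes an importance score per kernel entry is already available. How the paper derives such scores from geometric sensitivity and active-set instability is not reproduced, and `allocate_shots` and `bernoulli_variance` are hypothetical names.

```python
def allocate_shots(importance, budget, floor=1):
    """Split an integer shot budget across kernel entries by importance."""
    n = len(importance)
    if budget < floor * n:
        raise ValueError("budget too small for the per-entry floor")
    spare = budget - floor * n
    total = sum(importance)
    if total <= 0:
        shares = [spare / n] * n  # no signal: fall back to uniform allocation
    else:
        shares = [spare * w / total for w in importance]
    shots = [floor + int(s) for s in shares]
    # Hand out shots lost to rounding, largest fractional part first.
    leftovers = max(0, budget - sum(shots))
    order = sorted(range(n), key=lambda k: shares[k] - int(shares[k]), reverse=True)
    for k in order[:leftovers]:
        shots[k] += 1
    return shots

def bernoulli_variance(p, shots):
    """Variance of the empirical mean of `shots` Bernoulli(p) outcomes."""
    return p * (1.0 - p) / shots
```

Because the estimator variance falls as `1 / shots`, moving measurements toward high-importance entries lowers their noise at the expense of entries that barely affect the classifier.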

[LG-54] Automatic Contextual Audio Denoising

Link: https://arxiv.org/abs/2605.22262
Authors: Diep Luong,Konstantinos Drossos,Mikko Heikkinen,Tuomas Virtanen
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but is noise for a phone call at the same location. Most current audio denoising systems apply fixed target-noise definitions, often removing useful components in one context while failing to suppress irrelevant components in another. To address this, we introduce the concept of automatic contextual audio denoising (ACAD), which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC) and events typical for that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes OC components, and benchmark it against variants: without context inference, with oracle context, and with separately provided uninformative context. On paired clean/noisy data across diverse contexts, where OC components in one context may be IC in another, our proposed method outperforms other approaches across standard objective metrics, indicating that the model can infer context and that context-dependent processing can enhance denoising.

[LG-55] No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation

Link: https://arxiv.org/abs/2605.22248
Authors: Bradley Stanley-Clamp,Anson Lei,Hannah M. Christensen,Ingmar Posner
Subjects: Machine Learning (cs.LG)
Comments: 36 pages, 12 figures

Abstract:Climate emulation is an out-of-distribution (OOD) projection task. This is precisely the challenge where modern Machine Learning (ML) methods are most prone to failure. Consequently, while current ML emulators trained on present climate achieve high in-distribution performance, their future reliability under the inevitable distribution shifts of a changing climate remains a critical, poorly understood blind spot. Addressing this challenge requires a fundamental shift in how we understand, evaluate, and design climate emulators. In this work, we first confirm that climate change drives a statistically significant and progressively growing shift in atmospheric state distributions, rendering standard evaluation protocols insufficient. We empirically establish that seasonal variation serves as an effective proxy for these long-term climate shifts, providing access to real-world distribution shifts without recourse to heuristics like synthetic perturbations. Motivated by this link, we introduce a novel evaluation framework that leverages seasonal shifts as a rigorous, zero-overhead testbed for emulator robustness. Our systematic characterisation confirms that current state-of-the-art hybrid-ML emulators degrade significantly under these realistic shifts. Finally, we chart a path forward by identifying compositional generalisation, the ability to form novel combinations from observed elementary components, as a principled route towards robust climate emulation. We demonstrate that physically motivated decompositions substantially improve OOD performance with only modest trade-offs against in-distribution performance, providing an avenue towards ML-driven climate emulators robust to an unknown future.

[LG-56] Decomposing Ensemble Spread in Lorenz 96 With Learned Stochastic Parameterizations

Link: https://arxiv.org/abs/2605.22242
Authors: Birgit Kühbacher,Daan Crommelin,Niki Kilbertus
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Weather and climate forecasts are inherently uncertain due to chaotic dynamics, imperfect initial conditions, and incomplete representation of the underlying physical processes. Operational ensemble forecasts aim to represent these uncertainties through forecast spread, yet many approaches yield underdispersive estimates, with spread that grows too slowly relative to forecast error. Using the two-scale Lorenz 1996 system as a widely used, controlled testbed, we design a systematic approach to disentangle intrinsic variability, initial-condition perturbations, and stochastic model uncertainty. We compare multiple ensemble configurations and parameterization strategies, including existing deterministic and autoregressive as well as novel Bayesian and flow-based approaches. Our results show that ensemble perturbations do not increase the system’s long-term variance; rather, they regulate how rapidly trajectories decorrelate and explore the invariant measure. Stochastic parameterizations, particularly those with temporally persistent structure, enhance early spread growth and improve spread-error consistency. Overall, we bring clarity to how different sources of uncertainty interact in a chaotic system and provide guidance for the design and evaluation of stochastic parameterizations in weather and climate models.
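
For readers unfamiliar with the testbed, the sketch below integrates the single-scale Lorenz 96 equations and measures ensemble spread. It is a simplification and not the paper's setup: the two-scale system and the learned stochastic parameterizations are not reproduced, and the function names are hypothetical.

```python
def l96_tendency(x, forcing):
    """dX_k/dt = -X_{k-1} (X_{k-2} - X_{k+1}) - X_k + F, with cyclic indices."""
    n = len(x)
    # Negative indices wrap around in Python, giving the cyclic boundary.
    return [-x[k - 1] * (x[k - 2] - x[(k + 1) % n]) - x[k] + forcing
            for k in range(n)]

def rk4_step(x, dt, forcing):
    """One classical fourth-order Runge-Kutta step."""
    k1 = l96_tendency(x, forcing)
    k2 = l96_tendency([a + 0.5 * dt * b for a, b in zip(x, k1)], forcing)
    k3 = l96_tendency([a + 0.5 * dt * b for a, b in zip(x, k2)], forcing)
    k4 = l96_tendency([a + dt * b for a, b in zip(x, k3)], forcing)
    return [a + dt / 6.0 * (p + 2 * q + 2 * r + s)
            for a, p, q, r, s in zip(x, k1, k2, k3, k4)]

def ensemble_spread(members):
    """Root-mean-square deviation of members from the ensemble mean."""
    n, m = len(members[0]), len(members)
    mean = [sum(mem[k] for mem in members) / m for k in range(n)]
    var = sum((mem[k] - mean[k]) ** 2
              for mem in members for k in range(n)) / (m * n)
    return var ** 0.5
```

Tracking `ensemble_spread` over time for members started from slightly perturbed initial conditions gives the spread curve that is compared against forecast error when checking for underdispersion.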

[LG-57] Decision-Aware Quadratic ReLU Replacement for HE-Friendly Inference

Link: https://arxiv.org/abs/2605.22237
Authors: Rui Li,Wenyuan Wu,Weijie Miao
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures

Abstract:Fully homomorphic encryption (FHE) supports only additions and multiplications, so FHE-only neural-network inference typically replaces ReLU with polynomials fitted over empirical activation intervals. Such interval fitting often requires higher-degree polynomials to control activation error, incurring homomorphic evaluation costs, while classification is determined by the final logit decision. We revisit ReLU replacement from a decision-aware perspective: given a trained single-hidden-layer ReLU MLP and a specified calibration set, can an HE-friendly low-degree polynomial replace ReLU without retraining while preserving calibration-set decisions? We focus on quadratic replacement, the lowest-degree choice that retains a genuine per-unit nonlinearity. For calibration sets positive-margin separable in the lifted space, we formulate quadratic replacement as a linear separation problem, yielding necessary and sufficient conditions for calibration-lossless replacement and a constructive algorithm for the coefficients. When the positive-margin condition fails – typically due to a few misclassified calibration samples – we extend the same geometric framework via reduced convex hulls and Lagrangian-dual soft-margin relaxations, which bound the influence of any single sample, converting the problem into smaller convex quadratic programs that yield approximately feasible coefficients with high empirical agreement on calibration-set decisions. In particular, at the maximal weight cap $\mu=1$, the reduced-convex-hull relaxation reduces to the convex-hull separation of the strictly separable case; the relaxation thus continuously extends the exact theory. Under CKKS, the quadratic replacement matches plaintext top-1 accuracy on multiple benchmarks, running 3.7–4.1$\times$ faster than Remez-7 in the activation module and 1.18–1.68$\times$ faster end-to-end.

[LG-58] Holomorphic Neural ODEs with Kolmogorov-Arnold Networks for Interpretable Discovery of Complex Dynamics

Link: https://arxiv.org/abs/2605.22235
Authors: Bhaskar Ranjan Karn,Dinesh Kumar
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 16 pages. Comments are welcome

Abstract:Complex dynamical systems governed by holomorphic maps such as $z^2 + c$ exhibit fractal boundaries with extreme sensitivity to initial conditions. Accurately modelling these structures from data requires methods that respect the underlying complex-analytic geometry, yet Multi-Layer Perceptrons (MLPs) within Neural Ordinary Differential Equations (Neural ODEs) lack complex-analytic priors, violate the Cauchy–Riemann conditions, and function as opaque approximators incapable of yielding governing equations. We introduce Holomorphic KAN-ODE, a framework that replaces the MLP with a Kolmogorov-Arnold Network (KAN) whose learnable B-spline activations reside on network edges, and incorporates Cauchy–Riemann equations as a differentiable regularization to preserve holomorphic structure. We evaluate on six families of complex dynamical systems spanning polynomial and transcendental classes. With only 280 parameters ($16\times$ fewer than the MLP baseline), the network achieves velocity-field $R^2$ 0.95 on all six systems, correctly identifies all six governing symbolic families through automatic spline-to-formula fitting, and reconstructs Julia set fractal boundaries with up to 98.0% agreement. Crucially, the model exhibits only 4% MSE degradation under 10% observation noise versus $15.2\times$ for MLPs, and achieves 90.4% improvement in transfer learning from quadratic to cubic dynamics. While the MLP attains lower pointwise reconstruction error due to its larger capacity, the KAN uniquely provides interpretable symbolic equations, enforced holomorphic structure, and superior noise resilience, capabilities that are entirely absent in black-box architectures. These results establish KANs as a parameter-efficient, interpretable alternative to MLPs for physics-informed discovery of holomorphic dynamics.

[LG-59] How Many Different Outputs Can a Transformer Generate? ICML2026

Link: https://arxiv.org/abs/2605.22223
Authors: Maxime Meyer,Mario Michelessa,Caroline Chaux,Vincent Y. F. Tan
Subjects: Machine Learning (cs.LG)
Comments: ICML 2026 Spotlight

Abstract:We study how we can leverage only a handful of characteristics of a transformer’s architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.

[LG-60] ARC-STAR: Auditable Post-Hoc Correction for PDE Foundation Models

Link: https://arxiv.org/abs/2605.22222
Authors: Chengze Li,Lingwei Wei,Li Sun,Hongbo Lv,Jie Yang,Hongrong Zhang,Kening Zheng,Wei-Chieh Huang,Enze Ma,Philip S. Yu
Subjects: Machine Learning (cs.LG)
Comments: 40 pages, including appendices

Abstract:Partial differential equation (PDE) foundation models are pretrained networks that forecast how physical fields like velocity and pressure evolve from a single reusable solver. On unfamiliar flows their predictions drift step by step, errors concentrate in a few regions, yet retraining destabilizes the network and uniform post-hoc correction overlooks this spatial concentration. To address this, we propose a frozen-solver post-hoc correction framework, Adaptive Risk-Calibrated Spatial Triage for Auditable Refinement (ARC-STAR). ARC-STAR organizes correction into three stages: a global corrector removes broad solver bias, a blockwise local refiner cleans the post-global residual, and, at deployment, a label-free score routes refinement to high-risk blocks under a compute budget. The framework is designed to be (i) frozen-host, preserving the pretrained solver without fine-tuning; (ii) auditable, with global and local stages trained and evaluated separately for measurable contributions; and (iii) budget-aware, using a blockwise interface that either refines the full field or routes limited compute to high-risk regions. Across five flow benchmarks spanning ten regime cells, ARC-STAR is the only method that cuts velocity rollout error by at least 36x over raw Poseidon on every cell. The global stage reduces raw host error by 91-99%, and the local stage further reduces the remaining post-global residual by up to 94.4%. Our code implementation is available at this https URL.

[LG-61] Kernel-Based Safe Exploration in Deep Reinforcement Learning

Link: https://arxiv.org/abs/2605.22207
Authors: Rupak Majumdar,Nikhil Singh,Sadegh Soudjani
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Accepted at L4DC Conference (22 Jan 2026)

Abstract:Safety has been a major concern when deploying deep reinforcement learning algorithms in the real world. A promising direction that ensures that the learned policy does not visit unsafe regions is to learn a barrier function along with the policy. A barrier is a function from states to reals that assigns low values to the initial states, high values to the unsafe states, and decreases in expectation on each transition; such a function can be used to bound the probability of reaching unsafe states. Previous attempts learned a barrier function directly from exploration data, but this required either large amounts of data or restrictions on the system dynamics. In this paper, we show how kernel embeddings can be used to learn barrier functions during deep reinforcement learning for stochastic systems with unknown dynamics. Our algorithm, kernel-based safe exploration (KBSE), learns an optimal policy and a barrier simultaneously during exploration. The barriers are computed iteratively, represented as conditional mean embeddings, and provide better probabilistic safety guarantees with more exploration. The exploration algorithm uses the learned barrier functions to identify safety violations. In the case of violation, it intervenes to modify the unsafe action to a safe action, thereby ensuring that the exploration is restricted to actions that bound the probability of reaching unsafe states. We evaluate KBSE on several complex continuous control benchmarks. Experimental results establish our new algorithm to be suitable for synthesizing control policies that are probabilistically safe without degradation in reward accumulation.

[LG-62] Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

Link: https://arxiv.org/abs/2605.22195
Authors: Manuel Noah Riesen,Peter Alfred von Niederhäusern
Subjects: Machine Learning (cs.LG)
Comments: 26 pages (including appendix), 16 figures

Abstract:Graph of Thoughts (GoT), a generalized form of recent prompting paradigms for large language models (LLMs), has been shown to be useful for elaborate problem solving. By executing a graph of operations, thoughts of the LLM are structured as an arbitrary graph, forming the actual graph of thoughts. Originally, the graph of operations is defined manually, which requires in-depth knowledge about the solution of the problem to solve. Such a static graph of operations is rigid and therefore lacks adaptability. We propose Reinforced Graph of Thoughts (RGoT), an automated approach to the GoT prompting paradigm that leverages reinforcement learning (RL) to adaptively generate a graph of operations from a human-defined set. Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task’s complexity in an automated way.

[LG-63] Bandit Convex Optimization with Gradient Prediction Adaptivity

Link: https://arxiv.org/abs/2605.22191
Authors: Shuche Wang,Adarsh Barik,Vincent Y. F. Tan
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:

Abstract:Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2$. We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big)$, where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt{d}$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.
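
The idea of an estimator whose variance tracks the prediction error rather than the gradient norm can be illustrated with a control-variate version of the standard two-point estimator. This is a sketch under stated assumptions and should not be read as the TP-VR-OPT estimator from the paper: the centering on the prediction `m` and the names `unit_vector` and `two_point_estimate` are hypothetical.

```python
import math
import random

def unit_vector(dim, rng):
    """Uniform random direction on the unit sphere."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def two_point_estimate(f, x, delta, rng, prediction=None):
    """Two-point gradient estimate, optionally centered on a prediction m."""
    d = len(x)
    m = prediction if prediction is not None else [0.0] * d
    u = unit_vector(d, rng)
    plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    # Directional-derivative estimate minus what the prediction already explains.
    residual = (plus - minus) / (2.0 * delta) - sum(mi * ui for mi, ui in zip(m, u))
    return [mi + d * residual * ui for mi, ui in zip(m, u)]

# Linear loss with a perfect prediction: the estimate equals the gradient.
a = [1.0, -2.0, 0.5]
loss = lambda z: sum(ai * zi for ai, zi in zip(a, z))
estimate = two_point_estimate(loss, [0.0, 0.0, 0.0], 0.1, random.Random(0), prediction=a)
```

With `prediction=None` this reduces to the usual two-point estimator, whose fluctuations scale with the gradient itself. With an accurate prediction only the residual term is randomized, which is the variance-reduction effect the abstract describes.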

[LG-64] From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs

Link: https://arxiv.org/abs/2605.22188
Authors: Jiachang Liu, Andrea Lodi
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU–GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).

[LG-65] IKNO: Infinite-order Kernel Neural Operators

Link: https://arxiv.org/abs/2605.22182
Authors: Pengyuan Zhu, Ivor W. Tsang, Yueming Lyu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Neural operators have achieved significant success in modern scientific computing due to their flexibility and strong generalization capabilities. Existing models, however, primarily rely on first-order kernel integral approximations, which severely limit their expressivity. To address this, we propose the Infinite-order Kernel Neural Operator (IKNO), which constructs neural operators via infinite-order kernel integrals and admits an elegant closed-form finite approximation. We develop two complementary infinite-order neural operator constructions: IKNO-Vanilla, which applies the full-kernel resolvent on the product grid via Kronecker eigendecomposition, and IKNO-TP, an alternative tensor-product operator that composes per-axis resolvents. Furthermore, we develop fast computation schemes for both variants of IKNO, which achieve outstanding global information aggregation while maintaining high computational efficiency. Empirically, we evaluate IKNO on both time-dependent and time-independent benchmarks with arbitrary input shapes, including large-scale industrial datasets. Extensive experiments demonstrate that IKNO consistently achieves state-of-the-art accuracy, with significant improvements on nearly all benchmark datasets, while maintaining scalability to very large point clouds.

[LG-66] Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

Link: https://arxiv.org/abs/2605.22164
Authors: Liangyu Li, Shengzhi Wang, Qingwen Liu
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 26 pages, 7 figures

Abstract:Latent world models can contain the state needed for control, yet their terminal-cost interface can expose the planner to the wrong decision-relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability-relevant variables correctly. We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations to match the long-horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full-horizon TRM reaches 97.0%; shuffled temporal-label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short-horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY-probe rowspace accounts for less than 1% of terminal-goal latent MSE but carries most candidate-quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than closed-loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner-facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.

[LG-67] Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines

Link: https://arxiv.org/abs/2605.22155
Authors: David Mendez, Fernando Martin-Maroto, Gonzalo G. de Polavieja
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

Abstract:Symbolic methods are generally not considered competitive with strong modern learners on realistic supervised tasks. We evaluate Algebraic Machine Learning (AML), a framework that learns through subdirect decomposition of algebraic structure rather than numerical optimization, against standard baselines on image and tabular classification across varying training-set sizes. We find that AML trained only on training data without using validation or cross-validation outperforms a family of cross-validated baseline methods including CNNs on small to medium image datasets (50–2000 training examples). On tabular datasets in the same size range, XGBoost is overall the best performing method, but AML is nonetheless comparable to methods incorporating task-specific biases such as LightGBM and random forests. AML achieves this competitive performance across two very different types of datasets using a generic algebraic inductive bias, rather than the modality-specific biases built into standard baselines like CNNs for images or XGBoost for tabular data, and requires no cross validation because it has no task-dependent hyperparameters to tune.

[LG-68] Aerodynamic force reconstruction using physics-informed Gaussian processes

Link: https://arxiv.org/abs/2605.22111
Authors: Gledson Rodrigo Tondo, Igor Kavrakov, Guido Morgenthal
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
Comments:

Abstract:Accurate modeling of aerodynamic loads is essential for understanding and predicting the responses of complex structural systems. However, these models often rely on simplifications of the true physical forces, introducing assumptions that can limit their accuracy. Validating such models becomes particularly challenging in the presence of noisy or incomplete data. To address this, we introduce a probabilistic physics-informed machine learning approach designed to reconstruct the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The model avoids overfitting, eliminates the need for regularization schemes, and allows for the use of heterogeneous and multi-fidelity data during the training process. The efficacy of the approach is demonstrated through the reconstruction of aerodynamic loads on the Great Belt East Bridge, simulated under a linear unsteady assumption. Results show strong agreement between true and predicted loads in terms of root mean squared error, magnitude, phase angle, and peak values of the signals. The load reconstruction method is broadly applicable, for example to model validation, future load estimation, and structural damage prognosis.

[LG-69] RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching INTERSPEECH2026

Link: https://arxiv.org/abs/2605.22083
Authors: Jinhyeok Yang, Hyeongju Kim, Yechan Yu, Joon Byun, Frederik Bous, Juheon Lee
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to INTERSPEECH 2026

Abstract:While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: this https URL

[LG-70] CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

Link: https://arxiv.org/abs/2605.22082
Authors: Wentian Wang, Chutong Wen, Hongxu Ma, Wuhao Wang, Zhexiong Xue, Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun, Jianqiao Zhu
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:We present CoRMA (Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.

[LG-71] Can Breath Biomarkers Causally Influence Blood Glucose? Investigating VOC-Mediated Modulation in Diabetes

Link: https://arxiv.org/abs/2605.22075
Authors: Varsha Sharma, Prasanta K. Guha, Avik Ghose
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Diabetes is a global health burden, and early detection is critical for timely intervention. This study explores a non-invasive, data-driven framework to identify individuals at risk of diabetes using Volatile Organic Compounds (VOCs) and lifestyle variables. We use causal inference techniques to estimate the impact of VOCs such as acetone, isopropanol, isoprene, and ethanol on blood glucose levels. Additionally, we designed a classifier to distinguish diabetics from non-diabetics using non-invasive markers. We created a risk-based ranking system for individuals in the “gray zone,” and identified natural clusters in the population using a Gaussian Mixture Model. Our results suggest that specific VOCs exhibit a strong causal influence on glucose levels and that machine learning models can reliably classify and stratify individuals at high risk. This integrated causal-explainable analysis can support the development of tools for non-invasive early screening of diabetes.

[LG-72] CASE-NET: Deep Spatio-Temporal Representation Learning via Causal Attention and Channel Recalibration for Multivariate Time Series Classification

Link: https://arxiv.org/abs/2605.22043
Authors: Fan Zhang, Yating Cui, Hua Wang
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 6 figures, 2 tables

Abstract:Multivariate time series (MTS) classification is foundational to pervasive computing and financial analysis, yet existing multi-scale paradigms are often constrained by suboptimal representation fidelity. We identify two critical bottlenecks: temporal non-causality in standard encoders that induces temporal confounding in non-stationary dynamics, and the absence of explicit channel saliency mechanisms that allows noise to contaminate the latent space. To address these challenges, we propose the Causal Attention and Spatio-temporal Encoder Network (CASE-NET), an architecture designed for structural manifold pre-conditioning. CASE-NET synergizes a Causal Temporal Encoder, which enforces physical arrow-of-time constraints via masked self-attention and causal convolutions, with an Adaptive Channel Recalibration module functioning as an information bottleneck to suppress detrimental noise. Comprehensive evaluations across six heterogeneous domains demonstrate that CASE-NET establishes new state-of-the-art benchmarks on four tasks, achieving a peak accuracy of 98.6% on the AWR dataset and superior robustness in non-stationary regimes.
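
The arrow-of-time constraint described above can be illustrated with two small pure-Python helpers: a lower-triangular attention mask and a left-padded 1D convolution. This is a generic sketch of causal masking and causal convolution, not the CASE-NET implementation; the function names are hypothetical.

```python
def causal_mask(length):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(length)] for t in range(length)]


def causal_conv1d(x, weights):
    """y[t] = sum_k weights[k] * x[t - k], treating x before time 0 as zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            if t - k >= 0:  # never read samples from the future
                acc += w * x[t - k]
        out.append(acc)
    return out
```

Because each output at time t reads only inputs at times up to t, changing later samples leaves earlier outputs untouched, which is the property a causal encoder relies on.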

[LG-73] RADAR: Defending RAG Dynamically against Retrieval Corruption

Link: https://arxiv.org/abs/2605.22041
Authors: Ziyuan Chen, Yueming Lyu, Yi Liu, Weixiang Han, Jing Dong, Caifeng Shan, Tieniu Tan
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:While RAG systems are increasingly deployed in dynamic web search, temporal volatility amplifies their vulnerability to adversarial attacks. Existing static-oriented defenses struggle to handle evolving threats and incur prohibitive storage costs in dynamic settings. We propose RADAR, a framework that models reliable context selection as a graph-based energy minimization problem, solved exactly via Max-Flow Min-Cut. By incorporating a Bayesian memory node, RADAR recursively updates a belief state instead of archiving raw historical documents, effectively balancing stability against attacks with adaptability to genuine knowledge shifts. Experiments on a novel dynamic dataset show that RADAR achieves superior robustness and response quality with minimal storage overhead compared to the baselines.
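
To make the phrase "graph-based energy minimization solved exactly via Max-Flow Min-Cut" concrete, here is a minimal sketch of binary keep/drop labeling of retrieved documents. The energy terms (`keep_cost`, `drop_cost`, `agreement`) and the function name are assumptions chosen for illustration; the abstract does not specify RADAR's actual graph, and the Bayesian memory node is not modeled here.

```python
from collections import deque


def min_cut_select(keep_cost, drop_cost, agreement):
    """Return (indices of kept documents, minimum energy).

    keep_cost[i] / drop_cost[i]: penalty for keeping / dropping document i.
    agreement[(i, j)]: penalty paid when documents i and j get different labels.
    """
    n = len(keep_cost)
    source, sink = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        cap[source][i] = drop_cost[i]  # cut when i ends on the sink (drop) side
        cap[i][sink] = keep_cost[i]    # cut when i ends on the source (keep) side
    for (i, j), w in agreement.items():
        cap[i][j] += w
        cap[j][i] += w

    def residual_bfs():
        """Breadth-first search over edges with remaining capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = residual_bfs()
        if sink not in parent:
            break  # no augmenting path left: flow equals the min cut
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
    kept = sorted(i for i in parent if i < n)  # nodes still reachable from source
    return kept, flow
```

In this toy formulation, a document that looks mildly unreliable on its own can still be kept if it strongly agrees with a reliable one, because separating the two would cost more than keeping both.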

[LG-74] Toward Understanding Adversarial Distillation: Why Robust Teachers Fail ICML2026

Link: https://arxiv.org/abs/2605.21999
Authors: Hongsin Lee, Hye Won Chung
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026. Code is available at this https URL

Abstract:Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher’s soft labels within the min-max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student’s robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher’s supervisory confidence and the student’s representational limitations on a consistent subset of training data – the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real-image classification datasets, confirming that robust overfitting is driven by the teacher’s interaction with unlearnable samples. Finally, we demonstrate that a teacher’s predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.
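
The teacher-selection indicator mentioned at the end of the abstract, predictive entropy on a subset of samples, can be computed directly from the teacher's output probabilities. In this sketch the subset indices stand in for the Robustly Unlearnable Set and are assumed to be given; how the paper identifies that set is not shown, and the function names are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one categorical prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(teacher_probs, subset):
    """Average teacher entropy over the sample indices in `subset`."""
    return sum(predictive_entropy(teacher_probs[i]) for i in subset) / len(subset)
```

A fully confident prediction has entropy 0, while a uniform prediction over K classes has entropy log K, so a higher mean on the chosen subset corresponds to the "high uncertainty" teacher described in the abstract.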

[LG-75] Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs

Link: https://arxiv.org/abs/2605.21975
Authors: Jialin Chen, Aosong Feng, Harshit Verma, Siyi Gu, Haiwen Wang, Ali Maatouk, Yixuan He, Yifeng Gao, Leandros Tassiulas, Rex Ying
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Financial markets are characterized by extreme non-stationarity, low signal-to-noise ratios, and strong dependence on external information such as news, company fundamentals, and macroeconomic signals. Yet, existing approaches either abstract time-series into text or decouple forecasting from language-based reasoning, leading to a fundamental mismatch between qualitative reasoning and quantitative outcomes. To address this, we introduce StockR1, a time-series-enhanced LLM that unifies stock forecasting and financial reasoning through a verifiable forecast action. Based on a tool-call design, the model first emits a forecast action, which is a structured and interpretable representation of its qualitative market outlook. It then invokes a time-series decoder conditioned on this action to generate distributional future trajectories, leading to more informed question answering and financial reasoning. We optimize the full pipeline with reinforcement learning, where rewards jointly reflect answer validity, forecast accuracy, and consistency between generated actions and observed time-series dynamics. In addition, rewards are reweighted by a sample-level uncertainty scalar, encouraging the model to accommodate varying uncertainty in market dynamics. We evaluate StockR1 on financial question answering and stock forecasting over a large-scale 10-year benchmark. Our method consistently outperforms time-series baselines and general-purpose LLMs, improving reasoning accuracy by 17.7% (4B) and 25.9% (8B). These findings demonstrate that structuring the forecast actions establishes a powerful synergy between language reasoning and temporal prediction, enabling LLMs to reason through verifiable, interpretable, and numerically grounded decisions.

[LG-76] How Sparsity Allocation Shapes Label-Free Post-Pruning Recoverability

Link: https://arxiv.org/abs/2605.21972
Authors: Qishi Zhan, Minxuan Hu, Liang He
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Unstructured magnitude pruning at high sparsity can reduce neural network accuracy to near-random performance, while labeled retraining may be unavailable in practical deployment settings. Label-free post-pruning repair methods can partially recover collapsed sparse models, but their effectiveness depends on the sparse model left by the upstream pruning allocation. This paper studies how sparsity allocation shapes post-repair recoverability under a fixed activation-statistic repair backend. We compare ERK and LAMP allocations under the same label-free repair protocol across CIFAR-10, CIFAR-100, and Imagenette with ResNet-18, ResNet-34, and ResNet-50 at sparsities from 90% to 95.5%. The results show that allocation choice can substantially change post-repair accuracy at the same global sparsity, and that the preferred allocation varies with architecture, dataset difficulty, and sparsity level. We identify a repair-sensitive transition regime in which BatchNorm recalibration begins to fail, while activation-statistic repair still recovers nontrivial accuracy. Additional validation on ImageNet-100 and DenseNet-121 shows that the location and width of this recoverable regime depend on data scale and connectivity structure. These findings suggest that pruning allocation and post-pruning repair should be studied jointly, since the allocation determines how much activation signal remains available for label-free recovery.
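
For readers unfamiliar with what a sparsity allocation looks like, the sketch below computes per-layer densities with the ERK (Erdős-Rényi-Kernel) rule as it is commonly described in the sparse-training literature: each layer's density is proportional to the sum of its dimensions divided by their product, rescaled to meet a global budget. The function name and the clamping loop are illustrative assumptions, not code from the paper, and LAMP is not covered.

```python
import math


def erk_densities(layer_shapes, target_density):
    """Per-layer keep ratios under an ERK-style allocation.

    layer_shapes: weight shapes such as (out, in, kh, kw) or (out, in).
    target_density: global fraction of weights to keep (1 - sparsity).
    """
    sizes = [math.prod(s) for s in layer_shapes]
    raw = [sum(s) / math.prod(s) for s in layer_shapes]
    dense = set()  # layers whose scaled density would exceed 1 stay fully dense
    while True:
        budget = target_density * sum(sizes) - sum(sizes[i] for i in dense)
        divisor = sum(raw[i] * sizes[i] for i in range(len(sizes)) if i not in dense)
        scale = budget / divisor
        overflow = [i for i in range(len(sizes))
                    if i not in dense and raw[i] * scale > 1.0]
        if not overflow:
            break
        dense.update(overflow)
    return [1.0 if i in dense else raw[i] * scale for i in range(len(sizes))]
```

With shapes `(64, 3, 3, 3)`, `(128, 64, 3, 3)`, and `(10, 128)` at 10% global density, the small final layer stays fully dense while the large middle layer absorbs most of the pruning, which illustrates how an allocation can change what a downstream repair step has to work with.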

[LG-77] An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning

Link: https://arxiv.org/abs/2605.21968
Authors: Saurabh Saini, Kapil Ahuja, Thomas Wick, Saurav Kumar
Categories: Machine Learning (cs.LG)
Comments: 11 Pages, Double Column, 6 Tables, 5 Figures

Abstract:Optimization is essential in deep learning. The foundational method upon which most optimizers are built is momentum-based stochastic gradient descent. However, it suffers from two key drawbacks. First, it has noisy and varying gradients, and second, it has an overshoot phenomenon. To address noisy gradients, Adam was proposed, which remains the most widely used adaptive optimizer. To address the overshoot phenomenon, a control-theory-based PID optimizer was proposed. To tackle both limitations within a single framework, several variants of Adaptive PID (AdaPID) have recently been proposed. Although AdaPID performs well, it still inherits two critical drawbacks from Adam, namely convergence and stability issues. In this work, we address both these limitations. To fix the convergence issue, we uniquely integrate the idea of using a non-increasing effective learning rate into AdaPID (originally proposed in AMSGrad, an extension of Adam). To fix the stability issue, we innovatively integrate a gradient difference based modulation factor into AdaPID (originally proposed in DiffGrad, another extension of Adam). Combining both ideas in AdaPID results in our novel IAdaPID-ADG optimizer. We evaluate our proposed optimizer on multiple datasets, including benchmark datasets (MNIST and CIFAR10) and real-world datasets (IARC and AnnoCerv). The IAdaPID-ADG substantially outperforms all competing optimizers. Additionally, we perform an ablation study on the MNIST dataset to demonstrate the contribution of each added component.
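
The two borrowed ingredients named in the abstract can be shown on a scalar problem. The sketch below combines an Adam-style update with an AMSGrad-style running maximum of the second-moment estimate and a DiffGrad-style sigmoid factor on the gradient change. It is an assumption-laden illustration: the PID terms of AdaPID are omitted, the maximum is taken over the bias-corrected estimate (variants differ), and the function name and defaults are hypothetical rather than IAdaPID-ADG itself.

```python
import math


def amsgrad_diffgrad_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999,
                              eps=1e-8, steps=2000):
    """Minimize a scalar function given its gradient; returns the final x."""
    x = x0
    m = v = v_max = 0.0
    prev_grad = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        # AMSGrad idea: the denominator never shrinks.
        v_max = max(v_max, v_hat)
        # DiffGrad idea: sigmoid of the absolute gradient change, in [0.5, 1).
        friction = 1.0 / (1.0 + math.exp(-abs(prev_grad - g)))
        x -= lr * friction * m_hat / (math.sqrt(v_max) + eps)
        prev_grad = g
    return x
```

Calling `amsgrad_diffgrad_minimize(lambda x: 2 * (x - 3), 0.0)` drives `x` toward the minimizer of `(x - 3) ** 2`. The friction factor damps steps when consecutive gradients are nearly equal and lets larger steps through when the gradient is changing quickly.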

[LG-78] Dynamic Mixture of Latent Memories for Self-Evolving Agents

Link: https://arxiv.org/abs/2605.21951
Authors: Dianzhi Yu, Vireo Zhang, Hongru Wang, Yanyu Chen, Minda Hu, Wanghan Xu, Siki Chen, Philip Torr, Zhenfei Yin, Irwin King
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 5 figures, 5 tables

Abstract:Achieving self-evolution in intelligent agents requires the continual accumulation of new knowledge across changing task sequences without forgetting previously acquired abilities. Existing approaches either internalize knowledge by updating model parameters, which induces catastrophic forgetting, or rely on external memory, which fails to genuinely enhance the model’s intrinsic capabilities. We propose MoLEM, a generative mixture of latent memory framework based on a dynamic mixture-of-experts (MoE). We treat multiple experts as independent carriers to generate memory. A router selects and weights experts through key-query matching, and the aggregated latent memory is injected into the reasoning process. The base model for reasoning remains entirely frozen, with all experiential knowledge internalized into the additional modules, avoiding catastrophic forgetting. For continual learning, each training stage is paired with a lightweight autoencoder that selects the appropriate routing group at inference, and inputs that match no stage fall back to the pretrained model. Experiments train the framework on continual-learning sequences spanning math, science, and code domains. After training, we evaluate the framework on the corresponding test sets to measure task learning and competence preservation across continual adaptation stages. After the full continual-learning sequence, our method improves the average accuracy by 10.40% over the Vanilla pretrained baseline, while none of the competing methods consistently exceed this baseline across different training orders.
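
The routing step described above (select and weight experts by key-query matching, then aggregate their latent memories) can be sketched as a top-k softmax over dot products. This is a generic mixture-of-experts illustration under assumed shapes; the function name, the `top_k` parameter, and the use of plain dot-product scores are hypothetical and not taken from MoLEM.

```python
import math


def route(query, expert_keys, expert_memories, top_k=2):
    """Return (softmax weights of the selected experts, aggregated memory)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in expert_keys]
    # Keep the top_k best-matching experts (ties keep the lower index first).
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    top = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - top) for i in chosen}  # stable softmax
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    dim = len(expert_memories[0])
    memory = [sum(weights[i] * expert_memories[i][d] for i in chosen)
              for d in range(dim)]
    return weights, memory
```

For example, with query `[1, 0]`, keys `[[1, 0], [0, 1], [1, 0]]`, and memories `[[2, 0], [0, 5], [4, 0]]`, experts 0 and 2 match equally and the aggregated memory is their average.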

[LG-79] SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization NEURIPS2026

Link: https://arxiv.org/abs/2605.21948
Authors: Xucheng Yu, Haibo Jin, Huimin Zeng, Haohan Wang
Categories: Machine Learning (cs.LG)
Comments: 20 pages, NeurIPS 2026 submission

Abstract:LLM-based ranking systems are vulnerable to Generative Engine Optimization (GEO) attacks, where adversaries inject semantic signals into product descriptions to artificially boost rankings. We propose SCI-Defense, a three-component defense framework combining Perplexity detection (PPL), Semantic Integrity Scoring (SIS), and Inter-Candidate Detection (ICD). SIS evaluates four manipulation dimensions: Authority Attribution (AA), Narrative Purposiveness (NP), Comparative Claims (CA), and Temporal Claims (TC). Evaluated on 600 Amazon product descriptions across 6 categories, SCI-Defense achieves Precision=1.000 and FPR=0.000, with Recall of 1.000, 0.952, and 0.830 against String, Reasoning, and Review attacks respectively. On 600 MS MARCO web passages, String attacks are blocked with perfect recall while Review attacks yield near-zero recall, as web passages lack the persuasion-oriented signals that SIS targets in product descriptions. We demonstrate that existing defenses – PPL-only filters, SafetyClf content classifiers, and paraphrasing – achieve zero recall against semantic manipulation attacks. We further demonstrate that new attacks such as Specification Amplification and Use-Case Saturation expose semantic relevance manipulation as a structural blind spot in the defense, which suggests directions for future research.

[LG-80] Optimal Guarantees for Auditing Rényi Differentially Private Machine Learning

Link: https://arxiv.org/abs/2605.21938
Authors: Benjamin D. Kim, Lav R. Varshney, Daniel Alabi
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Comments: 28 pages, 3 figures

Abstract:We study black-box auditing for machine learning algorithms that claim Rényi differential privacy (RDP) guarantees. We introduce an auditing framework, based on hypothesis testing, that directly estimates Rényi divergence between neighboring executions using the Donsker-Varadhan (DV) variational estimator. Our analysis yields explicit and non-asymptotic confidence intervals for RDP auditing via class-restricted DV estimators, separating statistical estimation error from algorithmic privacy leakage. We prove matching minimax lower bounds showing that, up to logarithmic factors, our sample-complexity guarantees are information-theoretically optimal, thereby establishing the first optimal guarantees for auditing RDP via DV estimators. Empirically, we instantiate our framework for auditing DP-SGD in a fully black-box setting. Across MNIST and CIFAR-10, and over a wide range of privacy regimes, our auditors produce a strong overall improvement on empirical RDP lower bounds compared to prior state-of-the-art black-box methods, especially at small and moderate Rényi orders where accurate auditing is most challenging.
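
As background for what an RDP auditor is trying to lower-bound, the sketch below evaluates the Rényi divergence of order alpha between two equal-variance Gaussians, once with the known closed form alpha * (mu_p - mu_q)^2 / (2 * sigma^2) and once by numerical integration of the definition. It does not implement the Donsker-Varadhan estimator from the paper; the function names and integration grid are illustrative assumptions.

```python
import math


def renyi_gaussian_closed_form(mu_p, mu_q, sigma, alpha):
    """D_alpha(N(mu_p, sigma^2) || N(mu_q, sigma^2)) in closed form."""
    return alpha * (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)


def renyi_numeric(mu_p, mu_q, sigma, alpha, lo=-20.0, hi=20.0, n=40000):
    """Midpoint-rule estimate of log(integral p^alpha q^(1-alpha)) / (alpha - 1)."""
    def log_pdf(x, mu):
        return (-0.5 * ((x - mu) / sigma) ** 2
                - math.log(sigma * math.sqrt(2.0 * math.pi)))

    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        # Work in log space so the far tails underflow to zero safely.
        total += math.exp(alpha * log_pdf(x, mu_p) + (1.0 - alpha) * log_pdf(x, mu_q))
    return math.log(total * step) / (alpha - 1.0)
```

A mechanism is (alpha, epsilon)-RDP when this divergence between its output distributions on any two neighboring datasets is at most epsilon. An auditor has only samples from the mechanism and must estimate the quantity that the closed form gives here for free.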

[LG-81] CCLab: Adversarial Testing of Learning- and Non-Learning-Based Congestion Controllers

Link: https://arxiv.org/abs/2605.21915
Authors: Zhi Chen, Shehab Sarar Ahmed, Chenkai Wang, Brighten Godfrey, Gang Wang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages for main paper, 16 pages in total

Abstract:Congestion controllers (CCs) are critical to network performance, and yet their robustness under adverse conditions remains insufficiently understood. While recent learning-based CCs have demonstrated strong performance in controlled environments, it is unclear how they compare to traditional CCs when controllers’ input signals are corrupted or when environmental conditions become systematically challenging. In this paper, we introduce CCLab, an adversarial testing framework for systematically evaluating the robustness of both learning-based and non-learning-based CCs. CCLab includes a reinforcement learning (RL)-based adversarial agent that operates in a closed loop with the congestion control policy, generating bounded perturbations either on input signals (feature-level) or on external network conditions (environment-level), while preserving realism through explicit constraints. Using this framework, we compare learning-based CCs with non-learning-based CCs under both feature-level and environment-level adversarial conditions. While both types of CCs suffer from performance degradation under adversarial testing, we find that learning-based CCs, in general, are more robust than traditional human-designed algorithms. Finally, we show that our adversarial traces can be used to train more robust CCs that outperform existing learning-based CCs under both challenging and normal conditions.

[LG-82] Noise Schedule Design for Diffusion Models: An Optimal Control Perspective

Link: https://arxiv.org/abs/2605.21911
Authors: Seo Taek Kong, Weina Wang, R. Srikant
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We develop a principled framework for analyzing and designing noise schedules in diffusion models. We show that one can recast this design problem as an optimal control problem whose state is the Fisher information of the diffusion process, which evolves according to an ODE, and whose control input is the noise schedule. The objective of the optimal control problem is a functional involving the Fisher information, which is shown to be an upper bound on the Kullback-Leibler sampling error. By solving this optimal control problem, we obtain sufficient conditions on noise schedules under which state-of-the-art $\tilde{\mathcal{O}}(d/n)$ sampling error is achievable, where $d$ is the data dimension and $n$ is the number of discretization steps. While existing theoretical works also prove that $\tilde{\mathcal{O}}(d/n)$ sampling error bounds are achievable, these results hold for specific noise schedules, which do not include the schedules used in practice. Under a further parametric assumption on the data distribution, we show that one can obtain closed-form expressions for the noise schedules. These noise schedules generalize standard empirical schedules such as exponential and sigmoid schedules by allowing additional parameters that can be tuned. Systematically tuning the parameters of these schedules yields new schedules that achieve superior FID scores on image generation benchmarks.

[LG-83] When to Switch Not Just What: Transition Quality Prediction in Clash Royale

Link: https://arxiv.org/abs/2605.21868
Authors: Heeyun Heo, Huy Kang Kim
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, 4 tables; Accepted at IEEE Conference on Games (CoG) 2026

Abstract:In competitive games, players frequently switch strategies after losing streaks, yet our analysis of 926,334 match records from 34,619 Clash Royale players reveals a counterintuitive pattern: switching frequency is inversely associated with the win rate, with effects that vary substantially across players and situational contexts. We attribute this to a limitation common in many prior recommendation systems, which evaluate strategies by expected quality while overlooking the behavioral cost of switching and individual differences in switching propensity. We refer to this implicit premise as the Zero Switching Cost Assumption. To address this, we reformulate strategy recommendation as a transition-level decision problem and instantiate it as TQP (Transition Quality Predictor), a three-stage pipeline structured as Who - When - What. PersonaGate suppresses recommendations for players whose strategic consistency is empirically associated with superior outcomes. TimingGate identifies moments when switching is likely to yield a net benefit over staying, using a subtype- and state-matched baseline to control for natural win-rate recovery. ScoreFusion ranks candidate strategies by combining an adoptability signal with predicted transition quality (delta WR). We further introduce SwitchGap, an evaluation metric that measures a policy’s discriminative quality without treating observed player choices as optimal ground truth. This property is particularly important because the most frequent switchers record the lowest win rates. The full pipeline achieves a SwitchGap of +10.4 percentage points at a recommendation rate of 5.4%, and loss-triggered switchers, despite being the lowest-performing group, benefit the most from subtype-conditioned guidance.

[LG-84] On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

Link: https://arxiv.org/abs/2605.21834
Authors: Andy Han, Kristina Fujimoto, Avidan Shah, Kiet Nguyen, Kai Xu, Chen Yueh-Han, Ilia Sucholutsky, Rico Angell
Categories: Machine Learning (cs.LG)

Abstract:Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such failures by training invariants into the model using contrastive input pairs. Existing consistency training procedures generate the supervision signal once, offline, and use supervised fine-tuning (SFT) to update the model. Unfortunately, the resulting models tend to merely memorize the surface forms of the training distribution and thus generalize poorly and regress in their capabilities. We introduce On-Policy Consistency Training (OPCT), a new consistency training approach where the objective is computed over the model’s own responses to prompts, supervised by itself conditioned on corresponding contrastive prompts. We evaluate OPCT on three safety axes: sycophancy, jailbreaking, and safety awareness. Across three model families, OPCT outperforms its SFT counterpart on all safety desiderata. It nearly halves the sycophancy rate relative to baseline (8.1% vs. 15.4%, compared to 11.2% for SFT). Under an adaptive per-target attacker, OPCT holds jailbreak defense success near 99% on held-out jailbreak behaviors, whereas SFT achieves 87% on average. On safety awareness, OPCT outperforms SFT in two out of three models, and matches it on the other. OPCT also largely avoids the capability regressions that SFT induces, such as a 28-point drop on MATH-500. Our results suggest that consistency training is best implemented as OPCT rather than as SFT, especially when generalization beyond the training distribution is desired.

[LG-85] Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale

Link: https://arxiv.org/abs/2605.21820
Authors: Ralph Bulanadi, Jefferey Baxter, Arpan Biswas, Hiroshi Funakubo, Dennis Meier, Jan Schultheiß, Rama Vasudevan, Yongtao Liu
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Self-driving laboratories or autonomous experimentation are emerging as transformative platforms for accelerating scientific discovery. Bayesian optimization (BO) is among the most widely used machine learning frameworks for these purposes, but these BO-based frameworks rely on predefined scalar descriptors to guide experimentation. In many situations, the determination of an appropriate scalar descriptor can be challenging, and may fail to capture subtle yet scientifically important phenomena apparent to experts with interdisciplinary insight. To overcome this limitation, here we develop deep-kernel pairwise learning (DKPL), an approach for autonomous microscopy experiments which incorporates human expertise and interdisciplinary scientific knowledge into an active learning loop. Instead of relying on explicit scalar objectives, DKPL enables experts to directly evaluate which experimental output is more promising using interdisciplinary knowledge. DKPL then learns a latent utility function from these expert judgements to guide subsequent autonomous microscopy experiments. We demonstrate DKPL’s performance in learning physically meaningful nanoscale structures while effectively prioritizing high-information measurement regions using an experimental model dataset with known ground truth. We further apply DKPL to analyze the character of ferroelectric domain walls, where we find DKPL capable of distinguishing between high and low characteristic domain-wall angles in bismuth ferrite, and able to discover both head-to-head and tail-to-tail domain-wall character in erbium manganite. This development establishes an approach to integrate expert knowledge into autonomous microscopy experiments and demonstrates a pathway toward expert-guided self-driving laboratories capable of addressing scientific problems beyond the limits of scalar-metrics-driven learning.

[LG-86] Symbolic Density Estimation for Discrete Distributions

Link: https://arxiv.org/abs/2605.21813
Authors: Ziwen Liu, Meng Li
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 5 figures, 22 tables

Abstract:Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates. A real data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.

[LG-87] Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Link: https://arxiv.org/abs/2605.21803
Authors: Nandan Kumar Jha, Brandon Reagen
Categories: Machine Learning (cs.LG)
Comments: 31 pages, 10 figures, 30 tables. Project page: this https URL

Abstract:Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral-ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling ($\beta = 0.44$) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling ($\beta = 1.02$) in the same regimes, a $2.3\times$ increase in the scaling exponent. This difference is not reducible to validation loss: AdamW configurations can match low-rank Dion variants in perplexity, under extended training, while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard–soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding), and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer–architecture co-design.
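
Illustration (not from the paper): the abstract does not define its soft and hard spectral ranks, so the sketch below assumes two common choices, the exponential of the Shannon entropy of the normalized eigenvalues (soft) and the participation ratio (hard). It also assumes the scaling law has the form rank ≈ C · width^β, so that β is the slope of a log-log fit. The function names `spectral_ranks` and `fit_exponent` are hypothetical.

```python
import math

def spectral_ranks(eigenvalues):
    """Return (soft_rank, hard_rank) for a list of non-negative eigenvalues.

    Assumed definitions (not confirmed by the abstract):
      soft rank = exp(Shannon entropy of the normalized spectrum)
      hard rank = participation ratio, (sum l)^2 / sum(l^2)
    """
    lam = [max(float(v), 0.0) for v in eigenvalues]
    total = sum(lam)
    if total == 0.0:
        return 0.0, 0.0
    p = [v / total for v in lam]
    entropy = -sum(q * math.log(q) for q in p if q > 0.0)
    soft = math.exp(entropy)
    hard = 1.0 / sum(q * q for q in p)
    return soft, hard

def fit_exponent(widths, ranks):
    """Least-squares slope of log(rank) against log(width), i.e. beta."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Under these assumptions a flat spectrum over n directions gives both ranks equal to n, and a rank that doubles whenever the width doubles gives β = 1. The reported exponents are consistent with the stated ratio: 1.02 / 0.44 ≈ 2.3.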

[LG-88] stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

Link: https://arxiv.org/abs/2605.21800
Authors: Lucas Maes, Quentin Le Lidec, Luiz Facury, Nassim Massaudi, Ayush Chaurasia, Francesco Capuano, Richard Gao, Taj Gillin, Dan Haramati, Damien Scieur, Yann LeCun, Randall Balestriero
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one-off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well-tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in-silico evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

[LG-89] Three Costs of Amortizing Gaussian Process Inference with Neural Processes

Link: https://arxiv.org/abs/2605.21798
Authors: Robin Young
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: To appear at ProbNum 2026

Abstract:Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback–Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-c d^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$, where $c > 0$ is a kernel-dependent constant, and as $O(d^{-2\nu/d_x})$ for Matérn-$\nu$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.

[LG-90] MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation ICML2026

Link: https://arxiv.org/abs/2605.21783
Authors: Ahanaf Hasan Ariq
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 15 pages, 0 figures. Accepted at the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)

Abstract:Test-time adaptation (TTA) methods improve model performance under distribution shift but lack formal guarantees connecting shift magnitude to prediction reliability. We develop a PAC-Bayesian framework yielding generalization bounds explicitly parameterized by the maximum mean discrepancy (MMD) between source and target distributions. Our principal contribution is interpreting MMD-balls around the source distribution as credal sets in Walley’s imprecise probability theory, yielding natural epistemic uncertainty quantification. We establish: (i) a PAC-Bayesian bound with an MMD-dependent shift penalty under an RKHS-Lipschitz loss assumption; (ii) a finite-sample version via MMD concentration; (iii) a uniform worst-case risk bound over all distributions in the credal set, with a lower-upper risk decomposition; and (iv) geodesic preservation bounds explaining why kernel-guided adaptation protects local feature geometry. The credal set interpretation separates epistemic from aleatoric uncertainty and provides a principled decision criterion for when adaptation is warranted.
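
Illustration (not from the paper): a minimal sketch of how an MMD-ball membership check could look on finite samples. It uses the standard biased (V-statistic) estimator of squared MMD with a Gaussian RBF kernel. The kernel choice, the `bandwidth` and `radius` parameters, and the function names are assumptions made for illustration, and the sketch does not reproduce the paper's bounds.

```python
import math

def rbf(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(source, target, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_k(a_rows, b_rows):
        total = sum(rbf(a, b, bandwidth) for a in a_rows for b in b_rows)
        return total / (len(a_rows) * len(b_rows))
    return (mean_k(source, source) + mean_k(target, target)
            - 2.0 * mean_k(source, target))

def in_mmd_ball(source, target, radius, bandwidth=1.0):
    """True if the estimated MMD from source to target is at most radius."""
    mmd = math.sqrt(max(mmd_squared(source, target, bandwidth), 0.0))
    return mmd <= radius
```

A target sample that falls outside the ball for a chosen radius would, in this reading, signal a shift larger than the set of distributions the guarantee was stated for.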

[LG-91] Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy

Link: https://arxiv.org/abs/2605.21780
Authors: Aman Saxena, Jan Schuchardt, Yan Scholten, Stephan Günnemann
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Randomized smoothing is a powerful tool for certifying robustness to adversarial perturbations, including poisoning attacks via randomized training and evasion attacks via randomized inference. Extending these guarantees to backdoor attacks, where training and test data are jointly perturbed, remains challenging because training- and test-time randomized mechanisms must be analyzed within a single robustness certificate. We address this by connecting randomized smoothing to the dual view of differential privacy through privacy profiles, which provide a numerical procedure for composing heterogeneous mechanisms. The resulting framework enables tight, modular, end-to-end certification of complex, composed mechanisms while leveraging existing analyses of differentially private mechanisms. We instantiate the framework for DP-SGD and Deep Partition Aggregation with inference-time smoothing, deriving joint robustness guarantees against both training-time and inference-time attacks. Experiments on MNIST and CIFAR-10 demonstrate the effectiveness of our framework. Overall, we provide a principled and general framework for using composite mechanisms to certify robustness under complex threat models that better capture the capabilities of real-world adversaries.

[LG-92] HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

Link: https://arxiv.org/abs/2605.21773
Authors: Danyu Sun, Jinghuai Zhang, Yuan Tian, Zhou Li
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs’ capabilities in supporting host-based intrusion detection systems (HIDS). This task requires fine-grained reasoning over large-scale, noisy, and highly imbalanced system logs, where complex interactions between benign and malicious activities make reliable detection challenging. Our benchmark unifies three public system log datasets, DARPA-E3, DARPA-E5, and NodLink, and introduces a data construction pipeline that transforms raw host telemetry into LLM-compatible inputs, enabling systematic evaluation under realistic intrusion detection settings. Our evaluation of frontier LLMs reveals substantial performance gaps across datasets. While many models achieve high precision (often above 0.8) on simpler datasets, their performance degrades significantly as system logs become noisier and more complex, with MCC frequently dropping below 0.5 and false positive rates increasing sharply. We further analyze model behavior and identify distinct regimes, including conservative detectors with low false positive rates and over-sensitive models that generate excessive alerts. Overall, our results highlight that while LLMs show strong potential for HIDS, their effectiveness is highly sensitive to data complexity, and robust system design is essential for reliable deployment.

[LG-93] Manifold-Guided Attention Steering

Link: https://arxiv.org/abs/2605.21770
Authors: Ian Li, Kapilesh Guruprasad, Raunak Sengupta, Ninad Satish, Loris D’Antoni, Rose Yu
Categories: Machine Learning (cs.LG)

Abstract:Large language models frequently produce errors in reasoning tasks despite possessing the underlying knowledge required for correct reasoning. One possible approach to improve reasoning consistency is through activation steering. However, existing activation steering approaches apply fixed, pre-computed correction vectors, ignoring where the model currently sits along its generation trajectory; the result is indiscriminate perturbation that disrupts already-correct steps as freely as erroneous ones. We propose Manifold-Guided Attention Steering (MAGS), a trajectory-aware inference-time intervention grounded in a geometric observation: the output activations of specific attention heads diverge from a low-dimensional correctness manifold at the point of error, and this deviation compounds through subsequent steps. For each identified attention head, we learn a low-dimensional subspace from contrastive pairs of correct and incorrect traces that capture the directions along which error behavior deviates from correct behavior. During inference, we monitor each head’s proximity to this manifold and apply a targeted projection correction when deviation exceeds a learned threshold, steering the attention output back toward the correct subspace before the error propagates. MAGS consistently outperforms both unsteered baselines and static steering approaches across benchmarks spanning mathematical reasoning (MATH-500, GSM8K), code generation (HumanEval, MBPP), and molecular generation (SMILES), suggesting that correctness manifolds are a general feature of LLM attention geometry.
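
Illustration (not from the paper): a simplified sketch of the monitor-then-project step the abstract describes. It assumes the "correct" subspace is given as a list of orthonormal basis vectors and that deviation is the Euclidean distance from a head's output to that subspace. How the basis and threshold are learned is not shown, and the names `project` and `steer` are hypothetical.

```python
import math

def project(vec, basis):
    """Orthogonal projection of vec onto span(basis).

    Assumes the rows of `basis` are orthonormal.
    """
    out = [0.0] * len(vec)
    for b in basis:
        coeff = sum(v * bi for v, bi in zip(vec, b))
        for i, bi in enumerate(b):
            out[i] += coeff * bi
    return out

def steer(vec, basis, threshold):
    """Return (output, corrected_flag).

    The vector is replaced by its projection only when its distance to the
    subspace exceeds the threshold; otherwise it is returned unchanged.
    """
    proj = project(vec, basis)
    deviation = math.sqrt(sum((v - p) ** 2 for v, p in zip(vec, proj)))
    if deviation > threshold:
        return proj, True
    return list(vec), False
```

The conditional is the point of contrast with static steering: outputs already close to the subspace are left untouched, so correct steps are not perturbed.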

[LG-94] Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning

Link: https://arxiv.org/abs/2605.21765
Authors: Emanuel Sommer, David Rügamer
Categories: Machine Learning (cs.LG)
Comments: In Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026

Abstract:The practical adoption of sampling-based inference (SAI) in Bayesian neural networks (BNNs) remains limited, partly due to persistent misconceptions about the feasibility and efficiency of sampling. This position paper argues that SAI has achieved computational parity with optimization-based methods and is on the verge of superseding such methods for effective and efficient inference in BNNs. This development should be in the interest of the whole community, promoting BNNs as a principled paradigm with its long-standing yet unfulfilled promise of providing principled uncertainty quantification for neural networks. SAI can even do more – yielding superior prediction performance through model averaging, serving as the foundation for a plethora of possible downstream tasks, and providing crucial insights into the landscape of BNNs. In order to make such a change happen and unfold the potential of sampling, overcoming current misconceptions is a necessary first step. The next step is to realign research efforts toward addressing remaining challenges in SAI. In particular, the community must focus on two core problems: sufficient exploration of the posterior landscape and high-fidelity distillation of posterior samples for efficient downstream inference. By addressing conceptual and practical obstacles, we can unlock the full potential of SAI and establish it as a central tool in Bayesian deep learning.

[LG-95] On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Link: https://arxiv.org/abs/2605.21763
Authors: Oliver Mortensen, Mohammad Sadegh Talebi
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments: Accepted to RLC 2026. arXiv admin note: substantial text overlap with arXiv:2506.00286

Abstract:We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain, i.e., $\mathrm{dom}(u) \neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of the state-action space, and for a more restricted class of utilities, we derive lower bounds that make the dependence on the effective horizon $\frac{1}{1-\gamma}$ explicit. Specifically, for $\mathrm{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $\frac{1}{\tau^2}$, thus improving by a factor of $\frac{1}{\tau}$ over the state of the art, although our bound has a suboptimal dependence on $\frac{1}{1-\gamma}$.
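
Illustration (not from the paper): a small numerical sketch of how CVaR arises as an OCE. It assumes the reward-maximizing convention OCE_u(X) = sup_λ { λ + E[u(X − λ)] } with the utility u(t) = min(t, 0) / τ, which recovers the mean of the worst τ-fraction of outcomes. The paper's sign convention and recursive construction may differ, and the function names are hypothetical.

```python
def oce(samples, utility, candidates):
    """Empirical OCE: max over candidate lambdas of lambda + mean(u(x - lambda)).

    This is a search over the supplied candidates, so it is exact only when
    the maximizer is among them.
    """
    n = len(samples)
    return max(lam + sum(utility(x - lam) for x in samples) / n
               for lam in candidates)

def cvar_utility(tau):
    """Utility u(t) = min(t, 0) / tau, defined on all of R (full domain)."""
    return lambda t: min(t, 0.0) / tau

def cvar_via_oce(samples, tau):
    """Lower-tail CVaR at level tau.

    For this piecewise-linear concave objective the maximum is attained at
    one of the sample points, so searching over the samples is enough.
    """
    return oce(samples, cvar_utility(tau), candidates=samples)
```

For the ten outcomes 1, 2, ..., 10 and τ = 0.2, the worst fifth of the outcomes is {1, 2}, and the OCE search returns their mean, 1.5.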

[LG-96] Machine learning prediction of obstructive coronary artery disease using opportunistic coronary calcium and epicardial fat assessments from CT calcium scoring scans

Link: https://arxiv.org/abs/2605.21762
Authors: Juhwan Lee, Ammar Hoori, Tao Hu, Justin N. Kim, Mohamed H. E. Makhlouf, Michelle C. Williams, David E. Newby, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures, 3 tables

Abstract:Non-contrast computed tomography calcium scoring (CTCS) is a cost-effective imaging modality widely used to detect coronary artery calcifications. This study aimed to develop an advanced machine learning framework that utilizes quantitative analyses of coronary calcium and epicardial fat from CTCS images to predict obstructive coronary artery disease (CAD). The study population consisted of 1,324 patients from the SCOT-HEART clinical trial who underwent both CTCS and coronary CT angiography. We extracted and analyzed a broad range of features, including 24 clinical variables, 189 calcium-omics, and 211 epicardial fat-omics features from the CTCS images. Feature selection was conducted using the CatBoost algorithm combined with SHapley Additive exPlanation (SHAP) values. Predictive modeling utilized the CatBoost gradient boosting method, focusing on the most informative features. From an initial set of 424 candidate features, 14 were identified as most predictive through the CatBoost-SHAP method. The top two predictive features originated from fat-omics, with the remaining 12 features derived from calcium-omics. The optimized model achieved robust predictive capabilities, demonstrating a sensitivity of 83.1±4.6%, specificity of 93.8±1.7%, accuracy of 85.3±2.0%, and an F1 score of 73.9±3.3%. Inclusion of calcium-omics and fat-omics data significantly improved predictive performance. Notably, the model also showed reliable predictive accuracy in patients with diverse coronary calcium scores, including cases with obstructive CAD despite a zero-calcium score. This innovative approach holds promise for improving clinical decision-making and potentially reducing dependence on contrast-enhanced or invasive diagnostic procedures, particularly within low- to intermediate-risk patient groups.

[LG-97] Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization

Link: https://arxiv.org/abs/2605.21751
Authors: Zhiqi Gao, Albert Ge, Alexander Berenbeim, Nathaniel D. Bastian, Frederic Sala
Categories: Machine Learning (cs.LG)

Abstract:Text-to-optimization requires two separable capabilities: modeling – choosing the right optimization structure – and binding – grounding every coefficient, index, and parameter in the concrete problem data. We study this via Text2Opt-Bench, a scalable benchmark of solver-verified optimization problems spanning 12 categories, from textbook linear programs to stochastic and multi-objective formulations with up to thousands of variables. Across 10+ models, we find that accuracy collapses as instance data grows, even when the formulation itself is simple. We call this the effective binding limit. We address this via a simple inference-time approach, BIND, which externalizes numeric data to structured files so the model binds data programmatically rather than transcribing from the prompt. BIND improves GPT-5-Nano from 59.1% to 82.4% accuracy, matching pass@5 (82.0%) at lower token cost than pass@1, and GPT-5 from 86.2% to 95.8%. Furthermore, we validate our hypothesis by finetuning a model exclusively on binding and show that it outperforms end-to-end SFT and RL across three structurally distinct optimization categories, with a 1.5B binding specialist alone matching a 7B end-to-end baseline.

[LG-98] Quantitative coronary calcification analysis for prediction of myocardial ischemia using non-contrast CT calcium scoring

Link: https://arxiv.org/abs/2605.21745
Authors: Juhwan Lee, Sadeer Al-Kindi, Ammar Hoori, Tao Hu, Hao Wu, Justin N. Kim, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 4 figures, 3 tables

Abstract:Non-contrast computed tomography calcium scoring (CTCS) is widely recognized as an effective tool for cardiovascular risk stratification. This study aimed to develop a novel machine learning framework for predicting myocardial ischemia from routine non-contrast CTCS scans using quantitative coronary calcium assessment. This study analyzed 1,375 patients who underwent both non-contrast CTCS and regadenoson stress cardiac positron emission tomography myocardial perfusion imaging within one year at University Hospitals Cleveland Medical Center. A total of 74 variables, including clinical variables, Agatston score, and calcium-omics features, were evaluated. Relevant features were identified using XGBoost with Shapley Additive exPlanations (SHAP). Predictive models were trained and evaluated using 5-fold cross-validation. Among 987 patients, 89 (9%) were positive for myocardial ischemia. The final model incorporated the Agatston score, eight calcium-omics features, and age. The proposed model achieved a precision of 98.9±3.0%, sensitivity of 79.2±8.4%, and F1 score of 87.7±5.3%. The addition of calcium-omics features significantly improved predictive performance compared with models using clinical variables alone or clinical variables with the Agatston score (p < 0.05). Interestingly, the number of calcified arteries, despite being the lowest-ranked feature based on SHAP analysis, showed the strongest association with myocardial ischemia in logistic regression analysis (odds ratio: 3.63, 95% confidence interval: 2.80-4.77, p < 0.00001). We developed a machine learning approach for predicting myocardial ischemia using routinely acquired non-contrast CTCS scans. Calcium-omics features provided incremental predictive value beyond conventional risk factors and Agatston scoring and may support more accessible cardiovascular risk stratification.

[LG-99] Correcting Class Imbalance in Prior-Data Fitted Networks for Tabular Classification

Link: https://arxiv.org/abs/2605.21742
Authors: Samuel McDowell, Nathan Stromberg, Lalitha Sankar
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 5 pages, 6 figures, Information Theory Workshop (ITW)

Abstract:Prior-data fitted networks (PFNs) have achieved exceptional performance on tabular classification tasks. However, like other classifiers, their performance can suffer under the effect of class imbalance, resulting in poor performance for rare classes. Several techniques exist which attempt to mitigate the deleterious effect of class imbalance on classification performance, but the in-context learning (ICL) dynamic of PFNs means that loss-based strategies are impossible, and other techniques are unproven. We have adapted several classical techniques addressing class imbalance and analyzed their performance on PFN classification. We observe that thresholding performs exceptionally well because of the calibration characteristics of PFNs, and downsampling performs comparably because of PFNs' exceptional limited-data performance, with the additional benefit of reduced computation cost for inference.
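
Illustration (not from the paper): a generic sketch of the two classical remedies the abstract names, written independently of any PFN library. `downsample` balances the context set by randomly dropping rows from larger classes, and `predict_with_threshold` replaces the default 0.5 cut-off on a predicted positive-class probability. The function names, the binary setting, and the way a threshold would be chosen are all assumptions.

```python
import random

def downsample(features, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        for x in rng.sample(rows, n_min):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

def predict_with_threshold(prob_positive, threshold):
    """Label a row positive (1) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in prob_positive]
```

Neither step touches a training loss, which matches the abstract's point that loss-based strategies are unavailable when the classifier works purely in context. A smaller balanced context also means fewer rows to process at inference time.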

[LG-100] I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

Link: https://arxiv.org/abs/2605.21731
Authors: Barbara Tarantino, Gennaro Auricchio, Paolo Giudici
Categories: Machine Learning (cs.LG)

Abstract:Deep learning models are increasingly used in scientific prediction tasks where strong benchmark performance is often interpreted as evidence of scientifically meaningful behavior. This interpretation is fragile, as models may exploit shortcut features, dataset-specific regularities, or distributional biases that are predictive on held-out data but not aligned with domain-relevant structure. To address this limitation, we introduce the I-SAFE (Interventional Secure, Accurate, Fair and Explainable) framework, a post-hoc distributional auditing framework for scientific AI models centered on the Wasserstein Coherence Metric (WCM). Given a trained black-box predictor and an external structural prior encoding domain knowledge about task-relevant input structure, I-SAFE evaluates raw model outputs under structurally guided perturbations of the input. The proposed audit measures output-distribution coherence through three complementary metrics: a Quantile-Based Metric (QBM) for location-level coherence, the WCM for ordinal coherence, and a translation-invariant WCM variant for shape coherence. We instantiate I-SAFE on drug–target interaction (DTI) prediction using the Davis kinase benchmark, KLIFS (Kinase–Ligand Interaction Fingerprints and Structures) binding-pocket annotations, and three sequence-based DTI models: DeepConvDTI, DeepDTA, and TAPB. Although the models operate in a comparable predictive regime, I-SAFE reveals substantially different distributional response profiles, a distinction invisible to accuracy-based evaluation. The framework is model-agnostic and applicable to any domain where inputs admit a structured decomposition and an external prior is available.

[LG-101] Zero-shot adaptation to order book dynamics

Link: https://arxiv.org/abs/2605.21707
Authors: Arip Asadulaev
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:We describe an adaptive market-making architecture that preserves the analytical structure of the Avellaneda–Stoikov framework while introducing a successor-measure-style adaptation mechanism. We keep the fast Hamilton–Jacobi–Bellman (HJB) structure of Avellaneda–Stoikov and make it adaptive to changing market regimes and trading objectives. The central idea is to separate market dynamics from the trading objective. The market state determines a low-dimensional set of Avellaneda–Stoikov parameters, while recent realized rewards determine a low-dimensional objective vector. The HJB forward map then converts this objective into optimal bid and ask quotes through a scalarization of future reward features.
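
Illustration (not from the paper): for context, a sketch of the classical closed-form Avellaneda–Stoikov quotes that the abstract builds on. It uses the standard approximation with reservation price r = s − q·γ·σ²·(T − t) and total spread γ·σ²·(T − t) + (2/γ)·ln(1 + γ/k). The paper's adaptive mechanism, which sets the parameters from the market state and an objective vector, is not reproduced here. The function name `as_quotes` and its argument names are hypothetical.

```python
import math

def as_quotes(mid, inventory, gamma, sigma, k, time_left):
    """Return (bid, ask) from the classical Avellaneda-Stoikov approximation.

    mid        current mid price s
    inventory  signed position q
    gamma      risk-aversion parameter (> 0)
    sigma      volatility of the mid price
    k          order-arrival decay parameter (> 0)
    time_left  remaining horizon T - t
    """
    # Reservation price shifts away from the mid in proportion to inventory.
    reservation = mid - inventory * gamma * sigma ** 2 * time_left
    # Total bid-ask spread quoted symmetrically around the reservation price.
    spread = (gamma * sigma ** 2 * time_left
              + (2.0 / gamma) * math.log(1.0 + gamma / k))
    return reservation - spread / 2.0, reservation + spread / 2.0
```

With zero inventory the quotes are symmetric around the mid price. A long position lowers both quotes, which makes selling more likely and buying less likely.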

[LG-102] Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective

Link: https://arxiv.org/abs/2605.21692
Authors: David Perera, Victor Moura, Lais Isabelle Alves dos Santos, Michel F. C. Haddad, Flavio Figueiredo
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Characterizing precisely the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners’ intuition to make key design choices. In order to mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the intrinsic dimension of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.

[LG-103] Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos ICML2026

链接: https://arxiv.org/abs/2605.21648
作者: Lucas Fernandez Sarmiento
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 36 pages, 11 figures

点击查看摘要

Abstract:We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a rank-flow tie-breaker then selects front-loaded schedules, substantially reducing held-out test loss at no extra computational cost, with accuracy gains as a consistent secondary effect. We test the predictions in MLPs and Vision Transformers and discuss CNN/ResNet extensions.

[LG-104] Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations

链接: https://arxiv.org/abs/2605.21646
作者: Jacek Karolczak,Jerzy Stefanowski
类目: Machine Learning (cs.LG)
*备注: Accepted for publication in International Journal of Applied Mathematics and Computer Science (IJAMCS)

点击查看摘要

Abstract:Prototype-based explanations offer an intuitive, example-based approach to support the interpretability of machine learning black box classifiers but often lack feature-level granularity. We introduce a framework that integrates feature importance at two levels to address this gap. First, for local explanations, we propose alike parts: a method that uses feature importance scores to highlight the most relevant, shared feature subsets between a classified instance and its nearest prototype, guiding user attention. Second, we augment the global prototype selection objective function with a feature importance term to actively promote diversity in the feature attributions of the selected prototypes. Experiments on six benchmark datasets show that this augmented selection process maintains or, in some cases, increases the prediction fidelity of the surrogate model, suggesting that feature diversity does not compromise model fidelity.

[LG-105] BlockFormer: Transformer-based inference from interaction maps

链接: https://arxiv.org/abs/2605.21617
作者: Eloïse Touron,Pedro L. C. Rodrigues,Julyan Arbel,Nelle Varoquaux,Michael Arbel
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques – notably Hi-C – can be formulated as a generic inverse problem: infer a set of parameters given a map summarizing pairwise interactions between entities through blocks of variable numbers and sizes. In this work, we introduce a data-driven approach that leverages shared structure between these maps, such as global alignment between localized patterns, while handling the variability in number and size of entities arising in real-world data. Our approach relies on a transformer architecture capable of handling such variability and a custom simulator to generate abundant, yet computationally cheap synthetic data for training. Applied to the problem of centromere localization, the method accurately recovers their genomic positions across a wide range of species of various genome sizes.

[LG-106] ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

链接: https://arxiv.org/abs/2605.21615
作者: Chang Liu,Noah Fleischmann,Nicolò Altamura,Edward Raff,James Holt,Kristopher Micinski
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Existing binary corpora typically capture only one or two axes of binary variation: they either provide cross-compiler builds without a temporal axis, or CVE labels for single-build binaries. None combine cross-build diversity, cross-version history, and CVE labels into a queryable structure. We present ASSEMBLAGE-DEEPHISTORY, which consolidates these dimensions into a unified framework where every binary’s compilation context, source code, vulnerable functions, and package version are stored as first-class metadata. ASSEMBLAGE-DEEPHISTORY comprises 73,610 binaries spanning 248 open-source projects, compiled across GCC, Clang, and MSVC at multiple optimization levels on Linux and Windows, with multi-year historical builds. Each binary is indexed in a database that links it to its source code, functions, debug info, variant builds, historical versions, and vulnerable functions. Three analyses demonstrate this structure’s value: (1) a three-stage LLM benchmark (recognition, strategy-guided detection, and cross-build transfer) to test whether LLMs reason about binary vulnerabilities or pattern-match on build-specific artifacts; (2) a comparison of MalConv embeddings, jTrans function embeddings, and TLSH fuzzy hashes quantifying how same-package versions cluster in each space; and (3) a Bayesian regression decomposing binary similarity into contributions from temporal distance, file changes, and commits.

[LG-107] AgForce Enables Antigen-conditioned Generative Antibody Design

链接: https://arxiv.org/abs/2605.21610
作者: Mansoor Ahmed,Murray Patterson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Antibody design methods condition on antigen structure to generate complementarity-determining regions (CDR), yet a systematic evaluation of baseline methods reveals that they largely ignore the antigen input. We identify three failure modes that explain this behavior. Antigen blindness arises because models derive predictions from antibody framework context rather than antigen information, producing nearly identical CDRs regardless of the target. Vocabulary collapse reduces predicted amino acids to three to five per position, far below the ground truth distribution in native sequences. Moreover, any model trained with standard per-position cross-entropy converges to the positional marginal distribution, making it provably unable to produce antigen-specific sequence predictions. We propose a novel encoder-decoder architecture called AgForce, that uses a graph neural network (GNN) as the encoder and specialized decoders for sequence-structure co-design. Specifically, we apply framework dropout, gated bottlenecks, and hyperbolic cross attention that prevent the antibody shortcut path. In the decoder, a Mixture Density Network (MDN) sequence head with Potts-like pairwise coupling and annealed Multiple Choice Learning (aMCL) replaces the cross-entropy objective with a multi-component distribution whose optimal solution differs from the positional marginal. An antigen cycle consistency head routes gradients through the sequence decoder, forcing predicted distributions to encode antigen identity. AgForce achieves the best binding quality and sequence recovery simultaneously on the CHIMERA-Bench dataset, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the baselines across all interface metrics, and nearly doubling the effective vocabulary of GNN methods. The source code is available at: this https URL

[LG-108] ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning

链接: https://arxiv.org/abs/2605.21600
作者: Mansoor Ahmed,Spencer VonBank,Nadeem Taj,Sujin Lee,Naila Jan,Murray Patterson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Computational antibody CDR design methods condition on antigen structure to generate binding loops, yet existing architectures conflate two fundamentally distinct sub-problems: identifying which CDR positions will contact the antigen, and selecting amino acids at those positions. This conflation forces models to learn contact reasoning implicitly through uniform message passing, diluting antigen signal across all positions equally. We introduce ConTact, a contact-then-act architecture that explicitly decomposes CDR design into three cascaded stages: learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features into the sequence head. A distance-biased cross-attention module encodes geometric priors favoring spatial neighbors, while a contact-weighted cross-entropy loss concentrates gradient signal on binding-critical positions. On CHIMERA-Bench dataset, ConTact achieves the best structural quality (7% RMSD improvement over the next-best baseline), best epitope awareness (10% F1 score over GNN baselines), and competitive sequence recovery (AAR 0.38) among several CDR-H3 design baselines.

[LG-109] Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

链接: https://arxiv.org/abs/2605.21568
作者: Jack Kendall
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we extend the Equilibrium Propagation framework to skew-gradient systems and show an equivalence between deep Energy-Based Models and Hamiltonian neural networks. We focus on networks of diffusively coupled Fitzhugh-Nagumo neurons as a prototypical example. We show that since stationary solutions of the Fitzhugh-Nagumo model are described by self-adjoint operators, the methods of equilibrium propagation for performing credit assignment can be applied. Furthermore, for Fitzhugh-Nagumo networks with the topology of a deep residual network, we show that the steady state solutions admit a (spatial) Hamiltonian, and thus the methods of Hamiltonian Echo Backpropagation can be applied. We end by deriving an explicit layer-wise Hamiltonian recurrence relation governing inference for stationary solutions of both deep Fitzhugh-Nagumo networks and deep Energy-Based Models.

[LG-110] Calibration Uncertainty Communication and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

链接: https://arxiv.org/abs/2605.21566
作者: Michael O. Eniolade
类目: Machine Learning (cs.LG)
*备注: 27 pages, 6 figures, 4 tables. Supplementary materials (S1-S4) included as ancillary file

点击查看摘要

Abstract:Machine learning models for chronic kidney disease (CKD) risk prediction often post strong discrimination scores on internal test sets. Calibration and uncertainty quantification get far less attention, leaving clinicians without reliable information about whether the probability outputs are accurate. We trained five classifiers on the UCI CKD dataset (400 patients, 62.5% CKD prevalence): logistic regression, random forest, XGBoost, SVM with Platt scaling, and Gaussian naive Bayes. We evaluated each across calibration quality, conformal prediction coverage, and an eight-criterion deployment readiness framework. A distributional stress-test applied the best-calibrated variant of each model to the open-access MIMIC-IV demo cohort (97 patients, 23.7% CKD) to assess behaviour under prevalence shift and feature missingness. We measured calibration before and after Platt scaling and isotonic regression using Expected Calibration Error and Brier Score, and quantified uncertainty through split conformal prediction targeting 90% marginal coverage. All five models reached AUROC 1.00 on the UCI test set. Isotonic recalibration reduced internal ECE to 0.000-0.022. On MIMIC-IV, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, and conformal coverage dropped from 0.80-0.98 to 0.21-0.25 against a 90% target. No model scored above 4 out of 16 on the deployment readiness checklist. Near-perfect internal performance did not transfer. Calibration stability and conformal coverage should be evaluated on external data before any clinical prediction model moves toward deployment.
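
The two calibration metrics named in the abstract above can be sketched as below. This is a minimal illustration, assuming a binary task, equal-width probability bins, and a comparison of the mean predicted positive-class probability with the observed positive rate in each bin; the paper's exact binning scheme is not stated in the abstract, and the function names are hypothetical.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE for binary labels and positive-class probabilities (one common variant)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in [0, n_bins - 1].
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed positive rate and mean predicted probability,
            # weighted by the fraction of samples falling in this bin.
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))
```

Both metrics are computed from labels and predicted probabilities alone, so they can be re-evaluated on an external cohort without retraining.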

[LG-111] Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

链接: https://arxiv.org/abs/2605.21565
作者: Phuong-Anh Nguyen,The-Son Le,Duc-Trong Le,Cam-Van Thi Nguyen
类目: Machine Learning (cs.LG)
*备注: Accepted at Neural Computing and Applications (Springer), 2026

点击查看摘要

Abstract:Multimodal Emotion Recognition in Conversations (MERC) is a crucial task for understanding human interactions, where multimodal approaches integrating language, facial expressions, and vocal tone have achieved significant progress. However, modality misalignment and imbalanced learning remain major challenges, limiting the effective utilization of multimodal information. To address this issue, we propose a plug-and-play framework based on Self-Paced Curriculum Learning (SPCL) for MERC. We introduce a dual-level Difficulty Measurer that captures both utterance-level and conversation-level challenges. The utterance-level score models fine-grained modality-specific difficulty, while the conversation-level score captures broader dialogue structures, including emotional dependencies and modality coherence. Based on these scores, the Learning Scheduler dynamically guides training from easier to more difficult instances. By integrating SPCL into existing MERC architectures, our method alleviates modality imbalance and improves model robustness. Extensive experiments on the IEMOCAP and MELD datasets demonstrate consistent improvements across different architectures and modality settings. On IEMOCAP, SPCL improves weighted F1-score by approximately +1.2% to +6.6% over baseline models, while on MELD, gains reach up to +10.4%. These results highlight the effectiveness and generalizability of SPCL as a lightweight plug-and-play module for multimodal emotion recognition.

[LG-112] Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction

链接: https://arxiv.org/abs/2605.21563
作者: Fan Zhang,Simon Deltadahl,Majid Lotfian Delouee,Daniel Kreuter,Joseph Taylor,Allerdien Visser,BloodCounts Consortium,James H. F. Rudd,Nicholas S. Gleadall,Suthesh Sivapalaratnam,Folkert Asselbergs,Martijn C. Schut,Michael Roberts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent reviews find that the vast majority of published healthcare federated learning (FL) studies never reach real-world deployment. We developed an embedding-based FL pipeline for iron deficiency prediction from routine full blood count (FBC) data and deployed it across real institutional environments at Amsterdam University Medical Centre (AUMC) and NHS Blood and Transplant (NHSBT), two clinical environments that differ markedly in iron deficiency prevalence, ferritin distribution, and subject populations. A frozen domain-specific haematology foundation model, DeepCBC, performs site-local representation extraction, restricting federated training to a compact downstream classifier and substantially reducing recurrent communication relative to full-encoder federation. The two clinical datasets are structurally not independent and identically distributed (non-IID), with heterogeneity arising from distinct population differences rather than sampling artefacts. Runtime governance is enforced by FLA^3, a healthcare-oriented FL platform providing study-scoped execution, policy-based authorisation, and signed audit logging. Standard sample-size-weighted aggregation (FedAvg) reduced the area under the receiver operating characteristic curve (ROC-AUC) at both sites relative to local-only training, as the global update was biased towards the larger AUMC distribution. FedMAP, a personalised aggregation method, raised ROC-AUC from 0.9470 to 0.9594 at AUMC and from 0.8558 to 0.8671 at NHSBT relative to local-only training, achieving the highest macro ROC-AUC of 0.9133 and the best macro balanced accuracy overall. These results support personalised aggregation in clinical federations where client sample size and task relevance diverge substantially.
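
The sample-size-weighted aggregation that the abstract calls FedAvg can be sketched as below, to show why a much larger site dominates the global update. This is a generic illustration with flat parameter vectors and made-up client sizes; it is not the paper's pipeline and does not implement FedMAP.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its local sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()  # per-client mixing coefficients, summing to 1
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    # Weighted sum over the client axis.
    return np.tensordot(coeffs, stacked, axes=1)

# Hypothetical example: a site with 900 samples and a site with 100 samples.
global_weights = fedavg([[1.0, 1.0], [0.0, 0.0]], [900, 100])
```

In this toy case the aggregate lands at 90% of the larger client's parameters, which mirrors the bias toward the larger site described in the abstract.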

[LG-113] Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection

链接: https://arxiv.org/abs/2605.21561
作者: Mathieu Cherpitel,Thomas Bäck,Martijn R. Tannemaat,Anna V. Kononova
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised feature selection is commonly formulated as a multiobjective optimisation problem that jointly optimises subset quality and subset size. Yet the behaviour of this formulation depends critically on the choice of evaluation objective, the direction of subset-size regularisation, and the initialisation strategy. We study these factors in a controlled setting using a synthetic dataset with known informative, redundant, and irrelevant feature types. Six formulations are compared by combining three evaluation objectives: accuracy, silhouette score, and PCA reconstruction loss with subset-size minimisation or maximisation. The results show that formulation strongly affects both search dynamics and the quality of the resulting Pareto front. Silhouette-based formulations exhibit a strong bias toward trivial low-cardinality solutions and remain weak proxies for predictive performance. In contrast, the proposed PCA loss objective produces compact subsets with test accuracy comparable to subsets obtained by directly optimising supervised accuracy. Overall, the study shows that objective design is central to effective multiobjective unsupervised feature selection.

[LG-114] AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

链接: https://arxiv.org/abs/2605.21560
作者: Penglin Dai,Zijie Zhou,Xincao Xu,Junhua Wang,Xiao Wu,Lixin Duan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying neural networks on microcontroller units (MCUs) is critical for edge intelligence but remains challenging due to tight memory, storage, and computation constraints. Existing approaches, such as model compression and hardware-aware neural architecture search (HW-NAS), often depend on proxy metrics, incur high search cost, and do not fully bridge the gap between architecture design and verified deployment. This paper presents AutoMCU, a feasibility-first large language model (LLM)-based multi-agent system for automated neural network customization under MCU constraints. Given natural-language task requirements and hardware specifications, AutoMCU iteratively generates structured architecture candidates, filters infeasible designs through vendor toolchain feedback before training, evaluates feasible models under a controlled protocol, and verifies deployability through backend-grounded deployment analysis. AutoMCU includes two key mechanisms: 1) hardware-in-the-loop architecture generation for early elimination of undeployable candidates under RAM and Flash constraints, and 2) state-isolated multi-agent scheduling for stable coordination of proposal, training, evaluation, and deployment stages. Experiments on CIFAR-10 and CIFAR-100 under strict MCU constraints show that AutoMCU achieves competitive accuracy while reducing customization time to about 1–2 hours, compared with hundreds of GPU hours for representative MCU-oriented HW-NAS baselines. Comparisons with ColabNAS and the LLM-based NAS method GENIUS on NAS-Bench-201 further demonstrate the effectiveness and stability of AutoMCU. Real-device deployments on multiple STM32 microcontrollers validate its practical applicability to MCU-scale edge intelligence.

[LG-115] Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising SIGIR

链接: https://arxiv.org/abs/2605.21556
作者: Zhaoqi Zhang,Jiaming Deng,Miao Xie,Linyou Cai,Qianlong Xie,Xingxing Wang,Siqiang Luo,Gao Cong
类目: Machine Learning (cs.LG)
*备注: Accepted at SIGIR Industry Track 2026

点击查看摘要

Abstract:Guaranteed display advertising is crucial for platform monetization, yet existing methods often operate under a single-slot assumption, limiting their ability to optimize allocation across multi-slot page views. In this paper, we propose a novel joint optimization framework for multi-slot GD allocation, addressing key challenges such as slot-level redundancy, contract imbalance, and exposure concentration. Our approach formulates the allocation as an offline bipartite matching problem with a contract roulette mechanism for slot exclusivity and Page View constraints for impression control, and incorporates a scalable allocation optimization algorithm for efficient large-scale deployment. Extensive online tests on the Meituan advertising platform demonstrate that our method significantly improves merchant ROI, platform revenue efficiency, and contract fulfillment robustness. Specifically, online A/B tests show a 28.99% increase in Average Revenue Per User under 70% traffic, and DID analysis further indicates improved contract stability, demonstrating the strong applicability and effectiveness of our framework in real-world advertising deployments.

[LG-116] TONIC: Token-Centric Semantic Communication for Task-Oriented Wireless Systems

链接: https://arxiv.org/abs/2605.21553
作者: Sige Liu,Kezhi Wang
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Image and Video Processing (eess.IV)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:Tokens are becoming the basic units through which foundation models represent and process information for understanding and inference. However, traditional wireless communication, centered on bit-level fidelity, faces a mismatch between what is transmitted reliably and what downstream models actually consume. This mismatch calls for a communication design that directly accounts for token-level task relevance and downstream model requirements, rather than treating all transmitted bits as equally important. In this paper, we propose TONIC, a token-centric semantic communication framework for task-oriented wireless systems. The transmitter converts each source sample into a sequence of tokens, estimates token-level task relevance, and allocates protection through utility-aware unequal error protection under a fixed channel-use budget. At the receiver, token-level confidence is used to gate unreliable decisions, turning harmful substitutions into recoverable erasures before a Transformer-based completion model restores the masked tokens for final task inference. Our framework combines transmitter-side semantic-aware protection with receiver-side confidence-aware gating in a modular and interpretable architecture, rather than relying solely on fully black-box end-to-end learning. We further establish a utility-aware Bayes-risk interpretation for the receiver-side gating rule and study its interaction with unequal protection and completion. Experimental results on image classification show that TONIC consistently outperforms separation-based schemes, the pixel-domain DeepJSCC baseline, and token-domain baselines under matched communication budgets over AWGN, Rayleigh, and Rician channels.

[LG-117] Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift ICML2026

链接: https://arxiv.org/abs/2605.21552
作者: Jinzong Dong,Zhaohui Jiang,Bo Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ICML 2026

点击查看摘要

Abstract:Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and identically distributed, limiting their effectiveness under covariate shifts. Previous calibration methods under covariate shift struggle with class-wise or canonical calibrations and often rely on unstable importance weighting when density ratios are large or unbounded. Given the above limitations, this paper rethinks confidence calibration under covariate shifts. First, we derive a necessary and sufficient condition for confidence calibration under covariate shifts, named Expectation consistency condition, which reveals covariate shifts do not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Then, utilizing Expectation consistency condition, this paper proposes an unsupervised domain adaptation loss to calibrate confidence of the target domain, named Expectation consistency loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing ECL loss has the same sample complexity as Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for ECL loss. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.

[LG-118] PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting

链接: https://arxiv.org/abs/2605.21550
作者: Wangzhi Yu,Peng Zhu,Qing Zhao,Yiwen Jiang,Dawei Cheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electricity load peak forecasting (ELPF), simultaneously predicting peak timing and intensity, is a prerequisite for effective grid scheduling and risk management. However, existing methods face three limitations. First, they adopt a two-stage predict-then-locate paradigm, which severs the link between temporal localization and intensity regression. Second, they still struggle with the multi-scale representation conflict, leading to peak misjudgment and timing misalignment. Third, the lack of explicit peak timing context during intensity regression causes intensity smoothing because predictions are dominated by global smoothing trends. To address these limitations, we propose PeakFocus, a unified framework for ELPF. (i) A Unified Peak-Aware Pipeline (UPAP) utilizes a triple hybrid loss to jointly supervise temporal localization and intensity regression, alongside a tolerance-based evaluation protocol. (ii) A Multi-Scale Mixing Peak Locator (MSM-PL) exploits coarse-grained features to mitigate peak misjudgment caused by local fluctuations, and injects them into fine-grained features via a cascade mechanism to resolve timing misalignment. (iii) A Location-Aware Decoder (LAD) injects peak timing context into the intensity regression process, providing explicit guidance to counteract intensity smoothing and improve peak intensity estimation. Extensive experiments on the public Electricity (ELC) dataset and our industrial-scale World Large-scale Electricity Load (WLEL) dataset show that PeakFocus outperforms baselines in both timing precision and intensity estimation.

[LG-119] Tabular foundation models for robust calibration of near-infrared chemical sensing data

链接: https://arxiv.org/abs/2605.21544
作者: Robin Reiter,Denis Cornet,Fabien Michel,Lauriane Rouan,Gregory Beurier
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 56 pages, 17 figures, including supplementary material

点击查看摘要

Abstract:Near-infrared spectroscopy is increasingly used as a rapid, non-destructive chemical sensing technology for the analysis of food, pharmaceutical, biological, and environmental samples. However, the practical deployment of NIR sensors still depends on calibration models able to handle high-dimensional, collinear spectra, limited sample sizes, preprocessing dependence, spectral outliers, and extrapolation beyond the calibration domain. Here, we evaluate whether tabular foundation models can provide a new calibration strategy for NIR chemical sensing. We benchmark TabPFN on 66 NIR datasets covering 54 regression and 12 classification tasks, and compare direct inference on raw spectra with preprocessing-optimized inference against PLS/PLS-DA, Ridge, CatBoost, and one-dimensional convolutional neural networks. The study uses a unified validation framework in which preprocessing and model selection are performed exclusively on calibration data before external test evaluation. In regression, preprocessing-optimized TabPFN achieves the best overall average rank and significantly outperforms PLS, CatBoost, TabPFN on raw spectra, and CNN-1D, while remaining statistically comparable to Ridge. In classification, TabPFN applied directly to raw spectra provides the best average rank, with performance close to the optimized variant. Robustness analyses show that TabPFN provides strong average predictive performance but that its advantage decreases on spectral outliers and extrapolated samples, where classical chemometric models remain competitive. These results suggest that tabular foundation models can complement established chemometric workflows for NIR chemical sensing, especially in small- to medium-sized calibration settings, while highlighting the need for spectroscopy-specific priors and uncertainty-aware deployment strategies.

[LG-120] Provable Joint Decontamination for Benchmarking Multiple Large Language Models

链接: https://arxiv.org/abs/2605.21543
作者: Zhenlong Liu,Hao Zeng,Hongxin Wei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated assumptions. Specifically, JECS computes per-model conformal p-values, aggregates them by the per-item maximum, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. By applying the adaptive Benjamini-Hochberg (BH) procedure to the envelope-rescaled values, we select a benchmark with provable GCR control. Extensive experiments across various models and benchmarks demonstrate that JECS achieves higher power than the max-p baseline while consistently maintaining the target GCR control.
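
The generic building blocks the abstract mentions (per-model conformal p-values, per-item max aggregation, and a Benjamini-Hochberg step) can be sketched as below. This corresponds roughly to a max-p baseline with the standard, non-adaptive BH procedure; it does not implement the envelope reconstruction that defines JECS. The score direction (larger score means more evidence against the null) and the choice of null calibration population are assumptions, and all function names are hypothetical.

```python
import numpy as np

def conformal_p_values(calib_scores, test_scores):
    """p_j = (1 + #{calibration scores >= test score j}) / (n_calib + 1)."""
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    # Count calibration scores that are at least as large as each test score.
    n_ge = len(calib) - np.searchsorted(calib, test, side="left")
    return (1.0 + n_ge) / (len(calib) + 1.0)

def max_p(per_model_p_values):
    """Aggregate per-model p-values for each item by taking the maximum."""
    return np.max(np.stack(per_model_p_values), axis=0)

def benjamini_hochberg(p_values, alpha):
    """Standard BH step-up procedure; returns a boolean mask of selected items."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    selected = np.zeros(m, dtype=bool)
    if passed.size:
        # Select every item up to the largest rank that meets its threshold.
        selected[order[: passed[-1] + 1]] = True
    return selected
```

Taking the per-item maximum means an item is selected only when every audited model yields a small p-value, which keeps a single shared benchmark across models.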

[LG-121] Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series

链接: https://arxiv.org/abs/2605.21542
作者: Andi Xu
类目: Machine Learning (cs.LG)
*备注: Preprint/technical paper. An interpretable neural audit framework for entity-conditioned lag discovery in panel time series. 10 pages, 5 figures, 16 tables. Code available at the GitHub repository

点击查看摘要

Abstract:Country-level temporal panels are widely used in empirical analysis. Researchers often need to audit how different entities respond to historical signals over different time horizons. Current approaches typically do not provide directly auditable entity-specific lag summaries. We formulate entity-conditioned heterogeneous lag discovery as a temporal panel mining task and propose AC-GATE, an Adaptive-Conditioning Encoder with a Scale-Invariant Lag Gate. It instantiates conditional Moderated Distributed Lag by using observable entity-level proxies to condition lag-weight distributions over historical observations, thereby making effective lags structural outputs of the model rather than post-hoc explanations. The evaluation is based on a layered audit protocol that separates predictive calibration from lag discovery. A synthetic panel with known ground-truth lags is used for mechanism recovery testing, and two real-world country-level panels are used for external audit and stress testing. The results show that AC-GATE can recover heterogeneous lag structure in synthetic data, and generates non-degenerate, externally structured effective lags in real data.

[LG-122] DualOptim: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models (ICML 2026)

Link: https://arxiv.org/abs/2605.21539
Authors: Xuyang Zhong, Qizhang Li, Yiwen Guo, Chen Liu
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Abstract:We propose DualOptim+, a novel optimization framework for improving machine unlearning in large language models. It introduces a base state to capture common representations shared by forgetting and retaining objectives and delta states to preserve objective-specific residuals. This architecture allows the optimizer to adaptively bridge shared and decoupled states based on the directional conflict between forgetting and retaining gradients. We further introduce DualOptim+ 8bit, a quantized variant that reduces memory overhead without compromising performance. Extensive experiments across fictitious and real-world unlearning, safety alignment, and multi-task learning tasks demonstrate that DualOptim+ consistently achieves a superior trade-off between different objectives. Codes are available at this https URL.

[LG-123] Neural Acceleration for Graph Partitioning

Link: https://arxiv.org/abs/2605.21519
Authors: Joshua Dennis Booth, Vishvam Patel
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:Graph partitioning is a critical problem in numerous scientific and engineering domains, including social network analysis and VLSI design. Spectral methods are known to produce quality partitions while minimizing edge cuts for a wide range of problems. However, computing the Fiedler vector, an eigenvector associated with the second smallest eigenvalue of the graph Laplacian, remains a significant bottleneck in both memory and computation. In this paper, we present an accelerated approach to spectral bisection partitioning by replacing the traditional eigenvalue calculation with a simple artificial neural network model to approximate the Fiedler vector. We demonstrate that our approach achieves partitioning quality comparable to spectral bisection while significantly reducing the computational overhead, making it more scalable and efficient for large-scale problems.
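For readers unfamiliar with the classical step that this paper replaces, here is a minimal sketch of spectral bisection with an exact dense eigendecomposition. It is the textbook baseline, not the authors' neural approximation; the function name `spectral_bisection` and the dense adjacency-matrix input are assumptions chosen for illustration, and the graph is assumed to be undirected and connected.

```python
import numpy as np

def spectral_bisection(adjacency):
    """Split an undirected, connected graph in two using the Fiedler vector.

    adjacency: symmetric (n, n) array-like of edge weights.
    Returns (part, fiedler), where part is a boolean array assigning
    each vertex to one of the two sides.
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    laplacian = np.diag(degrees) - A  # combinatorial Laplacian L = D - A

    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 is an eigenvector of the second smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]

    # Partition by sign; the overall sign of an eigenvector is arbitrary,
    # so which side is labelled True is not meaningful.
    return fiedler >= 0, fiedler
```

The dense `eigh` call costs cubic time and quadratic memory in the number of vertices, which illustrates the bottleneck the abstract refers to; practical implementations typically use sparse iterative eigensolvers instead.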

[LG-124] Conditional Entropy of Heat Diffusion on Temporal Networks

Link: https://arxiv.org/abs/2605.21514
Authors: Samuel Koovely, Alexandre Bovet
Categories: Social and Information Networks (cs.SI); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)

Abstract:Many complex systems can be modeled by temporal networks, whose organization often evolves through distinct structural phases. Detecting the change points that delimit these phases is both important and challenging. In this work, we extend the conditional entropy of heat diffusion from static graphs to temporal networks and study its properties. We provide an upper bound and explain how discrepancies from it arise from the presence of asymmetric temporal paths. Moreover, we show that this quantity is monotone in time, yielding an information-theoretic analog of the second law of thermodynamics for inhomogeneous diffusion on temporal networks. We then introduce a local version of conditional entropy, designed to probe diffusion over finite temporal windows, and show that it provides an informative signal for change-point detection in continuous-time temporal networks. We evaluate the proposed methodology on synthetic benchmarks, including comparative experiments with existing nonparametric baselines in the snapshot setting, and then apply it to a real-world temporal contact network from a French primary school. Finally, we show how to use detected change points to perform community detection on targeted sub-intervals, improving the quality and interpretability of the clustering results.

[LG-125] Community-Aware Vertex Ordering for Reference-Based Graph Compression: A Cross-Encoder Empirical Study

Link: https://arxiv.org/abs/2605.21510
Authors: Jimmy Dubuisson
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 26 pages, 7 figures, 9 tables. Full reproducibility package at this https URL. Preprint; comments welcome

Abstract:Reference-based graph compression encodes each vertex’s neighbor list relative to a recent vertex, exploiting locality to compress large directed graphs. The dominant tool, WebGraph’s BVGraph, fixes a single encoding pipeline and relies on a separately chosen vertex ordering – typically URL-lexicographic or Layered Label Propagation (LLP). The interaction between ordering and encoder is rarely measured. We propose a two-stage Leiden+LLP vertex ordering – global LLP to seed labels, Leiden community detection, then per-cluster LLP on each induced subgraph – and study how it interacts with reference-based compression. On graphs with poor initial vertex order, reordering saves 0.3 to 5.4 bits per edge on every dataset and encoder we measured. The size of that gain is largely insensitive to the encoder: on four of five weakly ordered datasets, four independently parameterised encoders agree on the Leiden+LLP-vs-plain-LLP gain within roughly +/- 0.04 bpe. On URL-ordered web crawls, where the distributed ordering already encodes locality, adaptive encoders still benefit from reordering, but encoders tuned to URL-induced residual structure (BV-HC, CG at K1) are mildly hurt by it. To quantify how much encoder choice matters once ordering is fixed, we contribute three reference-based encoders – BG, CS, and CG – that perform per-vertex cost-optimal selection from up to 28 candidate decompositions. Each is run under its own best-tested ordering. The best of the three improves over BVGraph high-compression by 2-9% on every dataset tested, with the encoder-level gain consistently smaller than the ordering-level gain on weakly ordered datasets. The encoder framework also yields a self-delimiting bitstream that supports low-overhead random access.

[LG-126] Double descent for least-squares interpolation on contaminated data: A simulation study

Link: https://arxiv.org/abs/2605.21494
Authors: Tino Werner
Categories: Machine Learning (cs.LG)

Abstract:Overparametrized models can exhibit an excellent generalization performance, although they should be prone to overfitting according to classical statistical theory. The discovery of the “double descent”, indicating that the generalization error decreases after a certain model complexity has been reached, opened a new line of research. Robust statistics considers statistical estimation on contaminated data, which, due to assumptions that do not hold on real data, let data points appear as outliers w.r.t. the assumed “ideal” distribution, potentially severely distorting any classical estimator. We address the question whether a double descent phenomenon can be observed in a linear regression setting with contaminated training data. We compare the performance of the highly non-robust least-squares interpolation estimator with several robust alternatives. It turns out that large overparametrization indeed allows for a double descent phenomenon, resulting in a very good generalization performance of the least-squares interpolator, surpassing that of the robust alternatives.

[LG-127] Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

Link: https://arxiv.org/abs/2605.21490
Authors: Danny Butvinik (NICE Actimize), Yonit Marcus (NICE Actimize), Nitzan Tal (NICE Actimize), Gabrielle Azoulay (NICE Actimize)
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 10 pages, 4 figures, one table

Abstract:We introduce the Temporal Contrastive Transformer (TCT), a representation learning framework designed to capture contextual temporal dynamics in sequences of financial transactions. The model is trained using a self-supervised contrastive objective to produce embeddings that encode behavioral patterns over time, with the goal of supporting downstream fraud detection tasks. We evaluate TCT in a realistic setting by using the learned embeddings as input features to a gradient boosting classifier. Experimental results show that embeddings alone achieve meaningful predictive performance (AUC 0.8644), indicating that the model captures non-trivial temporal structure. However, when combined with domain-engineered features, no measurable improvement is observed over the baseline (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. These findings position TCT as a promising representation learning approach that captures relevant behavioral signal, while highlighting the challenges of achieving additive value over strong domain features. The results reflect an intermediate stage in the development of temporal representation learning for financial crime detection and motivate further research on model architecture, training objectives, and integration strategies. At this early stage, achieving performance comparable to a strong feature-engineered baseline is itself a meaningful outcome, indicating that learned representations approximate domain-specific features without manual engineering. While not yet production-ready, these results point to a promising direction for reducing reliance on feature engineering in financial crime detection.

[LG-128] Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

Link: https://arxiv.org/abs/2605.20514
Authors: Dan DeGenaro, Xin Li, Obed Amo, Michael Pokojovy, Sarah Adel Bargal, Markus Lange-Hegermann, Bogdan Raiţă
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 31 pages, 8 figures

Abstract:We introduce FLASH-MAX, a shallow, exact-by-construction neural network architecture for predicting homogeneous electromagnetic fields from sparse pointwise observations. Each hidden neuron represents a separate exact solution to Maxwell’s equations, so that the network satisfies the governing equations symbolically by construction and can be trained end-to-end from sparse data within seconds. We prove a universal approximation result showing that this exact model class remains universal on arbitrary domains. FLASH-MAX reaches sub-1% relative validation error from about 1K sparse pointwise observations in seconds, all while maintaining a zero PDE residual, and keeps single-digit errors even for only 100 observations sampled from 3D space. These results suggest that moving governing structure from the loss into the hypothesis class can dramatically improve the trade-off between precision and optimization speed in scientific machine learning.

[LG-129] Optimization over the intersection of manifolds

Link: https://arxiv.org/abs/2605.22736
Authors: Yan Yang, Bin Gao, Ya-xiang Yuan
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Differential Geometry (math.DG); Numerical Analysis (math.NA)
Comments: 26 pages, 5 figures, 3 tables

Abstract:Optimization over the intersection of two manifolds arises in a broad range of applications, but is hindered by the coupled geometry of the feasible region. In this paper, we prove that the regularities – clean intersection and intrinsic transversality – are equivalent, which yields a tractable projection onto the tangent space of the intersection. Therefore, we propose a geometric method that employs a retraction on only one manifold and updates the iterate along two orthogonal directions. Specifically, the iterates stay on one manifold, and the two directions are responsible for asymptotically approaching the other manifold and decreasing the objective function, respectively. Under intrinsic transversality, we derive the convergence rate for both the feasibility and optimality measures, and show that every accumulation point is first-order stationary. Numerical experiments on problems stemming from sparse and low-rank optimization, including fitting spherical data, approximating hyperbolic embeddings on real data, and computing compressed modes, demonstrate the effectiveness of the proposed method.

[LG-130] Holographic functions and neural networks

Link: https://arxiv.org/abs/2605.22666
Authors: Balazs Szegedy
Categories: Combinatorics (math.CO); Machine Learning (cs.LG); Probability (math.PR)

Abstract:A fuzzy Boolean function is a map $f$ from the $n$-dimensional Boolean cube to $[0,1]$, where $n\in\mathbb{N}$. We introduce and compare three ways of saying that such a function has bounded complexity. The first is a sampling property: the value $f(x)$ can be recovered, up to small error and with high probability, from the values of a bounded number of randomly chosen coordinates of $x$. We call this the holographic property. The second is a structural property: $f$ is uniformly close to a bounded-degree polynomial in boundedly many bounded linear coordinate forms. The third is computational: $f$ is uniformly close to the output of a neural network with a bounded number of non-input neurons, bounded Lipschitz activation functions and bounded incoming weights. We prove that these three properties are equivalent up to quantitative changes of the parameters. The implication from holography to polynomial structure uses a variant of a weak version of hypergraph regularity.

[LG-131] A Martingale Kernel Independence Test

Link: https://arxiv.org/abs/2605.22549
Authors: Felix Laumann, Zhaolu Liu, Mauricio Barahona
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The Hilbert-Schmidt Independence Criterion (HSIC) and its joint-independence extension $d\mathrm{HSIC}$ are degenerate $V$-statistics whose data-dependent weighted-$\chi^2$ null limits force a permutation calibration that multiplies the per-test cost by the number of permutations, in practice two orders of magnitude. Adapting the recent martingale MMD construction for two-sample testing to the (joint) independence problem, we introduce two studentised statistics whose null distributions are standard normal regardless of the data law, so that a single normal-quantile lookup replaces the permutation step entirely. The first, $m\mathrm{HSIC}$, is a self-normalised lower-triangular sum of the Hadamard product of two empirically centred Gram matrices. Under independence and bounded-fourth-moment kernels it converges to a standard normal. It is consistent against every fixed alternative, and runs at quadratic cost in the sample size without any sample split, matching the biased HSIC $V$-statistic. Our second statistic, $md\mathrm{HSIC}$, achieves finite-sample consistency with a single half-sample split: the centring is estimated on one half and the lower-triangular self-normalised martingale is run on the other, shrinking the conditional-mean residual to a quantity that is exponentially small in $d$, so the statistic is asymptotically standard normal at every fixed number of jointly tested variables, with a per-test cost that grows only linearly in $d$. On synthetic data with per-variable input dimension from 1 to 500 and between 2 and 10 jointly tested variables, both statistics match the empirical type-I error rate and test power of permutation-calibrated baselines while running 25 to $60\times$ faster.
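As background for the cost comparison in this abstract, the biased HSIC V-statistic can be written as a sum over the Hadamard product of two centred Gram matrices. The sketch below computes only that classical baseline statistic; it does not implement the martingale statistics proposed in the paper, whose exact normalisation the abstract does not give. The Gaussian kernel, the fixed `bandwidth`, and the function names `gaussian_gram` and `biased_hsic` are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Gaussian-kernel Gram matrix for samples stacked along axis 0."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)
    sq_norms = (x ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def biased_hsic(x, y, bandwidth=1.0):
    """Biased HSIC V-statistic: trace(K H L H) / n^2."""
    n = len(x)
    K = gaussian_gram(x, bandwidth)
    L = gaussian_gram(y, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    Kc = H @ K @ H
    Lc = H @ L @ H
    # For symmetric matrices, the sum of the Hadamard product equals
    # trace(Kc @ Lc), which in turn equals trace(K H L H).
    return float((Kc * Lc).sum() / n ** 2)
```

Building the Gram matrices and taking the elementwise sum are both quadratic in the sample size (the explicit centring products above are cubic as written and could be replaced by row and column mean subtraction). A permutation test repeats the statistic once per permutation, which is the overhead the abstract's normal-quantile calibration is meant to remove.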

[LG-132] Reinforcement learning for ion shuttling on trapped-ion quantum computers

Link: https://arxiv.org/abs/2605.22463
Authors: Maximilian Schier, Lea Richtmann, Christian Staufenbiel, Tobias Schmale, Daniel Borcherding, Michèle Heurs, Bodo Rosenhahn
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 15 pages + 9 pages supplementary material, 6 figures

Abstract:Scalable trapped-ion quantum computing is commonly realized with modular chips that feature distinct zones with specific functionalities, such as storage, state preparation, and gate execution. To execute a quantum circuit, the ions must be transported between these zones. This process is called ion shuttling. To achieve reliable computation results, the shuttling process must be optimized. However, as the number of ions increases, this becomes a high-dimensional optimization problem where optimal solutions cannot be computed efficiently. We demonstrate, to the best of our knowledge, the first use of reinforcement learning (RL) for the optimization of ion shuttling. RL is well-suited for such scenarios, as it enables learning a strategy through direct interaction with the problem. We show that our RL approach outperforms current state-of-the-art heuristic techniques, yielding a reduction in shuttling operations of up to 36.3 %. Furthermore, we show that our method is easily applicable to various chip architectures. Our approach offers a versatile method to study shuttling efficiency during chip design and, therefore, a highly relevant tool for future, more complex architectures.

[LG-133] Do Not Trust The Auctioneer: Learning to Bid in Feedback-Manipulated Auctions

Link: https://arxiv.org/abs/2605.22438
Authors: Luigi Foscari, Matilde Tullii, Vianney Perchet
Categories: Machine Learning (stat.ML); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Shilling is the use of artificial bids to make competition appear stronger and push prices upward. We study repeated first-price auctions in which shilling affects feedback but not allocation: the learner wins or loses against the real competing bid, but after a loss observes the maximum of the real bid and an independent shill bid. Thus the manipulation changes what the learner observes and hence how it learns to bid, without changing the outcome of the current auction. We analyze regret with respect to the best bid benchmark, assuming that the shill-bid distribution is known. Even then, shilling can mask the real bid, while useful side information appears only through intermittent low-shill events. Our algorithm combines a robust interval-elimination branch, which ignores the shilled report and achieves the dynamic-pricing rate $\tilde{\mathcal{O}}(T^{2/3})$, with an optimistic branch that debiases losing-side reports and exploits the resulting suffix information when it is reliable and achieves the first-price auctions rate $\tilde{\mathcal{O}}(\sqrt{T})$. A validation and racing procedure lets the algorithm use these optimistic updates without knowing the right scale or feedback geometry in advance. We complement the upper bounds with a matching lower bound, up to logarithmic factors, in the single-active-region case. Overall, the results show that even feedback-only shilling can sharply alter the statistical difficulty of repeated bidding.

[LG-134] Departure from Regularity: Degree Heterogeneity and Eigengap as the Structural Drivers of ASE-LSE Latent Subspace Disagreement

Link: https://arxiv.org/abs/2605.22346
Authors: Minh Triet Pham, Ian Gallagher
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 12 pages (excluding references + appendices), 5 figures

Abstract:Two of the most widely used methods for analysing graph data, Adjacency Spectral Embedding and Laplacian Spectral Embedding, often produce different results when applied to the same network. Yet the structural reasons behind this disagreement remain incompletely understood. This paper provides a structural account. We show that regularity is a sufficient condition for perfect agreement: when every node has the same number of connections, the two methods produce identical latent subspaces. Any departure from this regularity introduces disagreement, and we prove an explicit bound whose two terms suggest the structural ingredients controlling it: degree heterogeneity, which pushes the methods apart, and community structure strength, which pulls them back together. We validate both drivers empirically across thousands of simulated networks, confirming that heterogeneity drives disagreement up, community strength suppresses it, and their ratio provides a strong predictor of when the two embeddings can be treated as interchangeable and when they cannot.

[LG-135] Spectra as Language: Large Language Models for Scalable Stellar Parameter and Abundance Inference

Link: https://arxiv.org/abs/2605.22162
Authors: Hai-Ling Lu, Yu-Yang Li, Yin-Bi Li, Cun-Shi Wang, A-Li Luo, Jun-Chao Liang, Shuo Li
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)

Abstract:Stellar spectra encode key information on the physical properties and chemical compositions of stars. Accurate stellar parameter determination is essential for addressing major questions such as galaxy and stellar evolution. Large-scale spectroscopic surveys have accumulated unprecedented spectral data. Traditional feature extraction or model-fitting approaches struggle with high-dimensional, massive datasets, limited generalization, and computational inefficiency. Recent advances in large language models demonstrate strong generalization and feature-learning in tasks like natural language processing, DNA/RNA sequence analysis, and protein/chemical parsing. Stellar spectra are continuous sequential signals, enabling the transfer of language models to stellar spectroscopy. Here, we propose a two-stage large language model framework for stellar parameter inference, achieving accurate estimation of effective temperature, surface gravity, metallicity, and abundances of ~20 chemical elements. Scaling-law analyses show systematic performance improvements with increasing data, providing a scalable framework for forthcoming large-scale surveys.

[LG-136] From Betting to Empirical Bernstein LIL

Link: https://arxiv.org/abs/2605.22124
Authors: Francesco Orabona
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Abstract:This is a verbatim copy of a technical report I wrote in 2017-2018 to obtain the law of the iterated logarithm using the guarantee on the wealth of an online betting strategy.

[LG-137] Self-Supervised ConvLSTM for Fermi Large Area Telescope Transient Detection

Link: https://arxiv.org/abs/2605.22112
Authors: Alberto Garinei, Stefano Speziali, Alessandro Vispa, Andrea Marini, Sara Cutini, Emanuele Piccioni, Marcello Marconi, Francesco Longo, Matteo Martini, Francesca Fallucchi, Romeo Giuliano, Ernesto William De Luca, Umberto Di Matteo, Sabino Meola
Categories: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures. Accepted for publication in Astronomy and Computing. Author-accepted manuscript version

Abstract:We present a framework for detecting transient gamma-ray phenomena in a controlled environment by combining end-to-end simulations of the Fermi-LAT sky with self-supervised spatio-temporal deep learning. We generate a ten-year synthetic Universe with gtobssim and process the simulated events into daily all-sky maps of counts and exposure, obtaining a time-ordered sequence that mirrors the structure of Fermi-LAT observations. To model the nominal evolution of the sky, we employ a Convolutional Long Short-Term Memory (ConvLSTM) network that operates directly on map sequences, preserving spatial locality while learning temporal dependencies. The model is trained to reconstruct expected emission, and departures from the learned baseline are quantified through pixel-wise mean-squared residual maps. We then define statistically motivated anomaly criteria by estimating per-pixel thresholds from the residual distribution on the training set, and we enforce spatial coherence via local filtering to suppress isolated fluctuations. The ConvLSTM is then deployed as a trained predictor on Fermi-LAT daily maps, where the sky can depart from the nominal behavior because of genuine astrophysical variability and instrumental non-stationarities. The resulting pipeline flags localized, time-dependent excesses consistent with highly variable sources or transient events (e.g., flares or GRBs) and provides a benchmark for evaluating anomaly-detection strategies on long-duration, Fermi-LAT-like datasets.

[LG-138] Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices

Link: https://arxiv.org/abs/2605.22097
Authors: Farah Elnakhal, Alberto Marchisio, Nouhaila Innan, Gabriel Falcao, Muhammad Shafique
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Photonic quantum computing is a promising platform for scalable quantum machine learning, but designing effective hybrid architectures remains challenging under hardware and optimization constraints. Existing approaches rely on manually tuned architectures that fail to account for the collaboration between classical preprocessing, phase encoding, and photonic circuit structure, limiting both accuracy and hardware compatibility. In this paper, we propose a neural architecture search framework for hybrid photonic quantum-classical models that combines genetic algorithm-based search with learnable quantum phase encoding to systematically explore the joint design space of classical and quantum components. Our framework encodes 19 hyperparameters across six gene groups and evolves a population of hybrid architectures using group-based crossover, per-gene mutation, and elitism, evaluating each candidate on a short training budget before full retraining of the best found design. We evaluate our framework on two image classification benchmarks, Digits and MNIST, achieving final validation accuracies of 99.44% and 98.78%, respectively, with first-principles execution time estimates on the Quandela Ascella photonic QPU projecting single-image inference at 67 ms (Digits) and 149 ms (MNIST). Our quantum contribution analysis further shows that the photonic layer extracts non-redundant features orthogonal to the classical pathway, providing a measurable accuracy advantage over classical-only baselines. Our results demonstrate that automated architecture search is both practical and impactful for hybrid photonic systems, opening the way for systematic design space exploration of quantum AI on photonic devices.

[LG-139] Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks

Link: https://arxiv.org/abs/2605.22010
Authors: Margalit Glasgow, Joan Bruna
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 46 pages

Abstract:We consider one-hidden-layer neural networks trained in the feature-learning regime using gradient descent, and relate the output of the finite-width network $f_{\hat\rho_t^m}$ to its infinite-width counterpart $f_{\rho_t^{MF}}$, which evolves in the mean-field dynamics. While constant-time horizon bounds for $|f_{\rho_t^{MF}} - f_{\hat\rho_t^m}|$ may be obtained via standard Grönwall estimates, the long-time behavior of the fluctuation is a more delicate matter. Uniform-in-time bounds often rely on (local) strong convexity in the landscape or Logarithmic Sobolev inequalities present in noisy gradient dynamics. In this work, we establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting instead the convergence rate of the mean-field deterministic Wasserstein-gradient-flow dynamics. Specifically, denoting by $L_t$ the mean-field excess MSE loss at time $t$ and $m$ the number of neurons, under standard regularity assumptions and the condition $\int_0^\infty L_t^{1/2}\,dt = O(\log d)$, we obtain the uniform-in-time bound $|f_{\rho_t^{MF}} - f_{\hat\rho_t^m}|^2 \lesssim \mathrm{poly}(d)\, m^{-\min(1,c/6)}$ whenever $L_t \lesssim t^{-c}$. Our result holds in a noiseless setting and does not make any assumptions on the geometry of the landscape near the optimum, and extends seamlessly to other forms of discretization, including finite number of samples and time discretization. A key takeaway of our result is that whenever the convergence rate of the mean-field, population-loss dynamics is faster than $t^{-2}$, we can attain a loss of $\epsilon$ with only $\mathrm{poly}(d/\epsilon)$ neurons, training samples, and GD steps.

[LG-140] A2QTGN: Adaptive Amplitude Quantum-Integrated Temporal Graph Network for Dynamic Link Prediction

Link: https://arxiv.org/abs/2605.21916
Authors: Nouhaila Innan, M. Murali Karthick, Simeon Kandan Sonar, Vivek Chaturvedi, Muhammad Shafique
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:Dynamic link prediction is important for modeling evolving interactions in complex systems, including social, communication, financial, and transportation networks. Classical temporal graph models capture sequential dependencies, but they may struggle to represent concurrent and rapidly changing node-edge interactions in large dynamic graphs. We propose A2QTGN (Adaptive Amplitude Quantum-Integrated Temporal Graph Network), a hybrid quantum-classical framework that combines adaptive amplitude encoding with a Temporal Graph Network backbone. The proposed mechanism represents node interaction features as quantum states and selectively refreshes amplitude embeddings based on temporal activity, preserving stable node states while emphasizing meaningful structural changes. This design reduces unnecessary quantum re-encoding and improves temporal representation for link prediction. Experiments on five Temporal Graph Benchmark datasets show that A2QTGN achieves strong predictive and ranking performance across diverse dynamic graphs. Ablation studies confirm the importance of both the quantum embedding module and the adaptive update strategy, while hardware-aware inference using a noisy backend and limited real-device execution supports the feasibility of near-term quantum-assisted temporal graph learning.

[LG-141] PhylaFlow: Hybrid Flow Matching in Billera-Holmes-Vogtmann Tree Space for Phylogenetic Inference

Link: https://arxiv.org/abs/2605.21859
Authors: Yasha Ektefaie, Leo Cui, Shrey Jain, Marinka Zitnik, Pardis Sabeti
Categories: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 9 pages, 3 figures

Abstract:Phylogenetic trees are hybrid objects: branch lengths vary continuously, while topologies change discretely through edge contractions and expansions. Billera-Holmes-Vogtmann (BHV) tree space provides a canonical geometry for this structure, representing each resolved topology as a Euclidean orthant and topological changes as motion across shared lower-dimensional boundaries. We introduce PhylaFlow, a hybrid flow-matching model that learns posterior-basin transport in BHV tree space. PhylaFlow is trained on BHV geodesic paths from random starting trees to short-run posterior samples, coupling continuous branch-length motion within orthants with learned boundary events and discrete topology transitions. We evaluate the learned geometry operationally: if the flow reaches posterior-relevant regions, finite-budget Bayesian refinement initialized from, or guided by, its terminal trees should recover posterior-supported topologies more efficiently. Across DS1-DS8 phylogenetic posterior benchmarks, PhylaFlow substantially reduces initial Tree-KL relative to classical initializers. After finite-budget MrBayes refinement, direct PhylaFlow improves early and intermediate topology-recovery trajectories on most datasets, while split-guided PhylaFlow-MCMC obtains the strongest hard-case results. The best PhylaFlow variant outperforms short-warmup on seven of eight datasets and PhyloGFN on five of eight under the same refinement budget. In a joint sequence-conditioned experiment, sequence embeddings steer posterior split recovery, although exact posterior topology recovery remains preliminary. These results show that hybrid flow matching can learn actionable transport in BHV tree space and provide a geometry-aware proposal mechanism for Bayesian phylogenetic inference.

[LG-142] Causal Discovery in Structural VAR Models Under Equal Noise Variance

Link: https://arxiv.org/abs/2605.21846
Authors: SeyedSina Seyedi HasanAbadi, Fahimeh Arab, Erfan Nozari, AmirEmad Ghassami
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Causal discovery from multivariate time series is challenging when causal effects may occur both across time and within the same sampling interval. This issue is especially important in applications such as neuroscience, where the sampling rate may be coarse relative to the underlying dynamics and contemporaneous effects need not form an acyclic graph. We study causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption, meaning that the structural noise terms have a common variance. Unlike the DAG-based cross-sectional equal noise variance setting, the time-series setting considered here does not generally yield point identification of a unique causal graph. Instead, multiple structural VAR parameterizations can induce the same stationary observed process law. We introduce a notion of observational equivalence tailored to this setting and show that the corresponding equivalence class is characterized by orthogonal transformations of the structural equations together with a global positive scale. This characterization leads to an equivalence-aware model discrepancy, the observational alignment discrepancy, which compares structural models modulo transformations that preserve the observed law. Building on this theory, we propose ENVAR, a sparsity-based procedure that searches over the induced observational equivalence class for a sparse normalized structural representative. We evaluate the proposed methodology on synthetic structural VAR data and on an fMRI dataset.
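
The equivalence class described in the abstract can be checked numerically on a toy model. The sketch below assumes a two-variable, lag-one structural form A0 x_t = A1 x_{t-1} + e_t with Cov(e_t) = sigma2 * I; this parameterization, the matrices, and the helper names are assumptions for illustration, not taken from the paper. Left-multiplying the structural equations by c * Q, with Q orthogonal and c > 0, changes the structural matrices but leaves the reduced-form coefficient and innovation covariance untouched, and those two quantities determine the law of a stationary Gaussian VAR(1).

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def scale(c, a):
    return [[c * v for v in row] for row in a]


def inv2(a):
    """Inverse of a 2x2 matrix."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def reduced_form(a0, a1, sigma2):
    """Return (A0^-1 A1, sigma2 * A0^-1 A0^-T) for the assumed structural form."""
    a0_inv = inv2(a0)
    coef = matmul(a0_inv, a1)
    innov_cov = scale(sigma2, matmul(a0_inv, transpose(a0_inv)))
    return coef, innov_cov


def transform(a0, a1, sigma2, theta, c):
    """Left-multiply the structural equations by c * Q(theta), Q a rotation."""
    q = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
    cq = scale(c, q)
    return matmul(cq, a0), matmul(cq, a1), c * c * sigma2


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


a0 = [[1.0, -0.4], [0.3, 1.0]]   # contemporaneous effects form a cycle
a1 = [[0.5, 0.1], [0.0, 0.2]]
b0, b1, s2 = transform(a0, a1, 1.0, theta=0.7, c=2.0)
coef_a, cov_a = reduced_form(a0, a1, 1.0)
coef_b, cov_b = reduced_form(b0, b1, s2)
print("structural gap:", max_abs_diff(a0, b0))
print("reduced-form gap:", max_abs_diff(coef_a, coef_b), max_abs_diff(cov_a, cov_b))
```

Because the two parameterizations cannot be told apart from the observed process, some additional criterion is needed to pick a representative; the abstract states that ENVAR uses sparsity for this.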

[LG-143] Truncated Neural Likelihood Estimation for Simulation-Based Inference in State-Space Models

Link: https://arxiv.org/abs/2605.21805
Authors: Kostas Tsampourakis, Víctor Elvira
Categories: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:State-space models (SSMs) are powerful probabilistic tools for modeling time-varying systems with latent dynamics. Inference in SSMs involves the estimation of latent states and parameters. In this work, we focus on parameter inference, which for SSMs is in general a very challenging problem due to the intractability of the likelihood. Recently, neural estimation methods, such as sequential neural likelihood (SNL), have shown promising results in Bayesian inference problems. In this paper, we show that SNL, when applied to the SSM setting, suffers from important limitations, such as requiring a large number of simulated samples to achieve moderate performance, scaling poorly with sequence length, and not being amortized. We then introduce a novel inference algorithm called truncated-SNL (T-SNL), which addresses these limitations. Our algorithm is more accurate, more stable and robust during training, more scalable to longer temporal sequences, and can be amortized when new observations become available. Our experiments show that T-SNL is a sample-efficient, robust, and flexible algorithm that outperforms other approaches.

[LG-144] MetaDNS: Enhancing Exploration in Discrete Neural Samplers via Well-Tempered Metadynamics ICML2026

Link: https://arxiv.org/abs/2605.21722
Authors: Xiaochen Du, Juno Nam, Jaemoo Choi, Wei Guo, Sathya Edamadaka, Junyi Sha, Elton Pan, Yongxin Chen, Molei Tao, Rafael Gómez-Bombarelli
Categories: Statistical Mechanics (cond-mat.stat-mech); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Sampling from discrete distributions with multiple modes and energy barriers is fundamental to machine learning and computational physics. Recent discrete neural samplers like MDNS suffer from mode collapse and fail to sample the high-energy barrier regions between modes, even though samples from these regions are critical for free energy estimation and understanding phase transitions. We propose Metadynamics Discrete Neural Sampler (MetaDNS), a general framework integrating well-tempered metadynamics into discrete diffusion or autoregressive samplers. By maintaining an adaptive, history-dependent bias potential along selected low-dimensional coordinates, MetaDNS forces exploration of previously inaccessible regions, enabling free energy reconstruction that is infeasible with standard neural samplers due to a lack of high-energy samples. On challenging low-temperature benchmarks including Ising, Potts, and the copper-gold binary alloy, MetaDNS reproduces the thermodynamic distribution. Compared to MCMC-based metadynamics, MetaDNS also achieves comparable exploration while requiring fewer bias deposition steps.
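
For readers unfamiliar with the history-dependent bias mentioned above, here is a minimal sketch of the standard well-tempered deposition rule on a discrete collective variable. It is not MetaDNS: the abstract does not give the sampler-side details, so the point-mass (per-bin) hill, the dictionary representation, and all names are assumptions. The only ingredient taken from well-tempered metadynamics is that each new hill is scaled by exp(-V(s) / (k_B * dT)), so hills shrink where bias has already accumulated.

```python
import math


def deposit(bias, s, height, kb_delta_t):
    """Add one well-tempered hill at discrete CV value s and return its size.

    bias        dict mapping CV value -> accumulated bias V(s)
    height      initial hill height w
    kb_delta_t  k_B * Delta T, the tempering scale (assumed units match bias)
    """
    current = bias.get(s, 0.0)
    increment = height * math.exp(-current / kb_delta_t)
    bias[s] = current + increment
    return increment


def biased_log_weight(energy, s, bias, kt):
    """Unnormalized log-probability under the biased target exp(-(E + V(s)) / kT)."""
    return -(energy + bias.get(s, 0.0)) / kt


bias = {}
hills = [deposit(bias, s=0, height=1.0, kb_delta_t=2.0) for _ in range(5)]
print("hill sizes:", hills)
print("accumulated bias at s=0:", bias[0])
```

Repeated visits to the same value of the collective variable raise its bias by ever smaller amounts, which lowers its biased log-weight and pushes a sampler trained on the biased target toward values it has not yet visited.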

[LG-145] Adaptive RBF-KAN: A Comparative Evaluation of Dynamic Shape Parameters in Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2605.21534
Authors: Roberto Cavoretto, Alessandra De Rossi, Adeeba Haider, Amir Noorizadegan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Kolmogorov-Arnold Networks (KANs) approximate multivariate functions using learnable univariate edge functions, typically parameterized by B-spline bases. Although effective, spline-based implementations can be computationally expensive. A modified version of KANs, called FastKAN, improves efficiency by replacing splines with Gaussian radial basis functions (RBFs), but it relies on a fixed kernel and shape parameter. In this work, we extend the RBF-based KAN framework by introducing a broader family of radial basis kernels and by initializing the kernel shape parameter using leave-one-out cross-validation (LOOCV). To the best of our knowledge, this is the first study that integrates LOOCV-based kernel scale estimation with deep KAN training. We also introduce Matérn and Wendland kernels into the KAN framework for the first time, enabling more flexible basis representations beyond the Gaussian kernel used in FastKAN. The LOOCV estimate provides a data-driven initialization of the kernel scale, which is subsequently refined during network training. The proposed adaptive RBF-KAN is evaluated on several two-dimensional benchmark functions. The results highlight the importance of kernel selection and adaptive shape parameters, with different kernels showing advantages for smooth functions, discontinuities, and oscillatory patterns. Overall, combining LOOCV-based initialization with adaptive kernel learning provides a practical strategy for improving RBF-based KAN models.
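
To make the kernel families and the LOOCV idea concrete, the sketch below defines one common form of each kernel with a shape parameter eps and picks eps by brute-force leave-one-out error on a 1D interpolation problem. This is an assumption-laden illustration rather than the paper's procedure: the abstract does not say which Matérn or Wendland orders are used, so the C2 variants are assumed, and the exhaustive refit stands in for whatever LOOCV formula the authors use. All function names are hypothetical.

```python
import math


def gaussian(r, eps):
    return math.exp(-(eps * r) ** 2)


def matern_c2(r, eps):
    # Matern kernel with smoothness 3/2, in the (1 + eps*r) * exp(-eps*r) convention.
    return (1.0 + eps * r) * math.exp(-eps * r)


def wendland_c2(r, eps):
    # Compactly supported: identically zero once eps * r >= 1.
    t = eps * r
    return (1.0 - t) ** 4 * (4.0 * t + 1.0) if t < 1.0 else 0.0


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def loocv_error(xs, ys, kernel, eps):
    """Mean squared leave-one-out error of the RBF interpolant."""
    total = 0.0
    for i in range(len(xs)):
        xt, yt = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        gram = [[kernel(abs(p - q), eps) for q in xt] for p in xt]
        w = solve(gram, yt)
        pred = sum(wj * kernel(abs(xs[i] - q), eps) for wj, q in zip(w, xt))
        total += (pred - ys[i]) ** 2
    return total / len(xs)


def select_shape(xs, ys, kernel, candidates):
    return min(candidates, key=lambda e: loocv_error(xs, ys, kernel, e))


xs = [i / 7 for i in range(8)]
ys = [math.sin(2 * math.pi * x) for x in xs]
for name, k in (("gaussian", gaussian), ("matern", matern_c2), ("wendland", wendland_c2)):
    print(name, select_shape(xs, ys, k, [4.0, 8.0, 16.0]))
```

In the setting described by the abstract, a value chosen this way would serve only as the starting point for eps, which is then refined during network training.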

[LG-146] Conditional Neural Field based Reduced Order Model for Dynamic Ditching Load Prediction

Link: https://arxiv.org/abs/2605.21499
Authors: Henning Schwarz, Pyei Phyo Lin, Jens-Peter M. Zemke, Thomas Rung
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:Grid-based neural networks such as convolutional autoencoders are widely used in dimension reduction-based surrogate models for computational fluid dynamics. In recent years, coordinate-based approaches such as conditional neural fields have emerged. Their independence of the spatial discretization is a beneficial feature for various applications in computational fluid dynamics. This paper discusses the spatio-temporal prediction of aircraft ditching loads using a conditional neural field approach. The model is evaluated using two datasets for the dynamic loads on the fuselage of a DLR-D150 aircraft, one of which uses a single fixed spatial discretization and the other of which includes data from different discretizations. When paired with a long short-term memory (LSTM) network in the latent space, the neural field-based model achieves a spatio-temporal prediction accuracy for the first dataset that is close to that of grid-dependent convolutional autoencoder-based models, with significantly fewer parameters. Results for the second dataset demonstrate the ability of the neural field-based approach to reconstruct ditching loads accurately for heterogeneous spatial discretizations. This allows for flexible use of training datasets generated for different geometries and/or discretizations, as well as the use of the surrogate model to predict loads for different configurations.
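
The discretization independence mentioned above follows from the model's signature: a conditional neural field maps a coordinate plus a latent code to a value, so it can be queried on any set of points. The sketch below uses a tiny untrained network with random weights purely to show that interface; the architecture, sizes, and names are assumptions and have nothing to do with the paper's trained model.

```python
import math
import random


def make_field(latent_dim, hidden=8, seed=0):
    """Build a toy field f(coord, latent) -> scalar with fixed random weights."""
    rng = random.Random(seed)
    in_dim = 2 + latent_dim  # (x, y) coordinate concatenated with the latent code
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]

    def field(coord, latent):
        inp = list(coord) + list(latent)
        h = [math.tanh(sum(w * v for w, v in zip(row, inp)) + b)
             for row, b in zip(w1, b1)]
        return sum(w * v for w, v in zip(w2, h))

    return field


def evaluate(field, points, latent):
    """Query the field at arbitrary points for one latent state."""
    return [field(p, latent) for p in points]


field = make_field(latent_dim=2)
latent = [0.3, -1.2]  # stands in for one time step's latent state
coarse = [(i / 2, j / 2) for i in range(3) for j in range(3)]
fine = [(i / 4, j / 4) for i in range(5) for j in range(5)]
print(len(evaluate(field, coarse, latent)), len(evaluate(field, fine, latent)))
```

The same field and latent code serve both the 3x3 and the 5x5 point sets, whereas a convolutional autoencoder is tied to the grid it was built for. In the setup the abstract describes, an LSTM would advance the latent code in time.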

[LG-147] Multi-Modal Machine Learning for Population- and Subject-Specific lncRNA-Type 2 Diabetes Association Analysis

Link: https://arxiv.org/abs/2605.20747
Authors: Ashwani Siwach, Sanjeev Narayan Sharma, Sunil Datt Sharma
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:Long non-coding RNAs (lncRNAs) are emerging regulatory molecules implicated in chronic disease pathogenesis, including Type 2 Diabetes Mellitus (T2D). We investigated ten literature-reported lncRNAs associated with T2D (MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, and HOTAIR) across two independent population-based RNA-seq cohorts. Because single-omics approaches provide an incomplete view of disease biology, an integrative multi-feature framework was developed, extracting expression, secondary-structure, and sequence features for each lncRNA. Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes to ensure robust performance estimation. SHAP analysis was applied for subject-level association interpretation. In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D, while in the second cohort MALAT1 expression and KCNQ1OT1, ANRIL, and MEG3 sequence features were found to be associated. MEG3 was identified by SHAP as the dominant lncRNA in both cohorts. ML results were consistent with established statistical methods while additionally providing population- and subject-level disease association profiles linked to specific molecular feature types. The proposed framework advances mechanistic understanding of T2D and supports lncRNA-based precision medicine.
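
Two of the evaluation schemes named in the abstract, stratified k-fold and leave-one-out cross-validation, differ only in how the index splits are built. The sketch below generates both kinds of split in plain Python; it is a generic illustration with hypothetical function names and toy labels, not the authors' pipeline, and it assigns samples to folds in order without shuffling.

```python
from collections import defaultdict


def stratified_kfold(labels, k):
    """Return k (train_idx, test_idx) pairs with class proportions kept per fold."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class round-robin so every fold gets a share of it.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    everything = set(range(len(labels)))
    return [(sorted(everything - set(f)), sorted(f)) for f in folds]


def leave_one_out(n):
    """Return n (train_idx, test_idx) pairs, each holding out a single sample."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]


labels = [0] * 6 + [1] * 4  # toy case/control labels
for train_idx, test_idx in stratified_kfold(labels, k=2):
    print("test fold:", test_idx, "classes:", [labels[i] for i in test_idx])
print("LOOCV splits:", len(leave_one_out(len(labels))))
```

Stratification matters most for small or imbalanced cohorts, where a plain random split can leave a test fold with almost no samples of one class.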