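This post lists the latest papers retrieved from arXiv.org on 2026-05-18. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: the daily paper data is fetched from arXiv.org, and the list is refreshed automatically at around 12:30 each morning.
Tip: if the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day where possible.
Contents
Overview (2026-05-18)
633 papers were updated today, including:
- Natural language processing: 80 papers (Computation and Language (cs.CL))
- Artificial intelligence: 201 papers (Artificial Intelligence (cs.AI))
- Computer vision: 135 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 196 papers (Machine Learning (cs.LG))
- Multi-agent systems: 18 papers (Multiagent Systems (cs.MA))
- Information retrieval: 14 papers (Information Retrieval (cs.IR))
- Human-computer interaction: 19 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems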
[MA-0] FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
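[Quick Read]: This paper asks whether LLM agents can improve their decision-making through self-generated natural-language memory without gradient updates, particularly in a stochastic network-defense POMDP (Partially Observable Markov Decision Process) where zero-shot policies produce strongly negative returns and heavy-tailed failure rates. The key to the solution is FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol. A Reflexion-style inner loop converts failed trajectories into reusable natural-language knowledge (heuristic Rules, few-shot Examples, or a Mixed form); an outer loop then broadcasts the best-performing instance's memory across the population (population broadcast) and uses a graduation criterion to freeze converged instances and save compute. Ablations show that population broadcast is the core mechanism driving the gains, while graduation mainly reduces compute cost. Across models (such as Gemini-2.5-Flash-Lite and Grok-4-Fast) and memory representations (Examples gives the strongest returns for most models, while Rules offers the best cost-reliability balance at lower token consumption), FORGE improves average evaluation return (1.7-7.7× over zero-shot and 29-72% over the Reflexion baseline) and substantially lowers the major-failure rate.
Link: https://arxiv.org/abs/2605.16233
Authors: Igor Bogdanov,Chung-Horng Lung,Thomas Kunz,Jie Gao,Adrian Taylor,Marzia Zaman
Affiliations: Carleton University; Defence R&D Canada; Cistel Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: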
Abstract:Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance’s memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
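The outer loop described in the abstract (propagate the best instance's memory between stages, freeze converged instances) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `Instance`, `forge_outer_loop`, the `run_stage` callback standing in for the Reflexion-style inner loop, and the score-threshold form of the graduation criterion are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    memory: list = field(default_factory=list)  # natural-language rules or examples
    score: float = float("-inf")                # most recent evaluation return
    graduated: bool = False                     # frozen instances skip later stages


def forge_outer_loop(population, run_stage, n_stages, graduation_threshold):
    """Toy outer loop: run every active instance, then broadcast the best memory.

    run_stage(memory, stage) -> (new_memory, score) is a hypothetical stand-in
    for the inner reflection loop; the threshold rule is an assumed criterion.
    """
    for stage in range(n_stages):
        active = [inst for inst in population if not inst.graduated]
        if not active:
            break  # everything has graduated, so no more compute is spent
        for inst in active:
            inst.memory, inst.score = run_stage(inst.memory, stage)
            if inst.score >= graduation_threshold:
                inst.graduated = True
        best = max(population, key=lambda inst: inst.score)
        for inst in population:
            if inst is not best and not inst.graduated:
                inst.memory = list(best.memory)  # population broadcast (copy)
    return population
```

In this sketch graduation only removes instances from later stages, which mirrors the abstract's observation that graduation saves compute while broadcast carries the gains.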
[MA-1] Context Reasoning and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
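[Quick Read]: This paper addresses the lack of systematic design guidance for deploying compound LLM agents in adversarial, partially observable sequential environments: practitioners do not know which design choices (context representation, reasoning mechanism, task decomposition) actually improve performance rather than merely raising inference cost. The key to the solution is a controlled study, run in the CybORG CAGE-2 POMDP environment, that quantifies the performance-cost trade-off of each design dimension. The main findings are: programmatic state abstraction yields the highest return per token, improving mean return by up to 76% over raw observations; distributing deliberation tools across a hierarchy markedly degrades performance (up to 3.4× worse) while consuming more tokens, a destructive pattern the authors call a "deliberation cascade"; and hierarchical decomposition without deliberation performs best on most models, with context engineering generally more cost-effective than deliberation. The resulting design principle is to invest first in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, because these strategies can interfere with each other when combined.
Link: https://arxiv.org/abs/2605.16205
Authors: Igor Bogdanov,Chung-Horng Lung,Thomas Kunz,Jie Gao,Adrian Taylor,Marzia Zaman
Affiliations: Carleton University; Defence R&D Canada; Cistel Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: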
Abstract:Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
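[MA-2] paper.json: A Coordination Convention for LLM-Agent-Actionable Papers
[Quick Read]: This paper targets three systematic failures that standard prose produces when LLM agents act as the first readers of academic papers: sub-claims cannot be cited at sub-paper granularity, the claimed scope is overextended beyond what was actually tested (scope overextension), and figure commands are buried in the codebase rather than in the paper itself. The core of the solution is a lightweight JSON file (paper.json) released alongside the PDF, which aligns machine-readable and human-readable information through five conventions: C1 (stable claim IDs), C2 (an explicit "does-not-claim" list), C3 (exact shell commands for each figure), C5 (stable definition IDs), and C4 (minimum viable compliance: hand-written JSON alongside the PDF, achievable in under an hour without modifying the human-readable output). The key mechanism is explicit, citable metadata that lets LLM agents parse sub-paper claims, see the boundary of what the paper claims, and retrieve executable commands directly, which removes the semantic ambiguity and granularity mismatch inherent in prose. The paper validates the scheme on itself: running uv run this http URL this http URL --against this http URL passes the compliance check.
Link: https://arxiv.org/abs/2605.16194
Authors: Arquimedes Canedo
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: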
Abstract:LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose paper.json, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: uv run this http URL this http URL --against this http URL passes. Repo: this https URL
[MA-3] Multi-Agent Cooperative Transportation: Optimal and Efficient Task Allocation and Path Finding
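[Quick Read]: Existing Multi-Agent Path Finding (MAPF) and Task Allocation and Path Finding (TAPF) frameworks lack true cooperation and cannot transport large items that several agents must carry together. To fill this gap, the paper formally defines the Cooperative Transportation Task Allocation and Path Finding (CT-TAPF) problem. The solution has two parts. First, it proposes an optimal solver, CT-TCBS (Cooperative Transportation Task Conflict-Based Search), whose novel Incremental Expansion strategy prunes the dominant task-allocation search space and thereby mitigates the combinatorial explosion caused by team formation. Second, to balance computational cost against solution quality, it develops a family of task-centric sub-optimal solvers based on a global difficulty metric (Best Task or Worst Task); these solvers pick the next task to assign from a global perspective and establish a more efficient frontier on the solution quality-runtime spectrum than traditional agent-centric baselines.
Link: https://arxiv.org/abs/2605.16097
Authors: Ning Zhou,Nikolai W.F. Bode,Edmund R. Hunt
Affiliations: University of Bristol
Categories: Multiagent Systems (cs.MA)
Comments: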
Abstract:Multi-robot systems are integral to modern logistics, but their capabilities are often limited to tasks executable by individual agents. This paper addresses a critical gap in existing frameworks like Multi-Agent Path Finding (MAPF) and Task Allocation and Path Finding (TAPF), which lack true cooperation for transporting large items that require multiple agents. To this end, we formalise the Cooperative Transportation Task Allocation and Path Finding (CT-TAPF) problem, which integrates team formation, task assignment, and collision-free pathfinding. We present an optimal solver, Cooperative Transportation Task Conflict-Based Search (CT-TCBS), which features a novel Incremental Expansion strategy to tackle the combinatorial explosion inherent in team formation. Recognising the computational cost of optimality, we also develop a family of sub-optimal solvers that employ a global, task-centric perspective, selecting the next task to assign based on a global difficulty metric (Best Task or Worst Task). Our comprehensive empirical evaluation demonstrates three key findings: (1) the incremental expansion strategy significantly outperforms the naive combinatorial approach by successfully pruning the dominant task-allocation search space; (2) we identify a task-conflict expansion dilemma, where sophisticated conflict resolvers effective for large-agent pathfinding subproblems can be detrimental in the integrated CT-TAPF setting; and (3) our proposed sub-optimal solvers establish a new, more efficient frontier on the solution quality-runtime spectrum compared to “nn-” agent-centric baselines. This work provides a foundational framework and a set of effective algorithms for a new, practical class of cooperative multi-agent problems.
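[MA-4] Who Owns This Agent? Tracing AI Agents Back to Their Owners
[Quick Read]: This paper addresses the accountability problem of agent attribution: AI agent deployments lack a reliable tracing mechanism, so harmful agent behavior cannot be linked to the account that deployed it. Whether the cause is unintentional misconfiguration or deliberate weaponization (scams, harassment, cyber attacks), affected parties cannot notify the operator, terminate the session, or identify the account. The key to the solution is a canary-based protocol: an authorized party injects a canary signal into the agent's interaction stream, and the vendor searches session logs within a narrow time window to recover the originating session and its account. Simple canaries are effective in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, the paper designs robust canary constructions that cannot be suppressed without degrading the agent's own task performance, which yields a formal asymmetry in the defender's favor.
Link: https://arxiv.org/abs/2605.16035
Authors: Ruben Chocron,Doron Jonathan Ben Chayim,Eyal Lenga,Gilad Gressel,Alina Oprea,Yisroel Mirsky
Affiliations: Ben-Gurion University of the Negev; Center for Cybersecurity Systems and Networks, Amrita Vishwa Vidyapeetham; Northeastern University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Under Review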
Abstract:AI agents are increasingly deployed to act autonomously in the world, yet there is still no reliable way to trace a harmful agent back to the account that deployed it. This creates the same accountability gap across both ends of the intent spectrum: benign operators may deploy misconfigured or overbroad agents that cause harm unintentionally, while malicious operators may deliberately weaponize agents for scams, harassment, or cyber attacks. In many cases, these agents are powered by vendor-hosted models, a dependency that holds even for sophisticated adversaries such as state actors conducting cyber operations. In either case, affected parties can observe the behavior but cannot notify the responsible operator, stop the session, or identify the account for investigation. We formalize this gap as the problem of agent attribution: linking an observed agent interaction to the responsible account at the hosting vendor. To our knowledge, this is the first work to define the problem and present a practical solution. Our protocol is canary-based: an authorized party injects a canary into the agent’s interaction stream, and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries suffice in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, we develop robust canary constructions that cannot be suppressed without degrading the agent’s own task performance, yielding a formal asymmetry in the defender’s favor. We evaluate a variety of scenarios including real-world agents and show that our attribution method is reliable, robust, and scalable for vendor-side deployment. 
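The simple, non-adversarial case of the canary protocol (inject a marker, then search a narrow window of session logs) might look like the sketch below. The log schema `SessionLogEntry`, the marker format, and the 60-second default window are assumptions for illustration; the robust canary constructions for adversarial operators are not covered here.

```python
import secrets
from dataclasses import dataclass


@dataclass
class SessionLogEntry:
    account_id: str
    session_id: str
    timestamp: float  # seconds since epoch
    content: str      # text the hosted model received in this request


def make_canary(n_bytes=16):
    """Return a random, hard-to-guess marker string."""
    return "canary-" + secrets.token_hex(n_bytes)


def inject_canary(page_text, canary):
    """Append the canary to content that the observed agent will read."""
    return f"{page_text}\n[ref:{canary}]"


def attribute(logs, canary, t_injected, window_s=60.0):
    """Vendor-side search: (account_id, session_id) pairs whose logged input
    contains the canary within the window after injection."""
    hits = set()
    for entry in logs:
        in_window = t_injected <= entry.timestamp <= t_injected + window_s
        if in_window and canary in entry.content:
            hits.add((entry.account_id, entry.session_id))
    return hits
```

Restricting the search to a short window keeps the vendor-side scan small, which is the property the abstract relies on for scalable deployment.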
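[MA-5] BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge
[Quick Read]: This paper addresses a core bottleneck that code agents face when helping developers with unfamiliar repositories: repository bootstrapping requires extensive trial and error, yet the knowledge gained (resolved dependencies, repair strategies) is usable only within a single conversation and cannot be reused by later agents, which wastes resources. The key to the solution is BootstrapAgent, a multi-agent framework that uses evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair to distill the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent-consumable ".bootstrap contract". The contract covers environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. The paper also introduces warm repair with clean replay, which speeds up iterative debugging without sacrificing cold-start reproducibility, and delta repair with sanity check, which prevents reward hacking. Experiments show a 92.9% success rate on three benchmarks, more than 10% above the baseline, while downstream agent token usage falls by 25.9% and build time by 22.3%.
Link: https://arxiv.org/abs/2605.15815
Authors: Sihan Fu,Oucheng Liu,Shiyuan Wang,Jin Shi,Chengkun Wei
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 19 pages, 9 figures, 6 tables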
Abstract:Code agents increasingly help developers work with unfamiliar repositories, but every such task depends on a costly prerequisite: bootstrapping the repository into a usable development state. This process requires substantial trial-and-error exploration, yet the resulting knowledge–resolved dependencies, repair strategies–stays trapped in a single conversation, unavailable to future agents. We therefore formulate repository bootstrapping as a reusable startup knowledge problem and introduce BootstrapAgent, a multi-agent framework that distills the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent-consumable .bootstrap contract. Through evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair, BootstrapAgent generates a contract covering environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. We further propose warm repair with clean replay to accelerate iterative debugging without sacrificing cold-start reproducibility, and a delta repair with sanity check to prevent reward hacking. Experiments on three benchmarks show that BootstrapAgent achieves a 92.9% success rate, outperforming the baseline by over 10% while reducing downstream agent token usage by 25.9% and build time by 22.3%. Our code is available at this https URL.
[MA-6] From Gridworlds to Warehouses: Adapting Lightweight One-shot Multi-Agent Pathfinding for AGVs IJCAI2026
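[Quick Read]: This paper addresses the reality gap of classical multi-agent pathfinding (MAPF) in warehouse automation: traditional models assume four-connected 2D grids with unit-time moves in four directions, which does not match the actual motion of differential-drive AGVs (time-consuming rotations, acceleration and deceleration constraints, and avoidance of rear-end collisions). The paper therefore proposes multi-agent warehouse pathfinding (MAWPF), which introduces four constraints to bridge the gap between the theoretical model and real settings. The key to the solution is adapting representative suboptimal MAPF algorithms (PP, LNS2, PIBT, and LaCAM) to the MAWPF framework and benchmarking them systematically. PIBT-based methods show a clear scalability advantage, at the price of higher path cost.
Link: https://arxiv.org/abs/2605.15799
Authors: Hiroki Nagai,Keisuke Okumura
Affiliations: National Institute of Advanced Industrial Science and Technology (AIST); Keio University
Categories: Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: To be presented at IJCAI 2026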
Abstract:Multi-agent pathfinding (MAPF) under one-shot planning is a core component of warehouse automation, yet classical formulations typically assume four-connected 2D grids with unit-time moves in four directions. To fill reality gaps while still being trackable with discrete combinatorial search, this work proposes a more practical counterpart tailored to differential-drive AGVs. We term this multi-agent warehouse pathfinding (MAWPF), featuring four constraints: (i) agent actions are restricted to straight motion and in-place rotation; (ii) rotations require multi-step costs; (iii) acceleration and deceleration are considered; and (iv) follower collisions are prohibited to prevent rear-end crashes. To solve MAWPF efficiently, we adapt representative suboptimal MAPF algorithms (PP, LNS2, PIBT, and LaCAM) and conduct comprehensive benchmarking. Our experiments reveal that PP and LNS2 struggle to solve instances with many agents, while PIBT-based approaches achieve preferable scalability with increased solution cost. We believe that these constitute an important step toward adapting classical gridworld MAPF to operational warehouse setups.
[MA-7] Preserving Topology Privacy of Network Systems by Feedback: Conditions and Distributed Design
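[Quick Read]: This paper addresses topology privacy for consensus protocols in network systems: how to prevent attackers or other third parties from uniquely or accurately reconstructing the true network topology from observed data while preserving the consensus behavior. The key to the solution is to intentionally violate topology identifiability conditions by designing a feedback-based mechanism under local communication, so that the observations no longer reveal the true topology. Specifically, the paper derives feedback conditions for both the partial-observation and full-observation cases: in the former, privacy follows from making the topology unsolvable from the data; in the latter, a solution space is constructed that forces the recovered topology to be inaccurate. Building on this, it proposes a distributed topology modification design under limited privacy budgets, establishes a controllable trade-off between consensus deviation and topology privacy, and develops a low-complexity heuristic algorithm to optimize the degree of privacy preservation on existing edges.
Link: https://arxiv.org/abs/2605.15743
Authors: Yushan Li,Jiabao He,Julien M. Hendrickx,Dimos V. Dimarogonas
Affiliations: KTH Royal Institute of Technology; UCLouvain
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments: 13 pages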
Abstract:This paper develops a feedback-based method to preserve the topology privacy of consensus protocols in network systems. The key idea is to intentionally violate topology identifiability conditions, thereby preventing unique or accurate recovery of the true topology from available observations, while preserving the intended consensus behavior. This problem is challenging because the feedback magnitude directly reflects the privacy level of edges, while it is strongly coupled with the consensus convergence and constrained by local communications at each node. To begin with, we derive the feedback conditions of both partial and full observation cases, where the topology unsolvability from observation data is characterized in the former, and the solution space that enforces topology inaccuracy from data is constructed in the latter. Then, we propose a novel distributed topology modification design under limited privacy budgets, and establish the performance guarantees through a controllable tradeoff between the consensus deviation and the topology privacy. Finally, we develop a low-complexity heuristic algorithm to achieve optimal privacy preservation on existing edges. Comparative simulations validate the effectiveness and outperformance of the proposed preservation design.
[MA-8] Distributed Zeroth-Order Policy Gradient for Networked Multi-agent Reinforcement Learning from Human Feedback
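[Quick Read]: This paper addresses infinite-horizon networked multi-agent reinforcement learning (NMARL) from human preferences, where agents interact over an underlying network with localized state dependencies. Existing preference-feedback methods rely on centralized training and apply only to single-agent settings, so they do not scale. The key to the solution is a new human feedback mechanism based on spatiotemporally truncated trajectories, namely H-horizon trajectory pairs aggregated over each agent's κ-hop neighborhood. On top of this, the paper proposes a fully distributed zeroth-order policy gradient algorithm in which each agent estimates its local policy gradient from human preference feedback generated by the current joint policy and by a joint policy perturbed with a zero-mean Gaussian distribution. The feedback depends only on state-action information within the neighborhood and requires neither explicit reward signals nor centralized control. The algorithm is proven to converge to an ε-stationary point with polynomial sample complexity.
Link: https://arxiv.org/abs/2605.15697
Authors: Pengcheng Dai,He Wang,Dongming Wang,Jian Qin,Wenwu Yu
Affiliations: Singapore University of Technology and Design; Southeast University; University of California, Riverside; Sichuan University; Purple Mountain Laboratories
Categories: Multiagent Systems (cs.MA)
Comments: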
Abstract:We study a networked multi-agent reinforcement learning (NMARL) problem with human feedback in an infinite-horizon setting, where agents interact over an underlying network with localized state dependencies and aim to collaboratively maximize the average discounted return. Existing approaches with preference feedback are primarily developed for single-agent settings and rely on centralized training, which limits their scalability and applicability to large-scale networked multi-agent systems. To address this, we introduce a novel human feedback mechanism based on spatiotemporally truncated trajectories, defined as H-horizon trajectory pairs aggregated over each agent’s κ-hop neighborhood. Building on this, we develop a distributed zeroth-order policy gradient algorithm, where each agent estimates its local policy gradient using human preference feedback generated from both the current joint policy and a perturbed joint policy drawn from a zero-mean Gaussian distribution. Specifically, the algorithm is fully distributed, as the feedback received by each agent depends solely on the state-action information within its κ-hop neighborhood and does not require explicit reward signals or centralized control. We further rigorously establish that the proposed algorithm converges to an ε-stationary point with polynomial sample complexity. Finally, simulation results in a stochastic GridWorld environment and a predator-prey environment further demonstrate the effectiveness and scalability of the proposed algorithm in achieving collaborative optimization based solely on human preference feedback.
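The general idea of a zeroth-order update driven only by preference comparisons can be sketched for a single agent as below. This is not the paper's estimator: the noiseless `preference` oracle, the sign-based averaging, and the names `zo_preference_direction` and `local_return` are assumptions chosen to keep the example small, and the κ-hop neighborhood aggregation is omitted.

```python
import random


def preference(value_a, value_b):
    """Hypothetical noiseless oracle: 1 if trajectory A is preferred, else 0."""
    return 1 if value_a > value_b else 0


def zo_preference_direction(theta, local_return, mu=0.1, n_samples=200, rng=None):
    """Estimate an ascent direction for theta from pairwise comparisons only.

    Each sample perturbs theta along a zero-mean Gaussian direction u and asks
    whether the perturbed policy is preferred to the current one. Averaging
    sign * u gives a direction positively correlated with the true gradient
    when the return is smooth and mu is small (an assumption of this sketch).
    """
    rng = rng or random.Random(0)
    direction = [0.0] * len(theta)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        perturbed = [t + mu * ui for t, ui in zip(theta, u)]
        # Only the comparison outcome is used, never the return values.
        sign = 2 * preference(local_return(perturbed), local_return(theta)) - 1
        for i, ui in enumerate(u):
            direction[i] += sign * ui / n_samples
    return direction
```

A policy update would then take a small step along the returned direction; no explicit reward signal enters the update, only comparison outcomes.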
[MA-9] Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
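[Quick Read]: This paper addresses the tension in multi-agent systems whereby existing parallel and sequential collaboration modes cannot simultaneously minimize communication latency and maximize final response accuracy. The key to the solution is a hybrid paradigm called Nexa, built around a trainable response-conditioned policy: the system first runs all agents in parallel, embeds their responses into a shared semantic space, and then uses a lightweight Transformer model to predict a sparse directed acyclic communication graph. If the predicted graph is empty, execution stays purely parallel; otherwise one round of sequential message propagation is performed. The policy needs no external LLM judges or reward models and no hand-crafted test-time topology search. It is trained end to end with policy-gradient optimization, and the trained policy generalizes well, transferring to settings with a different number of agents, a different task, or a different underlying agent.
Link: https://arxiv.org/abs/2605.15573
Authors: Nurbek Tastan,Alex Iacob,Lorenzo Sani,Meghdad Kurmanji,Nicholas D. Lane,Samuel Horvath,Karthik Nandakumar
Affiliations: MBZUAI, UAE; University of Cambridge, UK; Flower Labs, UK; Michigan State University, USA
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: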
Abstract:Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.
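One common way to obtain a communication graph that is acyclic by construction is to allow edges only from a lower-ranked agent to a higher-ranked one. The sketch below uses that idea purely as an illustration; the abstract does not say how Nexa enforces acyclicity, and `sparse_dag`, `execution_order`, the score matrix, and the threshold are all hypothetical.

```python
from graphlib import TopologicalSorter


def sparse_dag(scores, threshold=0.5):
    """Keep edge i -> j only if i < j and scores[i][j] > threshold.

    Restricting edges to i < j rules out cycles by construction; an empty
    result corresponds to staying in pure parallel mode.
    """
    n = len(scores)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if scores[i][j] > threshold}


def execution_order(n_agents, edges):
    """Topological order for a single sequential message-propagation pass."""
    sorter = TopologicalSorter({j: set() for j in range(n_agents)})
    for i, j in edges:
        sorter.add(j, i)  # agent j consumes the message produced by agent i
    return list(sorter.static_order())
```

`static_order` raises `graphlib.CycleError` on a cyclic input, so it doubles as a check that the construction really produced a DAG.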
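[MA-10] Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking
[Quick Read]: This paper addresses two problems in persistent underwater target tracking with swarms of autonomous underwater vehicles (AUVs): there is no unified open-source platform for evaluating different multi-agent reinforcement learning (MARL) algorithms under six-degree-of-freedom AUV dynamics, and existing policies, trained only on raw geometric states and low-level force actions, struggle to represent high-level semantics such as task phases, observation reliability, link quality, and local cooperation roles. The solution has two parts: (1) an open-source platform that integrates the DI-engine training framework with a physically modeled six-degree-of-freedom underwater AUV target-tracking simulator and provides a unified experimental protocol for fair training, testing, and comparison of representative reinforcement learning (RL) and MARL algorithms; and (2) STG-MAPPO, built on this platform, which constructs semantic policy inputs (tracking diagnostics, task phases, observation confidence, link availability, neighbor tracking quality, and local role advantage), uses a compact Semantic Task Graph to link communication-constrained network states to decentralized actor decisions, and introduces a velocity-level action abstraction that maps high-level cooperative decisions to executable six-degree-of-freedom AUV control commands, improving policy robustness and cooperation efficiency in acoustically constrained environments.
Link: https://arxiv.org/abs/2605.15528
Authors: Shengchao Zhu,Guangjie Han,Chuan Lin,Yu He
Affiliations: Unknown
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: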
Abstract:Autonomous underwater vehicle (AUV) swarms are emerging as intelligent underwater networks, where each node must sense, communicate, process local data, and make decisions under severe acoustic constraints. Persistent underwater target tracking is a typical task with moving targets, changing communication topology, intermittent acoustic links, and limited observation for each AUV. Multi-agent reinforcement learning (MARL) is a natural candidate for distributed tracking, yet existing studies still lack a unified open-source platform for evaluating different MARL algorithms under six-degree-of-freedom AUV dynamics. In addition, policies trained with raw geometric states and low-level force actions often struggle to represent task phases, observation reliability, link quality, and local cooperation roles. This paper addresses these issues by developing an open-source MARL-AUV platform that integrates DI-engine with a six-degree-of-freedom underwater AUV target-tracking simulator. To the best of our knowledge, it is the first open platform that connects a public MARL training framework with physically modeled AUV swarm-based tasks, and provides a unified experimental protocol for fair training, testing, and comparison of representative RL and MARL algorithms. Based on this platform, we propose STG-MAPPO, a Semantic Task Graph-enhanced variant of Multi-Agent Proximal Policy Optimization. STG-MAPPO builds semantic policy inputs from tracking diagnostics, task phases, observation confidence, link availability, neighbor tracking quality, and local role advantage. A compact semantic task graph links communication-constrained network states to decentralized actor decisions, and a velocity-level action abstraction maps high-level cooperative decisions to executable six-degree-of-freedom AUV control this http URL code is available at this https URL.
[MA-11] Estimated Dynamic Equilibrium Model: Supply and Demand as a Sample Path of a Stochastic Process
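[Quick Read]: This paper addresses persistent market disequilibrium and asset price bubbles that traditional economic models cannot explain, in particular why prices can still grow exponentially when investors are neither irrational nor overly optimistic. The core aim is to reconcile contradictory findings in the empirical literature on how divergence of opinion affects asset pricing. The key to the solution is identifying and formalizing a statistical bias, the order-statistic bias: when market-clearing prices are sequentially sampled from the upper tail of the distribution of noisy bids and recycled as inputs to future valuations, expected prices drift upward even though the single-period estimation error has zero mean. The mechanism is verified by a closed-form derivation for i.i.d. uniform bids and by simulations that compound the bias across epochs, with no assumption of investor optimism or irrationality. This extends Miller's static divergence-of-opinion theory to a dynamic framework and shows that a single agent ruleset can give rise to six distinct regimes, from band-stability to runaway bubbles. The framework also warns that machine-learning valuation algorithms may inadvertently amplify this inherent bias.
Link: https://arxiv.org/abs/2605.15472
Authors: Mikhail L. Arbuzov,Sisong Bei,Alexey Shvets
Affiliations: Independent Researcher; Palo Alto Networks
Categories: Multiagent Systems (cs.MA)
Comments: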
Abstract:We introduce the Estimated Dynamic Equilibrium Model (EDEM), an agent-based framework that treats supply and demand as a coupled stochastic process driven by heterogeneous, noisy agent valuations. The model’s primary technical contribution is the identification of a generative mechanism for persistent disequilibrium: when market-clearing prices are sequentially sampled from the upper tail of noisy bid distributions and recycled as inputs for future valuations, expected prices drift upward despite strictly zero-mean estimation errors. We derive this order-statistic bias in closed form for i.i.d. uniform bids and use simulations to show that compounding this bias across epochs yields exponential price growth without requiring assumptions of investor optimism or irrationality. This framework extends Miller’s divergence-of-opinion theory to a dynamic setting, recovering Walrasian equilibrium and Miller’s static premium as limiting cases. Through controlled experiments and sensitivity analysis on a simulated real-estate neighborhood, we identify six distinct regimes, ranging from band-stability to runaway bubbles, that emerge from a single agent ruleset. These results offer a potential explanation for the contradictory findings in the empirical divergence-of-opinion literature and suggest that machine-learning valuation algorithms may inadvertently amplify this inherent statistical bias.
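The order-statistic bias is easy to check numerically in its simplest form. The sketch below takes the clearing price to be the maximum of n i.i.d. bids drawn from Uniform(v - a, v + a), whose expectation is v + a(n - 1)/(n + 1) even though every bid's error has zero mean. Using the maximum as the clearing price, and letting the noise half-width scale with the current price in `compounded_expected_price`, are simplifying assumptions of this example rather than details taken from the paper.

```python
import random


def expected_max_uniform(v, a, n):
    """E[max of n i.i.d. Uniform(v - a, v + a) bids] = v + a * (n - 1) / (n + 1)."""
    return v + a * (n - 1) / (n + 1)


def simulated_max_uniform(v, a, n, trials=20000, seed=0):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(v - a, v + a) for _ in range(n))
    return total / trials


def compounded_expected_price(v0, rel_noise, n, epochs):
    """Expected price after recycling the clearing price for several epochs.

    Assumes the half-width is rel_noise * current price and that noise is
    independent across epochs, so the expectation grows geometrically.
    """
    growth = 1 + rel_noise * (n - 1) / (n + 1)
    return v0 * growth ** epochs
```

With a single bidder (n = 1) the bias vanishes, and it grows toward the full half-width a as the number of bidders increases.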
[MA-12] Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation
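[Quick Read]: This paper addresses the problem that, when LLM (Large Language Model) agents simulate deliberative interactions (negotiation, conflict resolution, multi-turn opinion exchange), the generated transcripts do not reveal why an agent's stance changes. The change may stem from evidence uptake, anchoring, role drift, echoing, or changes in prompt and retrieval context, and existing methods offer neither auditability nor interpretability. The key to the solution is the Belief Engine (BE), an auditable belief-update layer that defines "belief" as an evidential state over a proposition and represents it as a scalar stance. BE extracts arguments into structured memory and updates belief explicitly with a log-odds rule controlled by an evidence uptake parameter u and a prior anchoring parameter a, preserving a complete evidence-level update trail. Stance dynamics thus become tunable through parameters, and behaviors such as openness, commitment, convergence, and disagreement can be attributed to explicit update assumptions rather than hidden prompt effects.
Link: https://arxiv.org/abs/2605.15343
Authors: Joshua C. Yang,Maurice Flechtner,Damian Dailisan,Michiel A. Bakker
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: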
Abstract:LLM-based agents are increasingly used to simulate deliberative interactions such as negotiation, conflict resolution, and multi-turn opinion exchange. Yet generated transcripts often do not reveal why an agent’s stance changes: movement may reflect evidence uptake, anchoring, role drift, echoing, or changed prompt and retrieval context. We introduce the Belief Engine (BE), an auditable belief-update layer that treats “belief” as an evidential state over a proposition and exposes it as scalar stance. BE extracts arguments into structured memory and updates stance with a log-odds rule controlled by evidence uptake u and prior anchoring a. Across multiple base LLMs, parameter sweeps show that these controls reliably shape stance dynamics while preserving an evidence-level update trail. On DEBATE, a human deliberation dataset with pre/post opinions, BE best reconstructs participants whose final stance follows extracted evidence; stable and evidence-opposed cases instead point to anchoring or factors outside the extracted evidence stream. BE provides configurable infrastructure for studying evidence-grounded deliberation, where openness, commitment, convergence, and disagreement can be tied to explicit update assumptions rather than hidden prompt effects.
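A log-odds stance update with an uptake parameter u and an anchoring parameter a could take a form like the one below. The abstract does not give the actual rule, so this particular formula (a convex pull toward the prior's log-odds plus u times a signed evidence weight) and the names `update_stance` and `trail` are assumptions for illustration only.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def update_stance(stance, prior, evidence, u, a):
    """Apply a hypothetical log-odds update for each (argument, weight) pair.

    u scales how strongly each piece of evidence moves the stance; a in [0, 1]
    pulls the log-odds back toward the prior before the evidence is added.
    Returns the final stance and an evidence-level trail of every step.
    """
    trail = []
    log_odds = logit(stance)
    for argument, weight in evidence:
        before = log_odds
        log_odds = a * logit(prior) + (1 - a) * before + u * weight
        trail.append((argument, weight, sigmoid(before), sigmoid(log_odds)))
    return sigmoid(log_odds), trail
```

Because every step is recorded in `trail`, a stance change can be traced back to the specific argument and weight that produced it, which is the auditability property the abstract emphasizes.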
[MA-13] SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution
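[Quick Read]: This paper addresses the problem that LLM-driven program evolution for automated scientific discovery lacks principled design guidance and offers no guarantee that the search converges. The key to the solution is the SMCEvolve framework, which recasts program search as sampling from a reward-tilted target distribution and approximates it with a Sequential Monte Carlo (SMC) sampler. From this view, three core mechanisms are derived as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control. The paper also provides a finite-sample complexity analysis that bounds the LLM-call budget needed to reach a target approximation error.
Link: https://arxiv.org/abs/2605.15308
Authors: Jiachen Jiang,Huminhao Zhu,Zhihui Zhu
Affiliations: Department of Computer Science and Engineering; The Ohio State University
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: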
Abstract:LLM-driven program evolution has emerged as a powerful tool for automated scientific discovery, yet existing frameworks offer no principled guide for designing their individual components and provide no guarantee that the search converges. We introduce SMCEvolve, which recasts program search as sampling from a reward-tilted target distribution and approximates it with a Sequential Monte Carlo (SMC) sampler. From this view, three core mechanisms emerge as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control. We further provide a finite-sample complexity analysis that bounds the LLM-call budget required to reach a target approximation error. Across math, algorithm efficiency, symbolic regression, and end-to-end ML research benchmarks, SMCEvolve surpasses state-of-the-art evolving systems while using fewer LLM calls under self-determined termination. The code is available at this https URL.
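To make the "reward-tilted target plus SMC sampler" idea concrete, here is a toy tempered SMC sampler in which real numbers stand in for programs and a Gaussian perturbation stands in for an LLM mutation. Everything below is an illustrative assumption rather than SMCEvolve's implementation: the uniform prior on [lo, hi], the temperature schedule `betas`, the ESS threshold, and the Metropolis-Hastings move. The sketch does not attempt the paper's automatic convergence control.

```python
import math
import random


def reweight(weights, rewards, delta_beta):
    """Multiply weights by exp(delta_beta * reward) and renormalize."""
    m = max(rewards)  # subtract the max reward for numerical stability
    unnorm = [w * math.exp(delta_beta * (r - m)) for w, r in zip(weights, rewards)]
    total = sum(unnorm)
    return [w / total for w in unnorm]


def effective_sample_size(weights):
    """ESS of normalized weights; equals n when the weights are uniform."""
    return 1.0 / sum(w * w for w in weights)


def smc_sampler(reward_fn, betas, n, rng, lo=-10.0, hi=10.0,
                step=0.5, ess_frac=0.5):
    """Sample approximately from uniform(lo, hi) tilted by exp(beta * reward)."""
    particles = [rng.uniform(lo, hi) for _ in range(n)]
    weights = [1.0 / n] * n
    prev_beta = 0.0
    for beta in betas:
        # 1) Reweight by the incremental tilt between consecutive targets.
        rewards = [reward_fn(p) for p in particles]
        weights = reweight(weights, rewards, beta - prev_beta)
        # 2) Adaptive resampling: only when the weights have degenerated.
        if effective_sample_size(weights) < ess_frac * n:
            particles = rng.choices(particles, weights=weights, k=n)
            weights = [1.0 / n] * n
        # 3) Mutation with Metropolis-Hastings acceptance for the current target.
        moved = []
        for p in particles:
            cand = p + rng.gauss(0.0, step)
            if lo <= cand <= hi:
                log_ratio = beta * (reward_fn(cand) - reward_fn(p))
                if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                    p = cand
            moved.append(p)
        particles = moved
        prev_beta = beta
    return particles, weights
```

With a reward such as `lambda x: -(x - 3.0) ** 2` and an increasing schedule like `[0.1, 0.5, 1.0, 2.0, 4.0]`, the weighted particles concentrate around the reward maximum as beta grows.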
[MA-14] ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
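[Quick read]: This paper addresses two problems: LLM agents that rely on external critique fail to internalize its guidance and regress once the critique is removed, and a frozen critic cannot keep improving its feedback quality, which limits iterative self-improvement. The key to the solution is ICRL, a reinforcement-learning framework for internalizing self-critique that jointly trains a solver and a critic from a shared backbone and introduces three core mechanisms: (1) rewarding the critic based on the solver's subsequent performance gain, which incentivizes actionable feedback; (2) a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution, mitigating the distribution shift between critique-conditioned and critique-free behavior; and (3) role-wise group advantage estimation to stabilize joint optimization. As a result, the solver learns to improve on its own without external critique rather than depending on critique-conditioned behavior.
Link: https://arxiv.org/abs/2605.15224
Authors: Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Nanjing University; Sun Yat-sen University; National University of Singapore; Nanyang Technological University; SAP; Microsoft Research
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: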
Abstract:Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique’s guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver’s subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver’s own prompt distribution. Additionally, role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at this https URL.
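Two of the mechanisms above can be illustrated with a few lines of arithmetic: the critic's reward as the solver's performance gain, and advantages normalized separately per role. The sketch below assumes a GRPO-style group normalization (reward minus group mean, divided by group standard deviation) applied independently to the solver group and the critic group. The abstract does not give the exact estimator, so the function names and the normalization are hypothetical, and the distribution-calibration ratio is not modeled.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def critic_rewards(solver_before, solver_after):
    """Assumed critic reward: the solver's score gain after seeing the critique."""
    return [after - before for before, after in zip(solver_before, solver_after)]


def role_wise_advantages(solver_rewards, critic_rewards_):
    """Compute advantages per role so one role's reward scale cannot swamp the other."""
    return {
        "solver": group_advantages(solver_rewards),
        "critic": group_advantages(critic_rewards_),
    }
```

Normalizing each role against its own group keeps the two reward scales (task success for the solver, score deltas for the critic) from interfering when both roles share one backbone.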
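[MA-15] TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination ICML2026
[Quick read]: This paper addresses the performance degradation that multi-agent LLM systems suffer during sequential fine-tuning because of shifts in the shared context distribution. When the agents in a team are updated one after another, each update changes the context distribution that later agents depend on; if later updates are evaluated on cached historical trajectories (stale rollouts), the mismatch compounds into a structural failure mode the authors call the compounding occupancy shift. The key to the solution is the TeamTR framework, which resamples trajectories after each component update to remove the stale-evaluation error and combines this with per-agent divergence control that keeps each update inside a trust region. This reduces the evaluation penalty from quadratic to linear scaling in the number of agents and yields rigorous improvement lower bounds for every update and every stage.
Link: https://arxiv.org/abs/2605.15207
Authors: Yi Xie, Siao Liu, Falong Fan, Yuanqi Yao, Yue Zhao, Bo Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 9 pages, accepted at ICML 2026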
Abstract:Multi-agent LLM systems have shown promise for complex reasoning, yet recent evaluations reveal they often underperform single-model baselines. We identify a structural failure mode in sequential fine-tuning of shared-context teams: updating one agent shifts the team’s context distribution, and when subsequent updates are evaluated on cached rollouts, this mismatch compounds. We formalize this as the compounding occupancy shift and prove that stale-occupancy evaluation incurs a penalty that scales quadratically with the number of agents. In contrast, intermediate-occupancy evaluation reduces this to linear scaling. We propose TeamTR, a trust-region framework that resamples trajectories after each component update and enforces per-agent divergence control, yielding rigorous per-update and per-stage improvement lower bounds. Experiments show that TeamTR outperforms single-agent and sequential baselines by 7.1% on average, mitigates coordination regressions, and supports plug-and-play component replacement. Code is available at this https URL.
[MA-16] Agent4POI: Agentic Context-Conditioned Affordance Reasoning for Multimodal Point-of-Interest Recommendation
[Quick read]: This paper addresses the inability of existing POI recommender systems, which use pre-computed static embeddings, to perform context-sensitive ranking (for example, the same cafe suits solo work on Monday but a group celebration on Friday evening), and formally proves that no pre-computed encoder can satisfy such requirements under standard bilinear scoring. The core of the solution is to invert the computation: at inference time, a four-phase LLM agent generates context-conditioned multimodal representations. Phase 1 generates dynamic affordance queries from the situational context; Phase 2 runs a five-step cross-modal chain-of-thought over image, review, and metadata evidence; Phase 3 builds an uncertainty-aware affordance representation grounded in Gibsonian affordance theory; and Phase 4 aligns it with user preferences via a semantic caching system for low-latency ranking. With this design, the system degrades by only 7.5% under context shift (versus 16–17% for the strongest baselines) and outperforms the best content-based baseline by up to 2.4x in cold-start scenarios.
Link: https://arxiv.org/abs/2605.15203
Authors: Jinze Wang, Yangchen Zeng, Tiehua Zhang, Lu Zhang, Yuze Liu, Yongchao Liu, Xingjun Ma, Zhu Sun
Affiliations: Tongji University; Swinburne University of Technology; Southeast University; Chengdu University of Information Technology; Ant Group; Fudan University; Singapore University of Technology and Design
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:We introduce Agent4POI, the first POI recommendation framework that generates context-conditioned multimodal representations at recommendation time, rather than relying on static POI embeddings pre-computed independently of context. Existing multimodal systems encode each POI once as a static embedding, a design that precludes reasoning about why the same cafe affords solo work on Monday but group celebration on Friday evening. We formally prove that no pre-computed encoder can satisfy context-sensitive ranking under standard bilinear scoring, motivating inference-time item-side representation. Agent4POI inverts this computation: given a situational context, a four-phase LLM agent generates dynamic, context-specific affordance queries (Phase 1) and executes a five-step cross-modal chain-of-thought over image, review, and metadata evidence (Phase 2). The resulting uncertainty-aware affordance representation is grounded in Gibsonian affordance theory. These cross-modal verdicts form a structured, uncertainty-adjusted affordance representation (Phase 3), which is aligned with user preferences via a semantic caching system for low-latency ranking (Phase 4). On three POI benchmarks and three evaluation configurations (standard, cold-start, context-shift), Agent4POI achieves a 23.2% relative gain over the strongest baseline and degrades by only 7.5% under context-shift versus 16–17% for the strongest baselines. In cold-start scenarios, Agent4POI outperforms the best content-based baseline by up to 2.4x, whereas ID-based methods fail to generalize.
[MA-17] MAxLM: Multi-Agent Language Model-Based Scheduling and Resource Allocation in MU-MIMO-OFDMA-Enabled Wireless Networks
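[Quick read]: This paper addresses the optimization of user scheduling and resource allocation (SRA) for uplink scheduled access (UL-SA) in wireless local area networks (WLANs) operating in the joint MU-MIMO-OFDMA transmission mode, with the goal of increasing network throughput. The key to the solution is a multi-agent (MA) framework that uses an openly available pretrained small/medium-sized language model (xLM) to perform SRA autonomously, supported by the AI-assisted Wireless Systems Engineering and Research (WiSER) platform.
Link: https://arxiv.org/abs/2605.16144
Authors: Adnan Quadri, Hongxiang Li
Affiliations: University of Louisville
Categories: Signal Processing (eess.SP); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: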
Abstract:Wireless networks support multi-user (MU) communication with multiple-input multiple-output (MIMO) and orthogonal frequency-division multiple access (OFDMA) technologies. In the joint MU-MIMO-OFDMA-enabled transmission mode, network throughput can be significantly increased by effectively utilizing the multi-channel resources to schedule numerous wireless users/stations (STAs) simultaneously. In this paper, we study ways to optimize the user scheduling and resource allocation (SRA) for the UL scheduled access (UL-SA) of a joint MU-MIMO-OFDMA-enabled wireless local area network (WLAN). In particular, we propose a multi-agent (MA) framework that utilizes an openly available pretrained small/medium-sized Language Model (xLM) to perform SRA for the UL-SA. To facilitate autonomous SRA using our proposed technique, we introduce the AI-assisted Wireless Systems Engineering and Research (WiSER) platform. We evaluate the performance of MAxLM-optimized SRA for network scenarios with a varying number of STAs and antenna settings on the WLAN Access Point. Numerical results confirm that our proposed technique achieves higher UL-SA throughput than the benchmark techniques.
Natural Language Processing
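[NLP-0] A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation
[Quick read]: This paper addresses the challenges distribution utilities face in bill transparency, carbon traceability, and coordinating load scheduling with grid constraints. Specifically, utilities are expected to deliver natural-language bills that customers can understand, attach a defensible carbon number to every kilowatt-hour (kWh), and schedule load day-ahead against grid stress and emissions constraints. The key to the solution is an end-to-end framework that unifies four production-grade capabilities under one architecture, including a generative-AI agent that drafts each customer's natural-language bill from structured numeric inputs under a constrained decoding policy, and a transformer-based forecaster that supplies day-ahead consumption estimates with calibrated quantile bands.
Link: https://arxiv.org/abs/2605.16250
Authors: Pavan Manjunath, Thomas Pruefer
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: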
Abstract:Distribution utilities are now expected to deliver bills that customers can actually read, attach a defensible carbon number to every kWh sold, and schedule load against grid stress and emissions constraints. We propose an end-to-end framework that unifies four production-grade capabilities under one architectural roof: a generative-AI agent that drafts each customer's natural-language billing statement from structured numeric inputs under a constrained decoding policy; a transformer-based forecaster that supplies the day-ahead consumption estimate with calibrated quantile bands
[NLP-1] AI-Mediated Communication Can Steer Collective Opinion
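[Quick read]: This paper examines how generative AI, when mediating human-to-human communication, affects collective opinion formation. It fills a gap left by prior work, which paid little attention to how AI intermediaries introduce directional bias that is then amplified through the network and shifts overall opinion. The solution has three key parts. First, an empirical analysis shows that several popular LLMs exhibit systematic directional bias when editing text on contested topics (for example, leaning in favor of gun control and against atheism). Second, a mathematical model of opinion dynamics treats the AI system as a transformer of the opinions that users on a social network express and perceive; analyzing the model's equilibrium and simulating it on real social network data shows how AI bias is amplified through the network and drives shifts in collective opinion. Third, an audit of the "Explain this post" feature on X finds a pro-life bias and traces it to specific design choices, which speaks to whether platforms can control such biases.
Link: https://arxiv.org/abs/2605.16245
Authors: Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: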
Abstract:Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users’ posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions and shape individuals’ opinions during human-AI interactions, less attention has been paid to its influence on collective opinion formation when mediating human-to-human communication. We address this gap via a combination of empirical and theoretical analyses. We show empirically that LLMs from multiple popular families introduce directional biases when instructed to edit human-written texts on contested topics, for example, nudging texts in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users on a social network, transforming the opinions they express and perceive. By analytically characterizing the equilibrium of this model and performing simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion in their direction. In light of these findings, we investigate whether such biases are controllable by online platforms. We audit the “Explain this post” feature on X and find evidence of pro-life bias in Grok’s outputs on abortion-related content, which we trace back to specific design choices. We conclude with a discussion of the broader implications of our findings in relation to ongoing legislative efforts in the European Union.
[NLP-2] Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find
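[Quick read]: This paper addresses the protocol gap that arises when transformer layer equivalence is evaluated by conflating two swap-KL probes: replacement and interchange. Both are output-grounded KL-divergence measures, yet on pretrained models they often give inconsistent assessments of which layers are safe to prune, so the number of safely prunable layers can differ several-fold under the same evaluator. The key to the solution: before removing or merging layers, compute both swap-KL values (replacement KL and interchange KL) on the target checkpoint and use this diagnostic, which requires only unlabeled forward passes, to judge more accurately which layers are suitable for pruning, avoiding wrong decisions caused by the bias of a single protocol.
Link: https://arxiv.org/abs/2605.16234
Authors: Gabriel Garcia
Affiliations: Independent Researcher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 40 pages, 8 figures, 24 tables. Code and frozen JSON logs are not public during write-up; the authors plan to open this https URL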
Abstract:When researchers ask whether two transformer layers are “equivalent” for compression, they often conflate distinct tests. Replacement asks whether one layer’s map can substitute for another’s in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.
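The difference between the two probes is easy to see on a toy model. The sketch below is an assumed reading of the two protocols, not the paper's evaluation code: a "layer" is a scalar residual map, the "model" is a list of such layers, and the output distribution is a softmax over the final hidden vector. Replacement copies layer j's map into position i (layer j stays where it is); interchange swaps the two positions. All names and the toy layer form are hypothetical.

```python
import math


def make_layer(w, b):
    """Toy residual layer: h -> h + tanh(w * h + b), applied elementwise."""
    def layer(h):
        return [x + math.tanh(w * x + b) for x in h]
    return layer


def output_dist(layers, h):
    """Run the stack and turn the final hidden vector into a distribution."""
    for layer in layers:
        h = layer(h)
    m = max(h)
    exps = [math.exp(x - m) for x in h]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_swap_kl(layers, edited, inputs):
    """Average KL between the original and the edited model over the inputs."""
    return sum(kl(output_dist(layers, h), output_dist(edited, h))
               for h in inputs) / len(inputs)


def replacement_kl(layers, i, j, inputs):
    """Replacement probe: layer j's map substitutes for layer i in place."""
    edited = list(layers)
    edited[i] = layers[j]
    return mean_swap_kl(layers, edited, inputs)


def interchange_kl(layers, i, j, inputs):
    """Interchange probe: layers i and j trade positions."""
    edited = list(layers)
    edited[i], edited[j] = layers[j], layers[i]
    return mean_swap_kl(layers, edited, inputs)
```

Scoring a candidate pair with both functions on the same unlabeled inputs gives the two numbers the abstract recommends comparing before any layer is removed or merged.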
[NLP-3] A Unified Generative-AI Framework for Smart Energy Infrastructure: Intelligent Gas Distribution Utility Billing Carbon Analytics and Quantum-Inspired Optimisation
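[Quick read]: This paper focuses on the new challenges energy utilities face in managing physical infrastructure, customer engagement, and environmental accountability as smart metering, generative AI, and quantum-inspired combinatorial optimisation rapidly converge. The key to the solution is to combine the data-generation and forecasting capabilities of generative AI with the efficient combinatorial optimisation of quantum-inspired algorithms, enabling in-depth analysis of smart-metering data and decision support so that infrastructure operations, personalised customer demand response, and carbon-emission control targets can be coordinated dynamically.
Link: https://arxiv.org/abs/2605.16232
Authors: Pavan Manjunath, Thomas Pruefer
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: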
Abstract:The accelerating convergence of smart metering, generative artificial intelligence, and quantum-inspired combinatorial optimisation is reshaping how energy utilities manage physical infrastructure, customer engagement, and environmental accountability.
[NLP-4] Artificial Aphasias in Lesioned Language Models
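[Quick read]: This paper borrows the lesion-study methodology used for human aphasia to characterize the internal functional organization of large language models (LMs). It asks what causal role different components (query, key, value, and output in attention; up, gate, and down in the feed-forward network) and different layer depths play in language processing, and how the resulting symptom profiles compare with those of human aphasia. The key to the solution is a "lesion" technique that zeroes out specific model parameters to simulate local loss of function, then uses the Text Aphasia Battery (TAB) to diagnose symptoms in the model's outputs. This quantitatively links parameter lesions to specific language deficits (syntactic, semantic, phonological, fluency) and reveals how language function is distributed across components and depths in LMs.
Link: https://arxiv.org/abs/2605.16222
Authors: Nathan Roll, Jill Kries, Laura Gwilliams, Cory Shain
Affiliations: Department of Linguistics, Stanford University; Department of Psychology, Stanford University; Wu Tsai Neurosciences Institute, Stanford University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 49 pages, 13 figures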
Abstract:Aphasias, selective language impairments which can arise from brain damage, reveal the functional organization of human language by providing causal links between affected brain regions and specific symptom profiles. Drawing on this literature, we introduce an aphasia-inspired technique to characterize the emergent functional organization of language models (LMs). We “lesion” (zero-out) model parameters and measure the effects of this intervention against clinical aphasia symptoms, as diagnosed by the Text Aphasia Battery (TAB). When applied to 112,426 outputs from five 1B-scale LMs, the full range of evaluated symptoms surfaces, but in distributions largely distinct from those of humans. Our method uncovers broad symptom-profile differences between attention components (query, key, value, output) and feed-forward components (up, gate, down), with weaker evidence for differences among components within the same mechanism. We also find an effect of depth, where lesions in early layers disproportionately cause syntactic and semantic symptoms while late-middle layers yield higher rates of phonological and fluency deficits. Although some LM lesions induce quantitatively more similar profiles to some human aphasia types than others, qualitative differences in symptom patterns between LMs and humans suggest that aphasia syndromes are heavily influenced by the details of learning and processing rather than being a domain-invariant consequence of disrupted language processing.
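[NLP-5] Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
[Quick read]: This paper addresses the lack of transparency and auditability in current LLM-based clinical decision support systems (CDSS): most existing "open" models are open-weight only, with data provenance, curation procedures, and generation pipelines withheld, which leaves the systems opaque and hard to validate reproducibly. The key to the solution is Fully Open Meditron, presented as the first Fully Open (FO) pipeline, which exposes the complete training stack end-to-end. It comprises a clinician-audited training corpus (unifying eight public medical QA datasets and expanding coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes); a reproducible data construction and training framework (with system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel); and a use-aligned evaluation protocol (an LLM-as-a-judge method over expert-written clinical vignettes, calibrated against 204 human raters). The pipeline's defining feature is full openness, which makes state-of-the-art domain-specific performance achievable without sacrificing auditability or reproducibility.
Link: https://arxiv.org/abs/2605.16215
Authors: Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley
Affiliations: LiGHT, EPFL
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. 31 pages, 10 figures. Code, models, and data: this https URL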
Abstract:Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most “open” models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.
[NLP-6] Confirming Correct Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
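[Quick read]: The core question of this paper is whether LLM-based tutoring systems can accurately distinguish optimal, valid but suboptimal, and incorrect student solution steps, a diagnostic ability that adaptive tutoring requires but that had not been systematically tested. The approach is to build a ground-truth benchmark derived from a knowledge graph (KG) and run a large-scale evaluation of the diagnostic precision of seven LLM feedback agents in propositional logic (10,836 solution-feedback pairs, three feedback conditions). The study finds that although models perform near ceiling on optimal steps, they are systematically biased when separating valid but suboptimal solutions from incorrect ones (over-rejecting the former and over-validating the latter), and that this bias stems from architectural limits rather than missing information. The paper therefore proposes a hybrid architecture in which KG-grounded models handle precise diagnosis while LLMs are used for open-ended scaffolding and dialogue, bridging the gap between diagnostic judgment and instructional effectiveness.
Link: https://arxiv.org/abs/2605.16207
Authors: Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes
Affiliations: North Carolina State University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 20 figures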
Abstract:Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution–feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
[NLP-7] Improving Cross-Cultural Survey Simulation with Calibrated Value Personas IJCAI ECAI2026
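[Quick read]: This paper addresses the problem that, when large language models (LLMs) simulate human opinions and survey responses across cultures, existing persona-based prompting built on sociodemographic or personality traits fails to reproduce population responses accurately, and performs worst for populations underrepresented in training data. The solution has two key parts. First, a value-based persona construction method derives textual descriptors of core cultural dimensions from survey responses; by sampling value profiles from the target population and aggregating LLM responses across personas, it obtains population-level predictions grounded in observed value distributions. Second, a calibration procedure increases response diversity while preserving the estimated opinions, narrowing the performance gap between countries aligned with LLM priors and underrepresented countries and bringing response distributions closer to human diversity.
Link: https://arxiv.org/abs/2605.16193
Authors: Axel Abels, Elias Fernandez Domingos, Apurva Shah, Tom Lenaerts
Affiliations: Machine Learning Group, Université Libre de Bruxelles; AI Lab, Vrije Universiteit Brussel; FARI Institute, Université Libre de Bruxelles - Vrije Universiteit Brussel; Center for Human-Compatible AI, UC Berkeley; ELLIS Alicante
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Submitted to the Fourth International Workshop on Value Engineering in AI (VALE 2026), held at IJCAI-ECAI 2026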
Abstract:Large language models (LLMs) are increasingly used to simulate human opinions and survey responses, but their ability to reproduce population responses across cultures remains limited. Existing persona-based prompting methods typically rely on sociodemographic or personality traits, which are only indirect proxies for the values that shape human responses. We propose a value-based persona construction method that derives textual descriptors from survey responses capturing core cultural dimensions. By sampling value profiles from target populations and aggregating LLM responses across personas, we obtain population-level predictions grounded in observed value distributions. We further introduce a calibration procedure that improves response diversity while preserving estimated opinions. We show that our approach reduces prediction error across countries, with the largest improvements observed in underrepresented populations. This substantially narrows the performance gap between countries aligned with dominant LLM priors and those that are less represented in training data, while also yielding response distributions that closely match human diversity.
[NLP-8] Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search
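[Quick read]: This paper aims to overcome the energy losses that conventional flat solar panels incur at mid-latitudes because of their fixed tilt, by autonomously generating efficient three-dimensional photovoltaic (3DPV) structures that raise daily energy density. The key to the solution is an iteratively refined workflow that combines a generic coding agent (AntiGravity) with an LLM-driven tree search algorithm (ERA). AntiGravity first reproduces an existing 3DPV computational model as the starting point; a large-scale tree search then generates candidate designs and scores them by diurnal yield. To counter the reward hacking that appears early in the search, such as non-physical levitating structures or exploitation of the optics solver's discretization, the workflow has the coding agent automatically patch the physics engine with constraints that eliminate these spuriously efficient designs. The resulting process discovers optimized 3DPV structures under several constraints, including designs with fixed collector area, zenith-tracking adjustment, and avoidance of self-shadowing, which illustrates the potential of combining coding agents with tree search for scientific discovery tasks that can be scored empirically.
Link: https://arxiv.org/abs/2605.16191
Authors: Michael P. Brenner, Lizzie Dorfman, John C. Platt
Affiliations: Google Research; School of Engineering and Applied Sciences, Harvard University
Categories: Computation and Language (cs.CL); Other Condensed Matter (cond-mat.other); Computational Physics (physics.comp-ph)
Comments: 10 pages, 7 figures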
Abstract:We present a case study for how AI coding systems can be used to generate novel scientific hypotheses. We combine a generic coding agent (Google’s AntiGravity) with an LLM-driven tree search algorithm (Empirical Research Assistance / ERA) to autonomously generate high-efficiency three-dimensional photovoltaic (3DPV) structures that overcome losses limiting flat solar panels at mid-latitudes. These structures operate by presenting favorable angles to the sun throughout the day, and for illustrative purposes we focus on optimizing performance for a single solar day. Our workflow begins by using AntiGravity to reproduce calculations \cite{bernardi2012solar} showing that 3DPV can have energy densities much higher than stationary flat PV panels. We use these initial designs as the starting point for large scale tree search, where we seek improved solutions and score them for their diurnal yield. The initial tree search leads to nominally more efficient solutions, yet they are caused by algorithmic reward hacking, arising from non-physical design features such as structurally levitating disconnected tiers and exploitations of the discretizations in the optics solver. To counteract this, we develop a workflow where the coding agent iteratively patches the physics engine with constraints to eliminate reward hacking. With reward hacking eliminated, ERA discovers a series of designs with various constraints and improved performance, including optimal designs with different fixed collector areas, optimizing zenith tracking and avoiding self shadowing. Combining coding agents with tree search (ERA) provides a powerful platform for scientific discovery, for problems whose solutions can be empirically evaluated with a score function.
[NLP-9] Look Before You Leap: Autonomous Exploration for LLM Agents
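[Quick read]: This paper addresses the failure of LLM-based agents in unfamiliar environments caused by premature exploitation: acting on prior knowledge before acquiring enough environment-specific information, which prevents effective adaptation to new environments. The solution has three key parts. First, it proposes a verifiable metric, Exploration Checkpoint Coverage, that quantifies how broadly an agent discovers key states, objects, and affordances. Second, it develops an alternating training strategy that interleaves task-execution rollouts with exploration rollouts, each optimized by its own verifiable reward. Third, building on this training strategy, it proposes the Explore-then-Act paradigm, which decouples information gathering from task execution: the agent first spends an interaction budget acquiring grounded environmental knowledge, then uses it to solve the task. By learning to explore systematically, agents can build capabilities that generalize and are ready for real-world deployment.
Link: https://arxiv.org/abs/2605.16143
Authors: Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng
Affiliations: University of Science and Technology of China; Meituan
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: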
Abstract:Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
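As a rough illustration of the metric and the Explore-then-Act split, the sketch below assumes that Exploration Checkpoint Coverage is the fraction of predefined checkpoints (key states, objects, affordances) that an agent reached during exploration. The abstract does not give the exact definition, so that reading is an assumption. A breadth-first walk over a toy state graph stands in for the learned exploration policy, and `explore` and `checkpoint_coverage` are hypothetical names.

```python
from collections import deque


def explore(graph, start, budget):
    """Visit up to `budget` states breadth-first (stand-in for a learned policy)."""
    visited = set()
    queue = deque([start])
    while queue and len(visited) < budget:
        state = queue.popleft()
        if state in visited:
            continue
        visited.add(state)
        queue.extend(graph.get(state, []))
    return visited


def checkpoint_coverage(visited, checkpoints):
    """Fraction of checkpoints that appear among the visited states."""
    targets = set(checkpoints)
    if not targets:
        raise ValueError("at least one checkpoint is required")
    return len(set(visited) & targets) / len(targets)
```

In an Explore-then-Act loop, the visited set (or notes derived from it) would be passed to the task-solving phase once the interaction budget is spent; a larger budget should never lower coverage.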
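[NLP-10] SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation
[Quick read]: This paper addresses the limited performance of large language models (LLMs) in complex settings that demand deep reasoning and logical inference, where generation may introduce irrelevant, noisy, or factually inconsistent content. The key to the solution is the SGR (Stepwise subGraph Reasoning) framework, which generates a query-specific subgraph from external knowledge bases and uses its semantic structure to support multi-step reasoning. By grounding intermediate reasoning steps in structured external knowledge, it steers the model toward relevant entities, relations, and supporting evidence, and finally combines multiple reasoning trajectories to produce the prediction.
Link: https://arxiv.org/abs/2605.16117
Authors: Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li
Affiliations: School of Information Science and Engineering, Chongqing Jiaotong University; School of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments: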
Abstract:Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.
[NLP-11] DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation
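[Quick Read]: This paper aims to mitigate social biases in LLMs (around race, gender, age, and similar attributes) that stem from training corpora, and to overcome the limitations of fine-tuning and prompt-engineering methods, which need extra training resources, may degrade the model's original capabilities, and lack dynamic debiasing contexts. The key to the solution is DebiasRAG, a tuning-free, dynamic, query-specific debiasing framework built on retrieval-augmented generation (RAG). It works in three stages: first, it retrieves self-diagnosed, query-relevant bias contexts (prepared offline) and reversely generates debiasing contexts that act as fairness constraints; second, it retrieves query-related contexts from a regular RAG document database to build a candidate pool; finally, it applies gradient-updated, debiasing-guided reranking of context pieces, dynamically improving output fairness while preserving intrinsic properties of the LLM such as representation ability.
Link: https://arxiv.org/abs/2605.16113
Authors: Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: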
Abstract:Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.
[NLP-12] Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
[Quick Read]: This paper tackles a core challenge in detecting machine-generated texts (MGTs): in metric-based methods, the token-level detection score is easily biased by the inherent randomness of the MGT generation process, which makes detection unstable. The key to the solution is a multi-level contextual token relation modeling framework. For local relations, a lightweight Markov-informed calibration module refines token-level evidence before aggregation; for global relations, a rule-support reasoning module uses explicit logical rules derived from contextual score statistics. The two are combined in a joint multi-level inference framework, which substantially improves detection accuracy in cross-LLM and cross-domain settings while keeping computational overhead low.
Link: https://arxiv.org/abs/2605.16107
Authors: Chenwang Wu, Yiuming Cheung, Bo Han, Shuhai Zhang, Defu Lian
Affiliations: Department of Computer Science, Hong Kong Baptist University, Hong Kong, China; School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510000, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230000, China; State Key Laboratory of Cognitive Intelligence
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGTs generation process. Then, we theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. Specifically, for local relations, we model them through a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.
[NLP-13] Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction
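[Quick Read]: This paper aims to address the difficulty of accurately assessing cognitive decline from spontaneous speech, which stems from limited dataset size and class imbalance. The key to the solution is an LLM-driven data augmentation framework: GPT-5 uses each participant's written response as a semantic anchor to generate multiple oral-like monologues in different styles, expanding the training data. A similarity-guided class-balanced selection strategy then prioritizes synthetic samples that are semantically close to the real ones, which substantially reduces prediction error for the minority low-score participants while maintaining performance on the majority group, giving more consistent and stable improvements.
Link: https://arxiv.org/abs/2605.16077
Authors: Si-Belkacem Yamine Ketir, Lenard Paulo Tamayo, Shohei Hisada, Shaowen Peng, Shoko Wakamiya, Eiji Aramaki
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures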
Abstract:Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.
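The similarity-guided class-balanced selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scores have been binned into discrete classes, that each synthetic sample carries the embedding of the real sample it was derived from (the `real_embedding` field is hypothetical), and that cosine similarity is the closeness measure.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_synthetic(real, synthetic):
    """Top up every class to the majority-class count.

    real:      list of (embedding, label)
    synthetic: list of (embedding, label, real_embedding)  # hypothetical layout
    Within each class, the synthetic samples most similar to their
    real counterpart are picked first.
    """
    counts = Counter(label for _, label in real)
    target = max(counts.values())
    chosen = []
    for label, n in counts.items():
        pool = [s for s in synthetic if s[1] == label]
        pool.sort(key=lambda s: cosine(s[0], s[2]), reverse=True)
        chosen.extend(pool[: target - n])
    return chosen

# Toy data: three "high" samples, one "low" sample (assumed class names).
real = [([1.0, 0.0], "high"), ([0.9, 0.1], "high"),
        ([0.8, 0.2], "high"), ([0.0, 1.0], "low")]
synthetic = [
    ([0.0, 1.0], "low", [0.0, 1.0]),   # identical direction to the real sample
    ([1.0, 1.0], "low", [0.0, 1.0]),   # partly similar
    ([1.0, 0.0], "low", [0.0, 1.0]),   # orthogonal
    ([1.0, 0.0], "high", [1.0, 0.0]),  # majority class, not needed
]
chosen = select_synthetic(real, synthetic)
```

In this toy run the minority class needs two extra samples, and the orthogonal candidate is the one left out.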
[NLP-14] Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law
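[Quick Read]: This paper asks whether the performance LLMs show on legal reasoning tasks reflects genuine legal reasoning ability or is merely inflated by data contamination. The solution has three key parts. First, a contamination detection protocol rigorously assesses LLM reliability and exposes how contamination inflates performance. Second, a neuro-symbolic hybrid system translates statutory text into formal representations and delegates inference to symbolic solvers, replacing end-to-end reasoning by a monolithic LLM. Third, a novel test suite of case and rule variations probes generalization to unseen documents. Experiments indicate that legal reasoning is inherently compositional and that the neuro-symbolic framework is more reliable and robust than a monolithic LLM and generalizes better.
Link: https://arxiv.org/abs/2605.16052
Authors: Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: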
Abstract:Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.
[NLP-15] RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents ACL2026
[Quick Read]: This paper addresses the excessive token consumption of existing memory systems for long-running agents, which invoke an LLM on every interaction to extract memory. The key to the solution is recurrence-based lazy consolidation: interactions are first stored in a subconscious memory layer and encoded for retrieval with lightweight embedding models only; the LLM is invoked to extract episodic and semantic memory only once semantically similar interactions are observed to recur persistently (that is, they form an information-rich semantic cluster). A semantic refinement mechanism additionally recovers fine-grained facts omitted during memory extraction. As a result, the memory construction token cost of three state-of-the-art memory systems drops by up to 87% while accuracy is maintained or even improved.
Link: https://arxiv.org/abs/2605.16045
Authors: Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao, Xiao Yan, James Cheng
Affiliations: Department of Computer Science and Engineering, The Chinese University of Hong Kong; Huawei Cloud; School of Computer Science, Beijing University of Posts and Telecommunications; Huawei Theory Lab; Institute for Math and AI, Wuhan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026 Findings
Abstract:Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.
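A rough sketch of the recurrence-based lazy consolidation idea follows. It is illustrative only and not RecMem's actual code: the class name `LazyMemory`, both thresholds, the fixed cluster anchor, and the `fake_llm_summarize` stand-in for an LLM call are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class LazyMemory:
    """Buffers interactions cheaply; calls the expensive consolidator only on recurrence."""

    def __init__(self, consolidate, sim_threshold=0.8, recurrence_threshold=3):
        self.consolidate = consolidate            # stand-in for an LLM extraction call
        self.sim_threshold = sim_threshold        # assumed value
        self.recurrence_threshold = recurrence_threshold  # assumed value
        self.clusters = []                        # the "subconscious" buffer
        self.long_term = []                       # consolidated memories

    def add(self, text, embedding):
        """Store one interaction; return True if it triggered consolidation."""
        best, best_sim = None, -1.0
        for cluster in self.clusters:
            sim = cosine(cluster["anchor"], embedding)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < self.sim_threshold:
            # No similar cluster yet: open a new one anchored at this embedding.
            best = {"anchor": list(embedding), "items": [], "done": False}
            self.clusters.append(best)
        best["items"].append(text)
        if not best["done"] and len(best["items"]) >= self.recurrence_threshold:
            self.long_term.append(self.consolidate(best["items"]))
            best["done"] = True
            return True
        return False

calls = []
def fake_llm_summarize(items):
    """Placeholder for LLM-based memory extraction; records each invocation."""
    calls.append(list(items))
    return " | ".join(items)

memory = LazyMemory(consolidate=fake_llm_summarize)
triggered = [
    memory.add("likes green tea", [1.0, 0.0]),
    memory.add("asked about the weather", [0.0, 1.0]),
    memory.add("ordered green tea again", [0.95, 0.05]),
    memory.add("prefers green tea to coffee", [0.9, 0.1]),
]
```

Here the stand-in consolidator runs once, on the third tea-related interaction, while the one-off weather question never reaches it. An eager scheme would have made four calls.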
[NLP-16] From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation ACL2026
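[Quick Read]: This paper addresses the low data efficiency of multilingual adaptation in existing compositional speech-to-speech translation (S2ST) systems. Current systems often ignore source-language information or follow a language-as-label paradigm that encodes each source language as an independent flat embedding, overlooking the systematic linguistic structure shared across languages and limiting multilingual adaptation when supervised data are scarce. The key to the solution is S2ST-Omni 2, a framework that systematically reformulates multilingual conditioning from flat language labels to structured typological priors at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. The three strategies are complementary, and the explicit typological priors provide a useful inductive bias for data-efficient multilingual S2ST.
Link: https://arxiv.org/abs/2605.16026
Authors: Yu Pan, Yang Hou, Xiongfei Wu, Liang Zhang, Yves Le Traon, Lei Ma, Jianjun Zhao
Affiliations: Kyushu University; Recho Inc.; National Institute of Informatics; University of Luxembourg; Donghua University; The University of Tokyo; University of Alberta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE/ACM TASLP. This work extends S2ST-Omni, accepted to Findings of ACL 2026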
Abstract:Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.
[NLP-17] Judge Circuits
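[Quick Read]: This paper addresses the following problem: in the LLM-as-a-judge paradigm, the same model produces systematically inconsistent scores under different output formats (for example, a 1-5 rating vs. a True/False label), yet existing diagnoses stop at the input-output level and do not reveal the internal mechanism. The key to the solution is a causal analysis of Gemma-3, Qwen2.5, and Llama-3 with Position-aware Edge Attribution Patching (PEAP). It finds that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph located in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating this sub-graph destroys judgment while preserving world knowledge, indicating that abstract judging and output formatting are structurally decoupled. This yields a mechanistic account of format-induced inconsistency: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, so a format-independent preference can be isolated downstream of the requested output format. The finding implies that benchmark-level reliability comparisons across formats partially measure formatter geometry rather than evaluation quality.
Link: https://arxiv.org/abs/2605.16023
Authors: Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann
Affiliations: Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data; German Research Center for Artificial Intelligence (DFKI); Fraunhofer Heinrich Hertz Institute; Marburg University; Centre for European Research in Trusted AI (CERTAIN)
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 32 pages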
Abstract:LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.
[NLP-18] Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study
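[Quick Read]: This paper addresses the lack of a framework for evaluating whether Vision Language Models (VLMs) adapt to different learner profiles (cognitive, motivational, and complexity-related) in mathematics tutoring tasks. The key to the solution is to draw on the learner model from the adaptive learning framework and propose a learner model-based rubric that formalizes adaptivity assessment into three aspects (cognitive aspects, motivational aspects, and complexity) and additionally evaluates the correctness and quality of VLM responses.
Link: https://arxiv.org/abs/2605.16011
Authors: Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube, Jackie Chi Kit Cheung
Affiliations: McGill University; Mila – Quebec AI Institute; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: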
Abstract:Adaptive learning refers to educational technologies that track learners’ learning progress and adapt the instructional process based on individual learners’ learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.
[NLP-19] Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory
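[Quick Read]: This paper addresses construct ambiguity in cross-cultural AI evaluation: the notion of "cultural capability" is vaguely defined, terminology is used interchangeably, and evaluation is limited to factual recall. The key to the solution is a three-level taxonomy drawn from Intercultural Communication scholarship: Cultural Awareness answers "Does the model know?", Cultural Sensitivity answers "How does it frame its knowledge?", and Cultural Competence answers "Can it adapt as the interaction evolves?". Beyond clarifying concepts, the taxonomy serves as a practical tool for improving the validity and interpretability of AI evaluation in real-world multicultural settings, helping avoid overestimated model capabilities and inappropriate deployment caused by unclear constructs.
Link: https://arxiv.org/abs/2605.15990
Authors: Isar Nejadgholi, Masoud Kianpour, Krishnapriya Vishnubhotla, Maryam Molamohamadi
Affiliations: National Research Council, Canada; Toronto Metropolitan University, Canada; Mila, Quebec AI Institute, Canada
Categories: Computation and Language (cs.CL)
Comments: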
Abstract:Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers “Does the model know?”, Cultural Sensitivity answers “How does it frame its knowledge?”, and Cultural Competence answers “Can it adapt as the interaction evolves?”. Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.
[NLP-20] Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports
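[Quick Read]: This paper addresses the fact that many key incident details in law enforcement reports exist as unstructured natural language in written narratives and must be read manually before they can be extracted and used. Its objective is to measure the value of narratives by recovering incident details from the unstructured text alone and building temporal graphs from time cues and domain axioms. The key to the solution is a symbolic pipeline that automates the conversion through redacting personal identifiers, semantic parsing, predicate mapping to an ontology, and reasoning. Event extraction and mapping run along a PropBank–VerbNet–WordNet semantic path and are accompanied by confidence scores, producing evidence-linked facts.
Link: https://arxiv.org/abs/2605.15978
Authors: Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 13 pages, 8 figures, 9 tables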
Abstract:Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations are in natural language and require manual reading. We propose a framework using symbolic methods for converting narratives into evidence-linked facts. Our objective is to measure the value of narratives to recover incident details only from the unstructured text and build temporal graphs with time cues and domain axioms. We achieve this by redacting personal identifiers, semantic parsing, predicate mapping to ontology, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the extracted events from the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank–VerbNet–WordNet semantic path. 100% agreement was reached on incident initiation, stolen items, and temporal cues and lower agreement for forced entry interpretation.
[NLP-21] Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective
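[Quick Read]: This paper addresses the gap that existing reinforcement learning methods mainly target decoder-only LLMs at ≥7B parameters, with little systematic study of small and medium encoder-decoder machine translation models. Specifically, it studies how RL fine-tuning can improve the translation quality of an encoder-decoder model such as NLLB-200 in low-resource language settings without relying on costly parallel data. The key to the solution combines Group Relative Policy Optimization (GRPO) with a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel corpus at fine-tuning time, validated experimentally on 13 typologically diverse languages. The paper further identifies an empirical pattern: gains are largest for languages where baseline performance is weakest and reward discriminability is highest, which makes the approach most effective precisely in low-resource settings where parallel data is scarcest.
Link: https://arxiv.org/abs/2605.15976
Authors: Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa
Affiliations: Universitat Politècnica de Catalunya
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: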
Abstract:Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at ≥7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
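The group-relative part of GRPO can be illustrated with a small sketch: several candidate translations are sampled for one source sentence, each is scored, and each reward is normalized against its own group. The normalization below is the commonly used form; the equal weighting in `hybrid_reward`, the use of the population standard deviation, and the toy scores are assumptions for illustration, since the abstract does not specify how the two reward signals are combined.

```python
from statistics import mean, pstdev

def hybrid_reward(labse_similarity, comet_kiwi_score, weight=0.5):
    """Combine two reference-free signals into one scalar (weighting is assumed)."""
    return weight * labse_similarity + (1.0 - weight) * comet_kiwi_score

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled translations of one source sentence, with made-up scores
# given as (LaBSE similarity, COMET-Kiwi score).
scores = [(0.2, 0.2), (0.4, 0.4), (0.6, 0.6), (0.8, 0.8)]
rewards = [hybrid_reward(s, q) for s, q in scores]
advantages = group_relative_advantages(rewards)

# A group whose samples all score the same carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

One way to read the reported pattern: when every sampled translation of a sentence gets nearly the same reward, the advantages collapse toward zero, so low reward discriminability leaves little to learn from.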
[NLP-22] Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
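[Quick Read]: This paper aims to remove two obstacles to applying block attention in long-context scenarios such as retrieval-augmented generation (RAG): the difficulty of automatically segmenting input text into self-contained blocks that are semantically coherent and match human intuition, and the inefficiency of existing block fine-tuning methods, which also risk degrading performance. The solution rests on two contributions. First, the authors build SemanticSeg, a diverse semantic segmentation dataset of over 30k instances across 16 categories (such as books, code, web text, and conversations) with lengths from 2k to 32k tokens, and use it to train a lightweight segmenter that automatically partitions text into human-intuition-aligned blocks with controllable granularity. Second, they propose block distillation, a training framework in which a frozen full-attention teacher guides a block-attention student. It integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments show that the segmenter outperforms heuristic and statistical baselines and that block distillation reaches near-full-attention performance under block attention, offering a practical and scalable path to deploying block attention.
Link: https://arxiv.org/abs/2605.15913
Authors: Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam
Affiliations: The Chinese University of Hong Kong; City University of Hong Kong; Tencent; Singapore Management University; Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 2 figures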
Abstract:Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.
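To make "blocks that cannot attend to one another" concrete, the sketch below builds a boolean attention mask from a list of block lengths. It is a simplified illustration under stated assumptions: attention is causal within each block, and no block receives special treatment. The function name `block_attention_mask` is hypothetical, and real systems may handle a final query block differently.

```python
def block_attention_mask(block_lengths):
    """Return mask[i][j] == True when token i may attend to token j.

    A token attends only to earlier-or-equal positions inside its own block,
    so no attention crosses a block boundary.
    """
    block_id = []
    for index, length in enumerate(block_lengths):
        block_id.extend([index] * length)
    total = len(block_id)
    return [
        [block_id[i] == block_id[j] and j <= i for j in range(total)]
        for i in range(total)
    ]

# Two blocks of two tokens each: positions 0-1 and 2-3.
mask = block_attention_mask([2, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each block's rows depend only on that block's own tokens, its keys and values can be computed once and reused wherever the block appears, which is the KV cache benefit the abstract refers to.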
[NLP-23] Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches
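[Quick Read]: This paper aims to address the long-standing lack of social text and image data for authoritarian political contexts, specifically the shortage of multimodal political communication data from the Russian government. The core of the solution is an interlinked multimodal dataset of official speeches delivered over multiple decades by senior officials of the Kremlin and the Russian Ministry of Foreign Affairs, with Russian- and English-language texts, associated images and captions, and harmonized metadata (such as dates, speakers, locations, and official tags). Annotations generated by transformer-based multimodal topic modeling and refined by a Russian politics expert label both speech texts and images, supporting multimodal, multilingual, temporal, and spatial analyses.
Link: https://arxiv.org/abs/2605.15886
Authors: Daria Blinova, Gayathri Emuru, Rakesh Emuru, Kushagradheer Shridheer Srivastava, Mina Rulis, Sunita Chandrasekaran, Benjamin E. Bagozzi
Affiliations: University of Delaware, Department of Political Science and International Relations; University of Delaware, Masters of Science in Data Science Program; University of Delaware, Department of Computer and Information Sciences; University of Pennsylvania, Department of Political Science
Categories: Computation and Language (cs.CL)
Comments: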
Abstract:This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.
[NLP-24] Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination ICML2026
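[Quick Read]: This paper investigates whether the self-reflective statements that Vision-Language Models (VLMs) produce during reasoning (such as "let me check the figure again") actually trigger visual re-examination or are merely learned textual patterns. The key to the solution is VisualSwap, an image-swap probing framework: after the model reasons over an image, the image is replaced with a visually similar but semantically different one, and whether the model notices the change indicates its real visual perception. The authors build the VS-Bench benchmark (800 image pairs drawn from several datasets); experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL show that models almost entirely fail to notice the swap, with accuracy dropping by up to 60%. Attention analysis further explains why user instructions substantially raise attention to visual tokens and so restore visual grounding, whereas the model's self-generated reflective statements do not, indicating that current VLMs "say" rather than "see" when they claim to re-examine an image.
Link: https://arxiv.org/abs/2605.15864
Authors: Chufan Shi, Cheng Yang, Yaokang Wu, Linhao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ICML 2026 Spotlight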
Abstract:Vision-Language Models (VLMs) often produce self-reflective statements like “let me check the figure again” during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: this https URL
[NLP-25] ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation
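[Quick Read]: This paper addresses the inability of current machine translation systems to handle spatial grounding and geometric synchronization when processing PDF documents, which breaks the link between text and its visual context (such as nested tables and formulas) and prevents high-fidelity document reconstruction. The key to the solution is ForMaT (Format-Preserving Multilingual Translation), a corpus of 3,956 PDF documents across 15 language pairs that preserves original layout metadata. K-Medoids sampling over 45 geometric features ensures that the dataset contains visually diverse, structurally complex elements, providing a benchmark for developing layout-aware translation models that integrate visual and textual context.
Link: https://arxiv.org/abs/2605.15794
Authors: Michał Ciesiółka, Dawid Wiśniewski, Adrian Charkiewicz, Kamil Guttmann
Affiliations: Laniqo, Poznań, Poland; Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland; Poznań University of Technology
Categories: Computation and Language (cs.CL)
Comments: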
Abstract:We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, capturing complex elements like nested tables and formulas to focus only on visually diverse PDF documents. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.
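K-Medoids sampling picks actual data points as cluster representatives, which suits the goal of selecting real documents that cover diverse layouts. The sketch below is a basic alternating variant written for illustration, not the authors' pipeline: Euclidean distance, the first-`k` initialization, and the two-dimensional toy features (standing in for the 45 geometric features) are all assumptions.

```python
import math

def k_medoids(points, k, max_iter=100):
    """Return the sorted indices of k medoids chosen from `points`."""
    medoids = list(range(k))  # deterministic start: first k points (assumed)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

# Toy "layout features": two visually distinct groups of documents.
features = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
representatives = k_medoids(features, k=2)
```

Sampling the returned indices yields one representative document per layout group, rather than several near-duplicates from the densest group.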
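[NLP-26] CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs
Quick read: This paper addresses the data privacy concerns raised by the reliance of current machine translation Quality Estimation (QE) on massive proprietary large language models (LLMs). The key to the solution is to use smaller open-source LLMs (30B parameters) with a single-pass prompting strategy, so that the model simultaneously generates quality scores, MQM error annotations, suggested error corrections, and full post-editions. System-level correlations with human judgments exceed those of traditional neural metrics, fine-tuned models, and human inter-annotator agreement, approximating the performance of much larger proprietary LLMs while remaining cost-effective and privacy-preserving.
Link: https://arxiv.org/abs/2605.15763
Authors: Kamil Guttmann,Zofia Fraś,Artur Nowakowski,Krzysztof Jassem
Affiliations: Laniqo, Poznań, Poland; Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: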
Abstract:Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.
[NLP-27] DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
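Quick read: This paper addresses the fidelity-efficiency trade-off faced by long-term memory systems for LLM agents: storing raw dialogue histories is faithful but expensive, while flat facts or summaries discard the structure needed for precise recall. The key is DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, self-contained unit with explicit fields (such as time, location, reason, purpose, and keywords). This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall, enabling precise recall without placing full histories in the model context. The paper also shows that dimensional memory extraction can be learned by a compact fine-tuned model: a fine-tuned Qwen3-4B extractor outperforms LightMem with GPT-4.1-mini, further reducing compute cost.
Link: https://arxiv.org/abs/2605.15759
Authors: Wentao Qiu,Haotian Hu,Fanyi Wang,Jinwei Kong,Yu Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: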
Abstract:Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity–efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at this https URL.
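As a rough illustration of what a typed memory unit with explicit fields could look like, the sketch below uses the field names listed in the abstract (time, location, reason, purpose, keywords). The class name `MemoryUnit`, the `retrieve` helper, and the matching rules are hypothetical and are not taken from the DimMem code.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    """One atomic, self-contained memory with explicit dimensions (hypothetical schema)."""
    text: str
    time: str = ""
    location: str = ""
    reason: str = ""
    purpose: str = ""
    keywords: list = field(default_factory=list)

def retrieve(memories, **filters):
    """Return memories that match every given dimension filter.

    Assumed matching rules: string fields match by case-insensitive substring,
    and `keywords` matches when at least one keyword overlaps.
    """
    hits = []
    for m in memories:
        ok = True
        for name, wanted in filters.items():
            value = getattr(m, name)
            if name == "keywords":
                ok = ok and bool(set(wanted) & set(value))
            else:
                ok = ok and wanted.lower() in value.lower()
        if ok:
            hits.append(m)
    return hits

store = [
    MemoryUnit("Booked a hotel for the conference", time="2024-05", location="Paris",
               purpose="travel", keywords=["hotel", "conference"]),
    MemoryUnit("Switched to a standing desk", time="2024-06", location="home office",
               reason="back pain", keywords=["desk", "health"]),
]
print([m.text for m in retrieve(store, location="paris")])
```

Filtering on a single dimension returns only the units whose field matches, so a query about a place or a date never has to scan full dialogue history.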
[NLP-28] Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
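Quick read: This paper addresses limitations of reinforcement learning with verifiable rewards (RLVR): policy exploration is confined to trajectories already sampled, simply increasing the number of rollouts is computationally expensive, and existing changes to the optimization objective give limited control over exploration. The key is NudgeRL, which has two parts. First, Strategy Nudging conditions each rollout on lightweight strategy-level contexts to induce diverse reasoning trajectories without expensive oracle supervision. Second, a unified objective decomposes the reward signal into inter-context and intra-context components and adds a distillation objective that transfers discovered behaviors back to the base policy. The result is structured, diversity-driven exploration that stays compute-efficient and outperforms standard GRPO at equal or larger rollout budgets as well as baselines based on privileged information.
Link: https://arxiv.org/abs/2605.15726
Authors: Chanuk Lee,Sangwoo Park,Minki Kang,Sung Ju Hwang
Affiliations: KAIST; DeepAuto.ai
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages, 7 figures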
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at this https URL.
[NLP-29] Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering
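Quick read: This paper addresses a limitation of existing context engineering methods for large language models (LLMs): they treat the task as a global search for a single context strategy that maximizes average performance over a dataset, ignoring that different inputs need different guidance and leaving instance-level gains untapped. The key is a paradigm shift that formulates context engineering as a recommendation problem, realized in the Neural Collaborative Context Engineering (NCCE) framework. Its core is the Context-CF Co-Evolution mechanism: NCCE first bootstraps a diverse catalog of anchor contexts; a lightweight Neural Collaborative Filtering (NCF) model then learns instance-context preferences, while newly evaluated contexts feed back into the NCF model to refine its understanding of latent preferences, forming a synergistic feedback loop. At inference time, the trained NCF model acts as a dynamic context router that assigns the most suitable context strategy to each unseen instance. By matching each instance with its best context, NCCE significantly improves task accuracy, which underlines the importance of personalization in LLM context engineering.
Link: https://arxiv.org/abs/2605.15721
Authors: Jiachen Zhu,Zhuoying Ou,Congmin Zheng,Yuxiang Chen,Zeyu Zheng,Rong Shan,Lingyu Yang,Lionel Z. Wang,Weiwen Liu,Yong Yu,Weinan Zhang,Jianghao Lin
Affiliations: Shanghai Jiao Tong Univ.; Univ. College London; Carnegie Mellon Univ.; Hong Kong Polytechnic Univ.
Categories: Computation and Language (cs.CL)
Comments: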
Abstract:Large Language Models (LLMs) are highly sensitive to their input contexts, motivating the development of automated context engineering. However, existing methods predominantly treat this as a global search problem, seeking a single context strategy that maximizes average performance across a dataset. This restrictive assumption overlooks the fact that different inputs often require distinct guidance, leaving substantial instance-level performance gains untapped. In this paper, we propose a paradigm shift by formulating context engineering as a recommendation problem. We introduce Neural Collaborative Context Engineering (NCCE), a framework that transitions optimization from a static global search to dynamic, instance-wise routing. NCCE first bootstraps a diverse catalog of anchor contexts and then employs a novel Context-CF Co-Evolution mechanism. This stage establishes a synergistic feedback loop: a lightweight Neural Collaborative Filtering (NCF) model learns instance-context preferences to guide the generation of specialized context variants, while the newly evaluated contexts continuously refine the NCF model’s understanding of latent preferences. At inference time, the trained NCF model acts as a context router, dynamically assigning the most suitable context strategy to each unseen instance. Theoretical proofs and comprehensive experiments demonstrate that by matching individual inputs with their optimal contexts, NCCE significantly improves task accuracy, highlighting the critical importance of personalization in LLM context engineering.
[NLP-30] SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory
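Quick read: Most existing multimodal memory reasoning benchmarks evaluate systems only within pre-assembled contexts and do not adequately test whether agents can reason over evidence distributed across independent sources. When relevant evidence is scattered across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents, this source-distributed memory composition becomes an important but under-evaluated bottleneck for multimodal agent memory. The key is the Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence across multiple sources rather than reason within a single curated context. It evaluates four core capabilities (cross-source multimodal reasoning, conflict resolution, preference reasoning, and memory-grounded action prediction) and contains 1,877 samples grounded in 264 sources, establishing a systematic evaluation framework that reveals marked shortcomings of current systems.
Link: https://arxiv.org/abs/2605.15710
Authors: Huacan Chai,Yukai Wang,Yingxuan Yang,Dan Peng,Yuanyi Song,Zhihui Fu,Weiwen Liu,Jianghao Lin,Jun Wang,Weinan Zhang
Affiliations: Shanghai Jiao Tong University; OPPO
Categories: Computation and Language (cs.CL)
Comments: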
Abstract:Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce the Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1,877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle on these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at this https URL.
[NLP-31] H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
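Quick read: This paper addresses the lack of an effective mechanism for modeling how memory evolves over time, and of an efficient retrieval method, when LLM-based agents use memory data for question answering (QA), which leads to poor memory utilization and performance. The key is H-Mem, a memory mechanism with a hybrid structure that builds a temporal-semantic tree alongside a knowledge graph. The tree lets short-term memory data evolve progressively into long-term memory data, with the latter summarizing the former, thereby modeling long-term evolution; the knowledge graph captures relationships between entities in memory. By combining the tree and graph structures, H-Mem supports an efficient hybrid retrieval method and markedly improves QA performance.
Link: https://arxiv.org/abs/2605.15701
Authors: Jiawei Yu,Yixiang Fang,Xilin Liu,Yuchi Ma
Affiliations: The Chinese University of Hong Kong, Shenzhen; Huawei Cloud Computing Technologies CO., LTD.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: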
Abstract:Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents’ memory to improve their performance on the question-answering (QA) task, but they lack a principled mechanism for modeling how memory data evolves over time and for retrieving memory data effectively, leading to poor performance in memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. Particularly, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid structure of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.
[NLP-32] ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
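Quick read: This paper addresses a problem in machine unlearning (MU) for multimodal large language models (MLLMs): existing methods evaluate unlearning only by output deviation and overlook generation quality, so unlearned models tend to produce hallucinated or rigid responses that harm usability and safety. The key is ASRU, a controllable multimodal unlearning framework that treats generation quality as a core evaluation objective. It first induces initial refusal behavior through activation redirection, then optimizes fine-grained refusal boundaries with a customized reward function, achieving a better trade-off between unlearning target knowledge and preserving model utility.
Link: https://arxiv.org/abs/2605.15687
Authors: Jiahui Guang,Yingjie Zhu,Cuiyun Gao,Haiyan Wang,Jing Li,Di Shao,Zhaoquan Gu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: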
Abstract:Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that, on average, ASRU significantly improves unlearning effectiveness (+24.6%) and generation quality (5.8x) while effectively preserving model utility, using only a small amount of retained supervision data.
[NLP-33] Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries
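Quick read: This paper studies how, under low-resource labeling conditions, large language models (LLMs) can automatically triage informal, incomplete online patient inquiries into four levels of clinical follow-up: self-care, schedule-visit, urgent-clinician-review, or emergency-referral. The approach uses prompted LLMs with few-shot prompting and, for specific categories, two-model agreement. Traditional baselines (TF-IDF and BioBERT) are trained on an auto-labeled silver training set, and a human-calibrated gold evaluation set is used to compare LLM triage performance at 0-shot, 4-shot, and 12-shot, with safety-aware metrics such as emergency recall and under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) exceeds the supervised baseline in macro-F1, but the confidence intervals overlap; LLM agreement is reliable only for the self-care category, while urgent-clinician-review remains unreliable. The conclusion is that LLMs can support triage prioritization and selective human review but not autonomous deployment.
Link: https://arxiv.org/abs/2605.15680
Authors: Liqi Zhou,Jiafu Li
Affiliations: University of Pennsylvania; Northwestern University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 4 figures, 19 tables, 23 pages (including appendix and references)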
Abstract:Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-F1 alongside safety-aware metrics, including emergency recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches a macro-F1 of 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
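The safety-aware metrics named in the abstract can be sketched as follows. The abstract does not define them, so this block assumes the four labels are ordered by urgency, that under-triage means predicting a less urgent level than the gold label, and that severe under-triage means missing by two or more levels; the function name `triage_safety_metrics` is hypothetical.

```python
# Labels ordered from least to most urgent (ordering taken from the abstract's list).
LEVELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
RANK = {name: i for i, name in enumerate(LEVELS)}

def triage_safety_metrics(gold, pred):
    """Compute emergency recall and (severe) under-triage rates under the assumed definitions."""
    pairs = list(zip(gold, pred))
    n = len(pairs)
    # Under-triage: the prediction is less urgent than the gold label.
    under = sum(RANK[p] < RANK[g] for g, p in pairs)
    # Severe under-triage (assumption): the prediction is at least two levels too low.
    severe = sum(RANK[g] - RANK[p] >= 2 for g, p in pairs)
    # Emergency recall: share of gold emergencies that were predicted as emergencies.
    emergency_preds = [p for g, p in pairs if g == "emergency-referral"]
    recall = (
        sum(p == "emergency-referral" for p in emergency_preds) / len(emergency_preds)
        if emergency_preds else float("nan")
    )
    return {
        "emergency_recall": recall,
        "under_triage_rate": under / n,
        "severe_under_triage_rate": severe / n,
    }

gold = ["emergency-referral", "emergency-referral", "self-care", "schedule-visit"]
pred = ["emergency-referral", "schedule-visit", "self-care", "self-care"]
print(triage_safety_metrics(gold, pred))
```

Reporting these alongside macro-F1 matters because a model can score well on average while still routing real emergencies to low-urgency follow-up.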
[NLP-34] VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing ICML2026
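Quick read: This paper addresses a key weakness of current vision-language models (VLMs) on structured, controllable diagram tasks: existing methods rely mainly on pixel-based image synthesis in probabilistic pixel spaces, which severely limits editability and fidelity. The key is a new Diagram-as-Code paradigm with symbolic logic, which uses mxGraph Extensible Markup Language (XML) as an intermediate representation for precise diagram generation and editable code-to-code transformation. The paper also builds VCG-Bench, a unified benchmark comprising a taxonomized dataset, a paradigm definition, and a multi-dimensional evaluation protocol, to systematically evaluate the paradigm in terms of structural fidelity and instruction compliance.
Link: https://arxiv.org/abs/2605.15677
Authors: Xiaoyan Su,Peijie Dong,Zhenheng Tang,Song Tang,Yuyao Zhai,Kaitao Lin,Liang Chen,Gai Yuhang,Yuyu Luo,Qiang Wang,Xiaowen Chu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML2026, 37 pages, 10 figures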
Abstract:Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate, Style Consistency Score (SCS), etc. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.
[NLP-35] Dynamic Chunking for Diffusion Language Models
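Quick read: This paper addresses the structural waste caused by factorizing sequences over fixed-size positional blocks in block discrete diffusion language models: blocks determined by position split semantically coherent tokens apart and force unrelated tokens together. The key is the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into K clusters through learnable subspaces and is optimized end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion.
Link: https://arxiv.org/abs/2605.15676
Authors: Yichen Zhu,Xiaoming Shi,Peng Zhao,Weiyu Chen,Debing Zhang,James Kwok
Affiliations: CSE, HKUST; Xiaohongshu Inc.; Alibaba group; CityUHK
Categories: Computation and Language (cs.CL)
Comments: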
Abstract:Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into K clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.
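One way to picture a chunk-causal attention mask is sketched below. It assumes each token already has an integer chunk id, that chunk ids define the autoregressive order, and that a token may attend to every token in its own chunk and in earlier chunks; the function name `chunk_causal_mask` and these rules are illustrative assumptions, not the paper's implementation.

```python
def chunk_causal_mask(chunk_ids):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption: token i sees token j if j's chunk is the same as, or earlier
    than, i's chunk in the chunk ordering given by the integer ids.
    """
    return [[cj <= ci for cj in chunk_ids] for ci in chunk_ids]

# Content-defined chunks need not be contiguous: tokens 0 and 2 share chunk 0.
semantic = chunk_causal_mask([0, 1, 0, 2])
for row in semantic:
    print(["x" if allowed else "." for allowed in row])

# With contiguous, fixed-size ids the same rule gives an ordinary block-causal mask.
positional = chunk_causal_mask([0, 0, 1, 1])
```

The second call shows why such a mask can be read as a generalization of positional blocks: fixed-size blocks are just the special case where chunk ids follow token position.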
[NLP-36] Evaluating Chinese Ambiguity Understanding in Large Language Models
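Quick read: This paper addresses the limited robustness of large language models (LLMs) to linguistic ambiguity in Chinese, given that existing Chinese ambiguity datasets (such as CHAmbi) scale poorly and that research has focused mostly on English. Guided by Potential Ambiguity Theory, the authors design a semi-automatic pipeline to build CHA-Gen, the first large-scale Chinese ambiguity dataset grounded in that theory, with 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) covering 18 potentially ambiguous structures. Evaluating several LLMs (such as Gemma 3 and the Qwen 2.5/3 series) through direct querying and machine translation, they find that models perform poorly at ambiguity detection, although chain-of-thought (CoT) prompting helps. Analysis of Qwen3-32B's CoT rationales identifies three common failure modes (ambiguity blindness, misattribution, and premature resolution), and uncertainty quantification with semantic entropy confirms that ambiguous sentences carry higher uncertainty. Instruction tuning induces overconfidence, whereas base models better capture semantic diversity, and models are generally biased toward dominant interpretations. The work offers a scalable method for building Chinese ambiguity corpora and a basis for understanding how LLMs handle ambiguity.
Link: https://arxiv.org/abs/2605.15635
Authors: Junwen Mo,Yuanzhi Lu,Yifang Xue,Ke Xu,Hideki Nakayama
Affiliations: The University of Tokyo; South China University of Technology
Categories: Computation and Language (cs.CL)
Comments: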
Abstract:Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen. It is the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potential ambiguous structures. Evaluating LLMs (e.g. Gemma 3, Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection (improved by CoT prompting). Analysis of Qwen3-32B’s CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with semantic entropy metric shows higher uncertainty for ambiguous sentences. Moreover, instruction tuning induces overconfidence, whereas Base models better capture semantic diversity. We further observe that models exhibit a bias toward dominant interpretations. Our work provides a scalable approach for Chinese ambiguity corpus and insights into LLMs’ ambiguity handling, laying a foundation for enhancing Chinese ambiguity research in LLMs.
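For readers unfamiliar with semantic entropy, the sketch below shows the basic computation under simplifying assumptions: sampled model outputs have already been grouped into meaning clusters (for example by an entailment check, which is not shown), and cluster probabilities are estimated by simple frequency. The function name `semantic_entropy` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Entropy (in nats) over meaning clusters of sampled outputs.

    `cluster_labels[i]` is the meaning cluster assigned to the i-th sample;
    cluster probabilities are estimated by relative frequency (an assumption).
    """
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

# Unambiguous sentence: every sampled interpretation falls in the same cluster.
print(semantic_entropy(["reading-A"] * 4))
# Ambiguous sentence: samples split evenly between two readings.
print(semantic_entropy(["reading-A", "reading-A", "reading-B", "reading-B"]))
```

Under this view, a higher value for ambiguous sentences simply reflects that sampled outputs spread across several distinct readings instead of concentrating on one.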
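[NLP-37] Toward LLMs Beyond English-Centric Development
Quick read: This paper shows that open-weight large language models (LLMs) are heavily biased toward English in the sequences they generate, and that continual pre-training, the common way to adapt a model to a target language, offers no cost advantage over training from scratch, even for improving cultural understanding in the target language. The takeaway is that future LLM development should place more weight on dedicated per-language investment rather than relying primarily on the expansion of English-centric resources.
Link: https://arxiv.org/abs/2605.15613
Authors: Sho Takase,Ukyo Honda
Affiliations: CyberAgent
Categories: Computation and Language (cs.CL)
Comments: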
Abstract:Through an analysis of sequences generated by open-weight large language models (LLMs), we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.
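[NLP-38] PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
Quick read: This paper addresses the high inference cost of diffusion large language models (dLLMs) caused by the large number of denoising iterations. Although dLLMs predict all masked positions in parallel at each step, the number of iterations still limits efficiency. The paper proposes Parallel Speculative Decoding (PSD), a training-free framework that improves inference efficiency along both the spatial and temporal axes. Spatially, a configurable adaptive unmasking policy uses the confidence scores from a single forward pass to select multiple positions to unmask. Temporally, multi-depth speculative drafts are constructed without extra model calls, collapsing several denoising steps into one batched verification call, and hierarchical acceptance keeps the deepest draft consistent with the updated predictions. Experiments show that PSD reaches up to 5.5x tokens per forward pass with accuracy comparable to greedy decoding, a favorable trade-off between inference efficiency and generation quality.
Link: https://arxiv.org/abs/2605.15609
Authors: Shengyin Sun,Yiming Li,Renxi Liu,Xinqi Li,Hui-Ling Zhen,Weizhe Lin,Chen Chen,Xianzhi Yu,Mingxuan Yuan,Chen Ma
Affiliations: City University of Hong Kong; Huawei Technologies
Categories: Computation and Language (cs.CL)
Comments: 16 pages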
Abstract:Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to 5.5x tokens per forward pass with accuracy comparable to greedy decoding.
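The two ideas in the abstract can be illustrated with a toy sketch. The abstract does not specify the policy or the acceptance rule, so both functions below are assumptions: `select_unmask_positions` uses a simple confidence threshold with a most-confident fallback, and `deepest_consistent_draft` accepts drafts depth by depth until the first mismatch with the verified predictions. Names and data shapes are hypothetical.

```python
def select_unmask_positions(confidences, masked, threshold=0.9):
    """Pick masked positions to commit this step (assumed threshold policy).

    Falls back to the single most confident masked position so that decoding
    always makes progress.
    """
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen and masked:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return sorted(chosen)

def deepest_consistent_draft(drafts, verified):
    """Return how many draft depths to accept (assumed prefix-acceptance rule).

    drafts[d] maps position -> token proposed at depth d + 1; verified[d] maps
    position -> token predicted when that draft is re-scored in the batched pass.
    """
    accepted = 0
    for depth, draft in enumerate(drafts):
        if all(verified[depth].get(pos) == tok for pos, tok in draft.items()):
            accepted = depth + 1
        else:
            break  # deeper drafts build on this one, so stop at the first mismatch
    return accepted

print(select_unmask_positions([0.95, 0.2, 0.99, 0.5], masked=[0, 1, 3]))
print(deepest_consistent_draft(
    drafts=[{0: "a"}, {1: "b"}, {2: "c"}],
    verified=[{0: "a"}, {1: "b"}, {2: "x"}],
))
```

The threshold controls the spatial trade-off (more tokens per step versus more risk), while the acceptance depth controls how many denoising steps a single verification call can replace.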
[NLP-39] Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
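Quick read: This paper addresses the failure of large language models (LLMs) to transfer semantic competence in code generation to a programming language absent from pretraining: models quickly pick up the new language's syntax, but semantic competence (correctly implementing algorithmic logic) still lags well behind a pretraining language such as Python, and none of the interventions tried (multi-task learning, preference tuning, code infilling, latent-space objectives) closes the gap. The central finding is the "implementation fidelity gap": internal representations are highly similar across languages (CKA > 0.97), indicating language-agnostic algorithmic understanding, yet at the output stage the model cannot map that understanding onto the unfamiliar language. The proposed direction is therefore training methods that decouple reasoning from language-specific realization, so that models keep their algorithmic understanding while adapting to the syntactic constraints of an arbitrary target language.
Link: https://arxiv.org/abs/2605.15607
Authors: Vinayshekhar Bannihatti Kumar,Disha Makhija,Manoj Ghuhan Arivazhagan,Rashmi Gangadharaiah
Affiliations: AWS AI Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: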
Abstract:Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation. CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
[NLP-40] VSPO: Vector-Steered Policy Optimization for Behavioral Control
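Quick read: This paper addresses the sparse behavioral reward bottleneck that arises when modern language models must optimize a primary accuracy objective together with secondary behavioral preferences (such as verbosity, agreeableness, or technical expertise): the base model may exhibit the target behavior rarely or never, so the behavioral reward signal is sparse and multi-objective policy optimization is difficult. The key is Vector-Steered Policy Optimization (VSPO), which uses a steering vector associated with the target behavior to control the behavior intensity of generated rollouts. VSPO modifies GRPO (Group Relative Policy Optimization) to sample rollouts at varying steering intensities, a process that can be interpreted as on-policy latent self-distillation in which the model internalizes its steering vector. By varying the steering intensity, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse-reward issue and provably accelerates policy optimization.
Link: https://arxiv.org/abs/2605.15604
Authors: Xuechen Zhang,Zijian Huang,Kai Yang,Weijia Zhang,Jiasi Chen,Samet Oymak
Affiliations: University of Michigan
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: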
Abstract:Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in its response. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO) which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves the control along target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.
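The two ingredients mentioned in the abstract, steering at varying intensities and GRPO-style group-relative credit, can be sketched in isolation. This is not the VSPO training loop: `steer` shows the common additive form of activation steering (hidden state plus a scaled vector), `group_relative_advantages` shows the usual group normalization used with GRPO, and the intensity grid and reward values are made up for illustration.

```python
import statistics

def steer(hidden, vector, alpha):
    """Shift a hidden state along a steering vector with intensity alpha (additive steering)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of rollouts, one per steering intensity.
intensities = [0.0, 0.5, 1.0, 2.0]
hidden, behavior_vector = [1.0, 2.0], [0.5, -0.5]
steered_states = [steer(hidden, behavior_vector, a) for a in intensities]

# Made-up rewards: stronger steering exhibits the target behavior more often.
rewards = [0.0, 0.0, 1.0, 1.0]
print(steered_states)
print(group_relative_advantages(rewards))
```

If every unsteered rollout received the same zero behavioral reward, the normalized advantages would all be zero; mixing intensities within a group is what gives the update a non-trivial signal in this toy setting.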
[NLP-41] MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models ACL2026
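Quick read: The core question is whether large language models (LLMs) adequately capture biomedical knowledge relevant to mental health and reliably apply it to clinically important structured judgments (such as entity recognition, relation judgment, and multi-step reasoning), which existing evaluations leave unclear. The key is MHGraphBench, a knowledge graph (KG)-grounded benchmark derived from PrimeKG with nine task families covering entity recognition, relation judgment, and two-hop reasoning, using KG-supported answers and controlled negative options to keep evaluation objective. Experiments on 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: models are near ceiling on the entity-typing and relation-typing subsets but still struggle with relation prediction and two-hop reasoning. Output-format reliability also substantially affects measured performance under constrained multiple-choice settings, which highlights the role of response validity in benchmark evaluation. The main contribution is thus a structured framework for measuring how well LLMs agree with KG knowledge in the mental health domain, not a direct assessment of real-world clinical safety.
Link: https://arxiv.org/abs/2605.15589
Authors: Weixin Liu,Congning Ni,Shelagh A. Mulvaney,Susannah L. Rose,Murat Kantarcioglu,Bradley A. Malin,Zhijun Yin
Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Virginia Tech
Categories: Computation and Language (cs.CL)
Comments: Accepted to GEM 2026, ACL 2026 Workshop; 9 pages main text plus references and appendices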
Abstract:Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
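[NLP-42] Calibrating LLMs with Semantic-level Reward
Quick read: This paper addresses poor calibration of large language models (LLMs) deployed in high-stakes settings (such as medical question answering and legal reasoning). Standard reinforcement learning with verifiable rewards (RLVR) uses a binary correctness reward that ignores confidence, so confident but wrong predictions go unpenalized, and calibration methods based on verbalized confidence calibrate at the token level and are therefore inconsistent across semantically equivalent phrasings. The key is the Calibration with Semantic Reward (CSR) framework, which calibrates the model directly in semantic space without a verbalized confidence interface. It combines the correctness reward with a new semantic calibration reward that promotes semantic agreement among correct rollouts (exploitation) and discourages spurious consistency among incorrect rollouts (exploration), which markedly lowers expected calibration error (ECE) and raises AUROC, with calibration behavior generalizing robustly in both in-distribution and out-of-distribution settings.
Link: https://arxiv.org/abs/2605.15588
Authors: Fengfei Yu,Ruijia Niu,Dongxia Wu,Yian Ma,Rose Yu
Affiliations: Department of Computer Science and Engineering, University of California San Diego; Halıcıoğlu Data Science Institute, University of California San Diego
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: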
Abstract:As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
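The abstract above reports results as expected calibration error (ECE). As a reading aid, here is a minimal sketch of the common equal-width-bin ECE estimator. The function name, the choice of 10 bins, and the toy data are assumptions for illustration; the paper's exact evaluation protocol is not given in this listing.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data: four answers at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In the toy case the model claims 0.9 confidence but is right 75% of the time, so the estimator returns a gap of about 0.15. A model that is confident and wrong more often would score higher.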
[NLP-43] Measuring Maximum Activations in Open Large Language Models
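[Quick Read]: This paper examines the dynamic range of activations, a constraint on low-bit quantized deployment of modern open large language models (LLMs): how the maximum activation magnitude varies across model families, generations, and training stages, and what that implies for low-bit quantization. Earlier studies characterized pre-2024 LLaMA-style models, and current activation-quantization pipelines still rely on that picture without revisiting it for today's open-model ecosystem. The key to the solution is a unified measurement pipeline: a 5,000-sample multi-domain corpus, family-specific tokenizers, and identical hooks on embeddings, hidden states, attention, MLP/MoE layers, SwiGLU gates, and the final norm, used to measure global and layerwise activation maxima on 27 checkpoints from 8 open families (dense, MoE, vision-language, intermediate-training, and instruction-tuned variants). The authors find that maxima span nearly four orders of magnitude at comparable parameter counts (Qwen3.5 and MoE checkpoints at roughly 10^2 to 10^3, Gemma3-27B-it at about 7 x 10^5), that MoE checkpoints peak 14.0-23.4x lower than matched-scale dense counterparts, and that the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error through activation-scale selection, supporting the conclusion that maximum activation magnitude is a property tied to family, architecture, and training stage rather than a simple byproduct of parameter count.
Link: https://arxiv.org/abs/2605.15572
Authors: Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin
Affiliations: Shanghai Jiao Tong University; Baidu Inc.; Nankai University
Categories: Computation and Language (cs.CL)
Comments: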
Abstract:The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at this https URL.
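To make the link between activation maxima and low-bit reconstruction error concrete, the sketch below quantizes a list of values to signed INT8 with a single absmax scale and measures the round-trip error. Per-tensor symmetric absmax scaling is an assumption here (the abstract only says "activation-scale selection"), and the activation values are invented for illustration.

```python
def int8_absmax_roundtrip(values):
    """Quantize to signed INT8 with one absmax scale per tensor, then dequantize."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / 127.0  # the largest magnitude maps to +/-127
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values):
    """Mean absolute difference between the values and their INT8 round trip."""
    restored = int8_absmax_roundtrip(values)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical activations: 101 values in [-5, 5], then the same plus one large outlier.
typical = [0.1 * i for i in range(-50, 51)]
with_outlier = typical + [7e5]

error_typical = mean_abs_error(typical)
error_with_outlier = mean_abs_error(with_outlier)
```

With the outlier present, the scale grows to several thousand per integer step, so every ordinary value rounds to zero and the mean error rises by orders of magnitude. This is the qualitative effect the sanity check in the abstract points at, not a reproduction of it.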
[NLP-44] GiLT: Augmenting Transformer Language Models with Dependency Graphs
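[Quick Read]: This paper asks how linguistic structure can be incorporated into Transformer language models to improve syntactic generalization; prior work has focused mainly on syntactic trees, in particular constituency trees. The key to the solution is the Graph-Infused Layers Transformer Language Model (GiLT), which injects structure through dependency graphs. Rather than inserting extra structural tokens into the sequence, GiLT extracts features from a dependency graph built incrementally alongside token prediction and uses them to modulate attention weights in the Transformer. Experiments show that GiLT with semantic dependency graphs achieves better syntactic generalization while keeping perplexity competitive with Transformer language model baselines, and that it can be finetuned from a pretrained language model to improve downstream task performance.
Link: https://arxiv.org/abs/2605.15562
Authors: Tianyu Huang, Yida Zhao, Chuyan Zhou, Kewei Tu
Affiliations: School of Information Science and Technology, ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging
Categories: Computation and Language (cs.CL)
Comments: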
Abstract:Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at this https URL.
[NLP-45] When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation
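[Quick Read]: This paper targets the interface mismatch between continuous latent states and discrete tokens when continuous diffusion and flow models are used for non-autoregressive text generation. Even when generated latents are close to real encoder latents under geometric measures such as scale matching or cosine similarity, the decoder can still produce high-entropy, biased, or repetitive token distributions. The key to the solution is to frame the task as controlled local refinement rather than full generation from noise: use full 768-dimensional BERT latents instead of compressed 256-dimensional ones to retain more information, and add local flow refinement and a fused decoder-aware readout to improve decoder recoverability. Metric learning and optimal-transport-style alignment improve latent geometry but do not close the decoder gap. The main result is diagnostic: latent geometry alone does not guarantee decoding quality, and continuous latent text generation should be evaluated primarily by decoder recoverability.
Link: https://arxiv.org/abs/2605.15557
Authors: De Shuai Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 1 figure, 6 tables. Technical Report v1. Stage 1 complete; Stage 2 ongoing. Code: this https URL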
Abstract:Continuous diffusion and flow models are attractive for non-autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft-conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian-start experiments showed that good latent-space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768-dimensional BERT latents recover tokens much better than compressed 256-dimensional latents. With 768-dimensional latents, DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder-aware readout give modest additional gains, while metric learning and OT-style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure.
[NLP-46] DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation
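[Quick Read]: This paper identifies a key inefficiency in how prompts are chosen for distilling vision-language models (VLMs): up to 69% of the prompts in standard chart / document reasoning datasets are zero-delta, meaning the teacher and student already induce the same answer distribution. Such prompts carry little learning signal, so student improvement saturates quickly as data scales. The key to the solution is a return to first principles: distillation minimizes distributional divergence, so a prompt is valuable only if it exposes a functional capability gap between teacher and student. The paper quantifies this gap through answer divergence ($\Delta$) and shows that non-zero divergence is critical for effective scaling. Building on this, it proposes a staged synthesis pipeline that uses existing datasets as seeds and targets student failure modes to generate better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic high-divergence reasoning problems that yields up to 15% relative improvement across several distillation and fine-tuning settings.
Link: https://arxiv.org/abs/2605.15532
Authors: Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna, Prithviraj Ammanabrolu, Yejin Choi
Affiliations: NVIDIA Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: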
Abstract:Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking), averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.
[NLP-47] Process Rewards with Learned Reliability
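[Quick Read]: This paper addresses the fact that current process reward models (PRMs) output only a single reward score per step with no indication of how reliable that prediction is, so downstream methods must treat imperfect step-level rewards as trustworthy decision signals. The key to the solution is BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, instead of regressing to the finite-sample success ratio. The reliability signal lets downstream applications separate reliable rewards from uncertain ones. As one application, Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning stops early when a high-reward solution is reliable and spends more computation on uncertain candidate prefixes, improving final-answer accuracy while reducing token usage.
Link: https://arxiv.org/abs/2605.15529
Authors: Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang
Affiliations: Washington University in St. Louis; Singapore University of Technology and Design
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: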
Abstract:Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy–token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
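The Beta-Binomial likelihood named in the abstract can be written down directly. The sketch below computes the negative log-likelihood of observing `k` successful continuations out of `n` under a Beta(`alpha`, `beta`) belief. The function names are hypothetical, and reading the concentration `alpha + beta` as a reliability proxy is an assumption for illustration; the listing does not say how BetaPRM defines its reliability signal.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k, n, alpha, beta):
    """Negative log-likelihood of k successes in n trials under a Beta(alpha, beta) belief."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_prob = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_prob


def belief_summary(alpha, beta):
    """Mean success probability and concentration (assumed reliability proxy)."""
    return alpha / (alpha + beta), alpha + beta


# Hypothetical step: 6 of 8 Monte Carlo continuations succeeded.
nll_sharp = beta_binomial_nll(6, 8, alpha=30.0, beta=10.0)  # confident belief near 0.75
nll_vague = beta_binomial_nll(6, 8, alpha=0.75, beta=0.25)  # same mean, low concentration
```

Both beliefs have the same mean of 0.75 but differ in concentration, which a point-target regression on the ratio 6/8 could not express. That difference is what a distributional target of this kind makes learnable.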
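[NLP-48] DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection ACL2026
[Quick Read]: This paper addresses the underexplored reliability of detectors for large language model (LLM) generated content in real-world multilingual settings. The key to the solution is DetectRL-X, a comprehensive multilingual benchmark covering 8 languages commonly used in commercial contexts and 6 domains highly susceptible to LLM misuse. Texts are generated with 4 popular commercial LLMs, and typical AI-assisted writing operations (polishing, expanding, condensing) are included to match real usage patterns. A multilingual paraphrasing and perturbation attack framework simulates diverse human modifications and writing noise, enabling stress tests of detectors across languages. The evaluation reveals the strengths and limitations of current state-of-the-art detectors across diverse linguistic resources and analyzes how domains, generators, attack strategies, text length, and refinement operations affect performance in each language.
Link: https://arxiv.org/abs/2605.15518
Authors: Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ACL 2026 Main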
Abstract:The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs, and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.
[NLP-49] RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts Provably NEURIPS2026
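[Quick Read]: This paper exposes intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. The theoretical analysis proves that as context length grows, RoPE-based attention loses two key properties. The first is locality bias: the model is no more likely to favor nearer positions than substantially farther ones. The second is consistency in token relevance: a key vector that scores higher than an alternative at one position may score lower at another. In both cases the probability of failure approaches 0.5, no better than random guessing. The attention score can also remain unchanged when a key token is moved to a different position or even replaced by a different token, so positions and tokens are not reliably distinguished. The empirical analysis indicates that multi-head, multi-layer architectures do not overcome these limitations. The central conclusion is that adjusting the RoPE base only trades off distinguishing positions against distinguishing tokens and cannot preserve both, so future long-context Transformers may need fundamentally new mechanisms for encoding position and token order. The paper does not propose a specific fix; it characterizes the problem and argues for exploring new mechanisms.
Link: https://arxiv.org/abs/2605.15514
Authors: Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng
Affiliations: University of Illinois at Urbana-Champaign; University of Bonn; Argonne National Laboratory; Amazon AGI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages, 11 figures, submitted to NeurIPS 2026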
Abstract:We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today’s long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
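For readers who want to probe RoPE scores numerically, the following minimal sketch implements the standard interleaved-pair rotation and the resulting query-key score. It is background only and does not reproduce the paper's proofs or experiments. The vectors and the helper names `rope` and `rope_score` are hypothetical, and `base=10000.0` is the commonly used default rather than a value taken from the abstract.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles pos * base**(-i/d), i = 0, 2, 4, ..."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_score(query, key, query_pos, key_pos, base=10000.0):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    q = rope(query, query_pos, base)
    k = rope(key, key_pos, base)
    return sum(a * b for a, b in zip(q, k))


# Hypothetical 4-dimensional query and key.
q_vec = [0.3, -1.2, 0.7, 0.5]
k_vec = [1.0, 0.4, -0.6, 0.9]

# The score depends only on the offset between positions, not on absolute position.
near = rope_score(q_vec, k_vec, 5, 2)
shifted = rope_score(q_vec, k_vec, 105, 102)
```

Sweeping `key_pos` over a long range and changing `base` is one simple way to see how the score oscillates with distance, which is the kind of behavior the abstract analyzes theoretically.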
[NLP-50] STS: Efficient Sparse Attention with Speculative Token Sparsity
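[Quick Read]: This paper targets the memory and compute bottleneck that the quadratic complexity of attention imposes on large language model (LLM) inference, which is especially acute for emerging agentic applications that process multi-million-token sequences. The key to the solution is STS, a sparse attention mechanism that requires no model retraining. Its core insight is that tokens a smaller draft model identifies as important are highly predictive of the tokens that matter to a larger target model. Integrated into speculative decoding frameworks, STS reuses the draft model's attention scores to dynamically build a token-and-head-wise sparsity mask that prunes the expensive attention computation in the target LLM. On the representative benchmark NarrativeQA, STS achieves a 2.67x speedup at approximately 90% sparsity with negligible accuracy degradation relative to dense attention, which the authors describe as a new state of the art on the sparsity-accuracy trade-off.
Link: https://arxiv.org/abs/2605.15508
Authors: Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie
Affiliations: The Hong Kong University of Science and Technology; UC Berkeley
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 12 figures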
Abstract:The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model’s attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup while operating at approximately 90% sparsity on the representative benchmark NarrativeQA, with negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.
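The idea of turning draft-model attention scores into a per-head keep mask can be sketched as follows. Per-head top-k selection, a one-to-one mapping between draft and target heads, and all names and numbers are assumptions for illustration; the abstract does not specify how STS builds or applies its mask.

```python
import math


def topk_keep_mask(draft_scores, sparsity):
    """draft_scores[h][t] -> mask[h][t], keeping the top (1 - sparsity) share of tokens per head."""
    masks = []
    for head_scores in draft_scores:
        n_keep = max(1, round(len(head_scores) * (1.0 - sparsity)))
        ranked = sorted(range(len(head_scores)), key=lambda t: head_scores[t], reverse=True)
        kept = set(ranked[:n_keep])
        masks.append([t in kept for t in range(len(head_scores))])
    return masks


def sparse_attention(logits, values, mask):
    """Softmax attention for one head and one query, restricted to the kept tokens."""
    kept = [t for t, keep in enumerate(mask) if keep]
    peak = max(logits[t] for t in kept)
    weights = [math.exp(logits[t] - peak) for t in kept]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, kept)) / norm for d in range(dim)]


# Hypothetical single head over 10 cached tokens, pruned at 90% sparsity.
draft = [[0.50, 0.10, 0.30, 0.05, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0]]
mask = topk_keep_mask(draft, sparsity=0.9)

target_logits = [2.0, 0.1, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
target_values = [[float(t), 1.0] for t in range(10)]
output = sparse_attention(target_logits, target_values, mask[0])
```

Only the kept positions contribute to the softmax, so the target model reads a fraction of its key-value cache. The accuracy of the result then depends on how well the draft scores predict the target's important tokens, which is the insight the abstract relies on.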
[NLP-51] FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models
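[Quick Read]: This paper addresses two gaps in existing financial benchmarks for large language models (such as FinQA and ConvFinQA): they lack an explicit hierarchy of professional difficulty, and they do not adequately measure the transition from foundational financial knowledge to expert-level reasoning. The key to the solution is FINESSE-Bench, a hierarchical evaluation suite of 8 specialized benchmarks with 3,993 questions in total. It combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design supports evaluation of domain breadth, performance degradation as difficulty increases, computational problem solving, and behavior in specialized financial domains. It is paired with a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, plus automated scoring of freeform answers based on the LLM-as-judge paradigm.
Link: https://arxiv.org/abs/2605.15482
Authors: Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov
Affiliations: Lime FinTech
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 10 tables, 2 figures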
Abstract:Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.
[NLP-52] Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction
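[Quick Read]: This paper addresses the challenge of extracting actionable observations from conversational nurse-patient transcripts and converting them into structured representations at scale, with the aim of reducing clinical documentation burden so that clinicians can spend more time on direct patient care. The key to the solution is a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus and combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit. It is evaluated with two LLM backbones, Llama-4-Scout-17B-16E-Instruct and GPT-5.2, each with a corresponding embedding model for RAG. The best configuration (GPT-5.2 with full schema, RAG, and a second-pass audit) reaches an F1 score of 80.36%. The main findings are that RAG consistently improves performance, that the optimal degree of schema constraint depends on the model, and that the second-pass audit yields modest additional gains by correcting residual schema-adherence errors.
Link: https://arxiv.org/abs/2605.15467
Authors: A H M Rezaul Karim, Ozlem Uzuner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: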
Abstract:Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus and combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit. We evaluate it with two LLM backbones, Llama-4-Scout-17B-16E-Instruct and GPT-5.2, with corresponding embedding models for RAG. Our best configuration uses GPT-5.2 with full schema, RAG, and a second-pass audit, achieving an F1 score of 80.36%. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.
[NLP-53] GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
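[Quick Read]: This paper addresses the cost of RL-based post-training: in-domain reinforcement learning from verifiable rewards (RLVR) performs well on reasoning tasks but requires substantial GPU compute (data volume and training time), which limits broad adoption. The key to the solution is GRLO: reinforcement learning from human feedback (RLHF) learned from scratch in open-ended environments from a small set of interactions (5K prompts and 22.7 GPU hours), relying on the explicitly acquired conversational abilities to transfer implicitly to downstream tasks such as mathematical reasoning and code generation. This cuts training cost substantially (about 46x less data and 68x less compute than a strong in-domain RLVR baseline), and the resulting model is competitive on average with officially released post-trained models that required much larger training cost.
Link: https://arxiv.org/abs/2605.15464
Authors: Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi
Affiliations: University of California, Riverside
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: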
Abstract:Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation. We call this approach GRLO. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about $46\times$ less data and $68\times$ less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen’s released post-trained models which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: this https URL.
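[NLP-54] Reasoning Models Don't Just Think Longer, They Move Differently
[Quick Read]: The core question of this paper is whether the longer chains of thought of reasoning-trained language models reflect only more computation steps or also a different internal trajectory, that is, whether the model follows a different computational path on harder problems. In existing analyses, raw trajectory geometry (for example path length and curvature) is dominated by the mechanical effect of generation length, so directly comparing generations of different difficulty is misleading. The key to the solution is a length-correction step: trajectory statistics are residualized on generation length to remove this confound, and the coupling between problem difficulty and internal dynamics is then examined on the corrected trajectory geometry. The study finds corrected difficulty-geometry coupling in all three domains studied (competitive programming, mathematics, and Boolean satisfiability). In the code domain, harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines, and stronger corrected coupling co-occurs with behavioral annotations of strategy shifts and uncertainty monitoring. Length correction is thus established as a prerequisite for generation-time trajectory analysis, and the effect of reasoning training on corrected trajectory geometry is domain-dependent.
Link: https://arxiv.org/abs/2605.15454
Authors: Anders Gjølbye, Lars Kai Hansen, Sanmi Koyejo
Affiliations: Technical University of Denmark; Stanford University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint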
Abstract:Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.
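Residualizing a trajectory statistic on generation length amounts to fitting a regression on length and keeping what is left over. The sketch below uses ordinary least squares with an intercept on raw length. That functional form and the toy numbers are assumptions, since the abstract does not state which regression the authors use.

```python
def residualize_on_length(statistic, length):
    """Return statistic minus its least-squares linear fit on length (with intercept)."""
    n = len(statistic)
    if n != len(length) or n < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    mean_x = sum(length) / n
    mean_y = sum(statistic) / n
    var_x = sum((x - mean_x) ** 2 for x in length)
    if var_x == 0:
        raise ValueError("all lengths are identical; the slope is undefined")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(length, statistic)) / var_x
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(length, statistic)]


# Hypothetical data: a path statistic that grows with the number of generated tokens.
token_counts = [100, 200, 400, 800]
path_stat = [205.0, 401.0, 803.0, 1606.0]
corrected = residualize_on_length(path_stat, token_counts)
```

The residuals sum to zero and are uncorrelated with length by construction, so any remaining association with problem difficulty cannot be a by-product of longer generations alone.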
[NLP-55] Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis
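[Quick Read]: This paper addresses the finding that surprisal computed from language models (LMs) systematically underestimates the processing difficulty humans show under syntactic ambiguity, particularly in garden path sentences. The hypothesis under test is that LMs and humans differ in a computational constraint: how many distinct sentence interpretations (parses) they can consider at once. The key to the approach is to use Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, systematically varying the number of simultaneously active parses used to compute word surprisal, and then using those surprisals to predict human reading times. Reducing the number of active parses does increase the predicted magnitude of garden path effects, but not nearly enough to match the human effects, which indicates that this difference alone cannot reconcile LM-based surprisal with human sentence processing.
Link: https://arxiv.org/abs/2605.15440
Authors: William Timkey, Brian Dillon, Tal Linzen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: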
Abstract:Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences in the computational constraints between humans and LMs. Here we test one such hypothesis, specifically, that LMs may be able to simultaneously consider a greater number of distinct sentence interpretations at once, compared to humans. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
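A toy calculation shows why a narrower beam raises surprisal at a disambiguating word. Each parse carries a prefix probability and a probability of continuing with the next word; surprisal is the negative log ratio of the beam's total probability after and before that word. This ratio estimate, the pruning rule, and all numbers are illustrative assumptions, not the paper's RNNG setup.

```python
import math


def word_surprisal_bits(parses, beam_width):
    """parses: (prefix_probability, next_word_probability_given_parse) pairs.

    Keep the beam_width most probable parses, then compare the probability
    mass before and after the next word.
    """
    kept = sorted(parses, key=lambda p: p[0], reverse=True)[:beam_width]
    mass_before = sum(prefix for prefix, _ in kept)
    mass_after = sum(prefix * cont for prefix, cont in kept)
    return -math.log2(mass_after / mass_before)


# Hypothetical garden path: the preferred parse (0.9) almost rules out the
# disambiguating word, while the dispreferred parse (0.1) predicts it well.
candidate_parses = [(0.9, 0.001), (0.1, 0.5)]

wide = word_surprisal_bits(candidate_parses, beam_width=2)
narrow = word_surprisal_bits(candidate_parses, beam_width=1)
```

With both parses active the word costs roughly 4.3 bits; with only the preferred parse it costs about 10 bits. The direction matches the abstract's finding that fewer simultaneous parses increase predicted garden path effects, although the abstract also reports that this increase falls well short of human magnitudes.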
[NLP-56] Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance
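[Quick Read]: This paper aims to reveal how neural activation patterns differ across large language model (LLM) architectures (encoder versus decoder) when handling diverse cognitive tasks. It measures final activation values, attention entropy, and sparsity patterns over 144 task-model combinations to explain how architectural properties relate to task performance. The key to the approach is to quantify metrics such as attention entropy and sparsity systematically and to compare them on specific tasks such as mathematical reasoning. The analysis finds that decoder models exhibit higher sparsity, while mathematical reasoning produces the highest attention entropy across all architectures, giving a task-based computational basis for model selection and optimization.
Link: https://arxiv.org/abs/2605.15436
Authors: Mahdi Naser-Moghadasi,Faezeh Ghaderi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, accepted at IEEE BigData 2025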
Abstract:This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model (LLM) architectures, examining their performance on twelve cognitive task categories. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks. Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors, with implications for model selection and optimization in big data applications.
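The abstract names attention entropy and sparsity without defining them. The sketch below shows one common way such quantities are computed, assuming Shannon entropy over each attention row and a magnitude threshold for sparsity; both definitions, the threshold value, and the function names are assumptions, not the paper's stated method.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (nats) over the rows of an attention matrix whose rows sum to 1."""
    attn = np.asarray(attn, dtype=float)
    # eps guards against log(0) for exactly-zero attention weights.
    return float(np.mean(-np.sum(attn * np.log(attn + eps), axis=-1)))

def activation_sparsity(acts, threshold=1e-3):
    """Fraction of activations whose magnitude is below `threshold`."""
    acts = np.asarray(acts, dtype=float)
    return float(np.mean(np.abs(acts) < threshold))

uniform = np.full((4, 4), 0.25)   # every token attends equally to all four positions
peaked = np.eye(4)                # every token attends to exactly one position
h_uniform = attention_entropy(uniform)
h_peaked = attention_entropy(peaked)
sparsity = activation_sparsity([0.0, 0.5, 0.0, -2.0])
```

Under these definitions, diffuse attention gives high entropy (log 4 for the uniform case) and sharply focused attention gives entropy near zero, so "highest attention entropy" reads as attention spread over many positions.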
[NLP-57] From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery
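[Quick Read]: This paper targets problems in existing large language model (LLM)-based alpha factor discovery methods that stem from their reliance on prompt-level generation-evaluation-feedback loops. As the loop runs longer, repeatedly appended historical candidates and feedback cause context explosion, raise inference cost, dilute useful information, and introduce feedback drift. In addition, the stable generation preferences of very large LLMs tend to yield structurally similar expressions and redundant candidates, and lead to search stagnation. The proposed solution is QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Its key innovation is to convert executable quantitative evaluation (Regime Backtest) directly into policy updates, so that the Miner LLM internalizes historical optimization experience through parameter learning instead of accumulating feedback in the prompt. Specifically, the framework constructs high-quality seed factors, builds diverse seed-time-window training tasks, generates executable Factor DSL expressions, and optimizes the Miner LLM's parameters with a Diversity-Complementarity Reward, while continuously storing the high-quality factors it mines in a Mined Factor Database that forms the final factor library. This mechanism avoids context explosion and feedback drift at the root, and the reward design improves the quality and complementarity of the factor pool.
Link: https://arxiv.org/abs/2605.15412
Authors: Lingzhe Zhang,Tong Jia,Yunpeng Zhai,Zixuan Xie,Chiming Duan,Minghua He,Philip S. Yu,Ying Li
Affiliations: Peking University; Alibaba Group; Nanjing University; University of Illinois Chicago
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: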
Abstract:Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation–evaluation–feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed–time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of QuantEvolver, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.
[NLP-58] Capability Conditioned Scaffolding for Professional Human LLM Collaboration
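[Quick Read]: This paper addresses the fact that large language model personalization only adapts outputs to user preferences and style, while ignoring differences in users' evaluation capacity across professional domains. The resulting risk is "Professional Domain Drift", in which users over-rely on AI-generated reasoning in domains they cannot reliably evaluate. The key to the solution is Capability Conditioned Scaffolding, a typed framework that partitions user expertise into strong, mixed, and weak domains and conditions the AI's intervention behavior on structured capability profiles. In a pilot evaluation across multiple MMLU subsets and four LLM substrates, the framework shows intervention behavior consistent with the capability profiles, including categorical inversion under profile swapping and selective activation in mixed-domain risk zones. This suggests that capability-aware scaffolding can go beyond stylistic personalization and support more reliable professional human-AI collaboration.
Link: https://arxiv.org/abs/2605.15404
Authors: Sen Yang,Yinglei Ma
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: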
Abstract:Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expertise. This limitation can encourage Professional Domain Drift, where users rely on AI generated reasoning in domains they cannot reliably evaluate. We introduce Capability Conditioned Scaffolding, a typed framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles. A pilot evaluation across multiple MMLU subsets and four LLM substrates shows consistent profile conditioned intervention behavior, including categorical inversion under profile swapping and selective activation in mixed domain risk zones. These findings suggest that capability aware scaffolding can support more reliable professional human AI collaboration beyond stylistic personalization.
[NLP-59] Adesua: Development and Feasibility Study of an AI WhatsApp Bot for Science Learning in West Africa
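[Quick Read]: This paper addresses the difficulty students in Sub-Saharan Africa face in obtaining personalized learning support and formative assessment, caused by high student-teacher ratios and shortages of qualified teachers. The key to the solution is Adesua, a WhatsApp-based AI teaching assistant that builds on WhatsApp's widespread adoption in Africa. It combines curated textbooks and 33 years of national examination questions with generative AI, offering conversational question answering and automated grading with feedback through a chatbot, and thereby provides accessible, curriculum-aligned, low-cost personalized learning support for junior and senior high school students in West Africa.
Link: https://arxiv.org/abs/2605.15376
Authors: George Boateng,Evans Atompoya,Philemon Badu,Samuel John,Samuel Ansah,Patrick Agyeman-Budu,Victor Wumbor-Apin Kumbol
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 11 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)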
Abstract:Sub-Saharan Africa faces persistently high student-teacher ratios and shortages of qualified teachers, limiting students’ access to personalized learning support and formative assessment. To address this challenge, we present Adesua, a WhatsApp-based AI Teaching Assistant for science education that extends the Kwame for Science platform. Adesua leverages WhatsApp’s widespread adoption in Africa to provide accessible, curriculum-aligned learning support for Junior High School (JHS) and Senior High School (SHS) students across West Africa. The system integrates curated textbooks and 33 years of national examination questions with generative AI to enable conversational question answering and automated assessment with feedback via a WhatsApp bot. Students can ask science questions, take timed or untimed multiple-choice tests by topic or exam year, and receive instant grading and detailed explanations of correct and incorrect responses. A 6-month feasibility deployment in 2025 had 56 active users in Ghana, including students and parents. Quantitative evaluation showed a high perceived usefulness, with a helpfulness score of 93.75% for AI-generated answers, albeit with a small number of ratings (n=16). These preliminary results provide a basis for more extensive future evaluation of a WhatsApp-based AI assistant to assess its potential to offer scalable, low-cost personalized learning support and formative assessment in resource-constrained educational contexts.
[NLP-60] Greedy or not here I come: Language production under vocabulary constraints in humans and resource-rational models
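[Quick Read]: This paper asks how humans plan communication to maximize intelligibility under vocabulary constraints (using only a limited set of high-frequency words), and whether their behavior is more consistent with greedy sampling or with globally optimal sampling. The key to the approach is to use large language models with Sequential Monte Carlo inference to simulate greedy and globally optimal sampling algorithms, and to compare both systematically with human response patterns under varying vocabulary limits (only 250 high-frequency words in the most restrictive condition), thereby revealing resource-rational characteristics of human cognition.
Link: https://arxiv.org/abs/2605.15365
Authors: Thomas Hikaru Clark,Sihan Chen,Laura Nicolae
Affiliations: Department of Brain and Cognitive Sciences, MIT; Department of Economics, Harvard University
Categories: Computation and Language (cs.CL)
Comments: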
Abstract:Communicating using only a limited vocabulary is a common but challenging cognitive phenomenon, requiring an ideal communicator to plan carefully to optimize for intelligibility while circumventing a constrained lexicon. In this work, we investigate how humans respond to a broad array of questions under variable vocabulary limitations, consisting of only 250 highly frequent words at the most restrictive. We provide theoretically motivated comparisons to greedy and globally optimal sampling algorithms using Sequential Monte Carlo inference with large language models. Humans generally resemble greedy sampling more than globally optimal sampling, though more skilled humans are more likely to backtrack and revise – a non-greedy behavior. An observed human pattern of leaning on semantically light words in high-constraint settings falls out of both greedy and globally optimal sampling. We discuss the results and their broader implications for resource-rational cognition, psycholinguistics, L2 communication, and language impairments.
[NLP-61] From I/O to Code with Discovery Agent
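[Quick Read]: This paper tackles the fundamental challenge of automatically synthesizing programs from input-output behavior (IO2Code). Unlike natural language to code (NL2Code), which has seen great success, IO2Code cannot exploit the semantic alignment acquired during pretraining; it must recover underlying principles from concrete computational behavior and search a vast, underspecified hypothesis space. The key to the solution is DIO-Agent, a discovery agent that frames IO2Code as evolutionary search over a discrete program space: a large language model (LLM) serves as the mutation operator, and concrete error signals from execution guide each mutation. The Transformation Priority Premise is introduced as a mutation prior that biases the LLM toward the simplest hypothesis consistent with current evidence, escalating from constants to conditionals to iteration only when simpler constructs are insufficient, which keeps the search from wandering into structurally complex but incorrect dead ends.
Link: https://arxiv.org/abs/2605.15334
Authors: Yihong Dong,Jiaru Qian,Haoran Zhang,Peixu Wang,Binhua Li,Zhi Jin,Yongbin Li,Ge Li,Xiaokang Yang,Xue Jiang
Affiliations: School of Computer Science, Peking University; Tongyi Lab, Alibaba Group; Wuhan University; Renmin University of China; National University of Singapore; Shanghai Jiaotong University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: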
Abstract:The automatic synthesis of a program from any form of specification is regarded as a holy grail of computer science. Fueled by LLMs, NL2Code has achieved tremendous success, yet the fundamentally more challenging task of synthesizing programs from input-output behavior, which we refer to as IO2Code, remains largely unsolved. Whereas NL2Code can exploit the semantic alignment between natural language and code acquired during pretraining, IO2Code requires recovering underlying principles from concrete computational behavior, navigating a vast and underspecified hypothesis space. To address this, we propose DIO-Agent, a discovery agent for IO2Code. Our method frames IO2Code as an evolutionary search over discrete program space, in which an LLM serves as the mutation operator and concrete error signals from execution guide each mutation. To prevent the search from wandering into structurally complex yet incorrect dead ends, we introduce the Transformation Priority Premise as a mutation prior that biases the LLM toward the simplest hypothesis consistent with current evidence, progressively escalating from constants to conditionals to iteration only when simpler constructs are insufficient. To facilitate systematic study, we further construct an IO2CodeBench spanning multiple difficulty levels. Extensive experiments show that DIO-Agent consistently outperforms both traditional program-by-example methods and SOTA evolution-agent baselines across all difficulty levels and various LLMs, while substantially surpassing test-time scaling strategies with equivalent sampling budgets.
[NLP-62] Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
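[Quick Read]: This paper addresses the fact that LLM-powered coding agents spend a large share of their token budget reading repository files even though much of the retrieved code is irrelevant to the task at hand. Existing single-objective sequence labelers collapse all facets of code relevance into one score and one transition matrix, which creates a modeling bottleneck: a single conditional random field (CRF) transition prior cannot serve heterogeneous retention patterns such as contiguous semantic spans and sparse structural support lines at the same time. The key to the solution is LaMR (Latent Multi-Rubric), a framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network weights the per-dimension emission scores dynamically according to the query, and a final CRF layer on the fused emissions produces the keep-or-prune decision. Multi-dimension labels are derived automatically from the existing training corpus through abstract syntax tree (AST)-based program analysis, at no extra annotation cost, which also denoises the teacher labels. By filtering out distracting content, LaMR frequently matches or even outperforms the unpruned full-context baseline.
Link: https://arxiv.org/abs/2605.15315
Authors: Jingjing Wang,Xiwen Chen,Wenhui Zhu,Huayu Li,Zhengxiao He,Feiyang Cai,Ana S. Carreon-Rascon,Xuanzhao Dong,Feng Luo
Affiliations: Clemson University; Morgan Stanley; Arizona State University; University of Arizona
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: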
Abstract:LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher’s binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.
[NLP-63] DiscoExplorer: An Open Interface for the Study of Multilingual Discourse Relations
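[Quick Read]: This paper addresses the complexity and limited accessibility of cross-lingual research on discourse relations. Although discourse relation inventories have been increasingly standardized across datasets, the complexity of the data and the lack of easy-to-use analysis interfaces keep them from being widely used for cross-lingual comparison. The key to the solution is DiscoExplorer, a publicly released open-source web interface that can run locally. It integrates the discourse relation classification datasets from the DISRPT Shared Task, covering 16 languages, and provides a query language together with search and visualization facilities for analyzing relations and their signaling devices (such as connectives), lowering the barrier to cross-lingual discourse relation research.
Link: https://arxiv.org/abs/2605.15304
Authors: Amir Zeldes
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL)
Comments: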
Abstract:The relations connecting propositions in discourse such as cause (A because B) or concession (A although B) are a subject of intense interest in Computational Linguistics and Pragmatics, but challenging to study and compare across languages. Recent progress in standardizing discourse relation inventories across datasets offers the potential to facilitate such studies, but is hindered by the complexity of relevant data and the lack of easily accessible interfaces to analyze it. In this paper we present DiscoExplorer, a new open source web interface, capable of running on local computers, which we use to make datasets from the DISRPT Shared Task on discourse relation classification publicly available, covering 16 different languages. We present the query language, search and visualization facilities for relations and signaling devices such as connectives, as well as some example studies.
[NLP-64] PhysBrain 1.0 Technical Report
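[Quick Read]: This paper addresses the problem that robot trajectory data has too limited coverage to support learning of generalizable physical understanding. The core of the solution is a data engine that extracts scene elements, spatial dynamics, action execution, and depth-aware relations from large-scale human egocentric video and converts them into structured question-answer supervision for training a physical vision-language model (PhysBrain VLM), which thereby acquires physical commonsense priors. These priors are then transferred to a vision-language-action (VLA) policy through a capability-preserving and language-sensitive adaptation design, bridging multimodal understanding and robot action.
Link: https://arxiv.org/abs/2605.15298
Authors: Shijie Lian,Bin Yu,Xiaopeng Lin,Changti Wu,Hang Yuan,Xiaolin Hu,Zhaolong Shen,Yuzhuo Miao,Haishan Liu,Yuxuan Tian,Yukun Shi,Cong Huang,Kai Chen
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL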
Abstract:Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
[NLP-65] Fluency and Faithfulness in Human and Machine Literary Translation
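[Quick Read]: This paper examines the potential tension between target-language fluency and faithfulness to the source in literary translation, and in particular whether current large language models (LLMs) preserve meaning accurately while producing fluent translations. The key to the approach is a large multi-source-language corpus (130,486 translated paragraphs from 106 novels in 16 source languages). Fluency is quantified as original-likeness using a translationese classifier based on part-of-speech n-grams, faithfulness is scored with the automatic translation evaluation metric COMET-KIWI, and the correlation between the two is analyzed while controlling for paragraph length. Comparing translation systems (human translation, Google Translate, TranslateGemma) reveals a consistent negative correlation between fluency and faithfulness, though it is weaker and often non-significant for TranslateGemma. This indicates a tradeoff between fluency and faithfulness in literary translation and shows that paragraph length is an important factor in automatic evaluation.
Link: https://arxiv.org/abs/2605.15282
Authors: Sarah Griebel,Ted Underwood
Affiliations: University of Illinois Urbana–Champaign; School of Information Sciences
Categories: Computation and Language (cs.CL)
Comments: Accepted at NLP4DH 2026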
Abstract:Literary translation requires balancing target-language fluency with faithfulness to the source. Recent large language models (LLMs) often produce fluent translations, but it remains unclear whether fluency corresponds to semantic preservation in literary text. We examine this relationship using 130,486 translated paragraphs from 106 novels in 16 source languages, including human, Google Translate, and TranslateGemma translations. Fluency is measured as original-likeness with a translationese classifier trained on paragraph part-of-speech n-grams, and faithfulness with the automatic translation evaluation metric COMET-KIWI. We control for paragraph length and find a consistent negative correlation between fluency and faithfulness. The pattern appears for both human and Google Translate, but is weaker and often non-significant for TranslateGemma. These results show that segment length matters for automatic evaluation and suggest a tradeoff between fluency and faithfulness in literary translation.
[NLP-66] PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
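[Quick Read]: This paper addresses the insufficient evaluation of how well large language models (LLMs) generate high-performance systems-level code. Existing code benchmarks focus mainly on functional correctness or algorithmic problem solving and lack systematic evaluation of systems-level performance concerns such as hardware-aware optimization, parallelism, and GPU operations. The key to the solution is PerfCodeBench, an executable benchmark whose tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task comes with executable correctness checks, a baseline implementation, and a reference optimized solution, so that both the correctness and the runtime efficiency of LLM-generated code can be evaluated. The benchmark exposes clear gaps in current models in parallelism, cross-language robustness, and reaching expert-level efficiency.
Link: https://arxiv.org/abs/2605.15222
Authors: Huihao Jing,Wenbin Hu,Haochen Shi,Hanyu Yang,Sirui Zhang,Shaojin Chen,Haoran Li,Yangqiu Song
Affiliations: HKUST (Hong Kong University of Science and Technology); NYU (New York University); SWPUL (Southwest University of Political Science and Law); MODEIO.AI
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: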
Abstract:Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation is still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLM-generated code at this https URL.
[NLP-67] Effective Harness Engineering for Algorithm Discovery with Coding Agents
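[Quick Read]: This paper studies how to design the execution infrastructure (the harness) when large language models (LLMs) are combined with evolutionary search for automated algorithm discovery. It poses three questions: under a fixed token budget, is it better to generate many algorithms with shallow thinking at each step or fewer algorithms with deeper thinking; how should evaluation hacks, where generated code exploits the scoring function, be handled; and how can agents that need full filesystem access execute safely in parallel. The key to the solution is the Vesper framework, which incorporates harness improvements addressing these questions. The core findings are that, under the same token budget, generating fewer algorithms with deeper thinking on each (raising the quality of each individual) is more budget-efficient than simply increasing the number of evolutionary generations, and that more capable models produce evaluation hacks at higher rates, so hack detection becomes increasingly necessary as models scale.
Link: https://arxiv.org/abs/2605.15221
Authors: Yoichi Ishibashi,Taro Yano,Masafumi Oyamada
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: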
Abstract:AlphaEvolve and FunSearch have demonstrated the potential of combining large language models (LLMs) with evolutionary search for automated algorithm discovery. However, discovery success is shaped not only by model capability but also significantly by the design of the execution infrastructure, i.e., the harness. This paper investigates effective harness design through three questions: under a fixed token budget, is it better to produce many algorithms with brief thought or fewer algorithms with deeper thought? How should the harness handle evaluation hacks, where generated programs exploit the scoring function? And how can agents that require full filesystem access execute safely in parallel? Using Vesper, an algorithm discovery framework that incorporates harness improvements addressing these questions, we evaluate on Circle Packing under the same token budget. Interestingly, generating fewer algorithms while thinking more deeply about each one achieved higher scores. That is, scaling the quality of each individual is more budget-efficient than scaling the number of evolutionary generations. Surprisingly, more capable models produced evaluation hacks at higher rates, making hack detection increasingly necessary as models scale.
[NLP-68] Always Learning Always Mixing: Efficient and Simple Data Mixing All The Time
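[Quick Read]: This paper addresses the fragmentation of data mixing methods across the language model training lifecycle: existing methods are designed separately for pretraining, continual learning, or adaptation, depend on fixed proxy models or fixed domain sets, and lack a unified, efficient online decision framework. The key to the solution is to treat data mixing as an online decision problem and to introduce the OP-Mix algorithm, which trains low-rank adapters directly on the current model and interpolates between them to simulate candidate data mixtures cheaply. This removes the need for separate proxy models and keeps the mixture search grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, the method finds near-optimal mixtures while using far less compute.
Link: https://arxiv.org/abs/2605.15220
Authors: Michael Y. Hu,Apurva Gandhi,Kyunghyun Cho,Tal Linzen,Pratyusha Sharma
Affiliations: New York University; Carnegie Mellon University; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: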
Abstract:Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision making problem – one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model’s actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.
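The idea of simulating a candidate mixture by interpolating between low-rank adapters can be sketched as follows. This assumes linear interpolation of the weight updates B @ A of per-domain adapters on a single weight matrix; the abstract does not spell out OP-Mix's actual interpolation scheme, and `interpolate_adapters` and the toy shapes are hypothetical.

```python
import numpy as np

def interpolate_adapters(base_weight, adapters, mixture):
    """Return base_weight plus the mixture-weighted sum of low-rank updates B @ A.

    adapters: list of (A, B) pairs, one per data domain, with A of shape (r, d_in)
              and B of shape (d_out, r).
    mixture:  candidate data-mixture weights, one per domain, summing to 1.
    """
    mixture = np.asarray(mixture, dtype=float)
    assert np.isclose(mixture.sum(), 1.0), "mixture weights should sum to 1"
    delta = sum(w * (B @ A) for w, (A, B) in zip(mixture, adapters))
    return base_weight + delta

rng = np.random.default_rng(0)
d, r = 8, 2
base = rng.normal(size=(d, d))
# One hypothetical adapter per domain, e.g. trained briefly on that domain alone.
adapters = [(rng.normal(size=(r, d)), rng.normal(size=(d, r))) for _ in range(2)]

only_first = interpolate_adapters(base, adapters, [1.0, 0.0])
half_half = interpolate_adapters(base, adapters, [0.5, 0.5])
only_second = interpolate_adapters(base, adapters, [0.0, 1.0])
```

Each candidate mixture then costs only a weighted sum of small matrices plus an evaluation pass, which is one plausible reading of why the abstract describes the simulation as cheap compared with training a proxy model per mixture.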
[NLP-69] MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models ICLR2026
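[Quick Read]: This paper addresses the difficulty of automatically generating procedural material node graphs. In computer graphics these graphs use a directed acyclic graph structure to support parametric, modular appearance modeling, but creating them by hand requires professional training. Existing neural program synthesis methods represent the graphs only as textual programs and ignore the inherently visual-spatial nature of node graphs that makes them accessible to humans. The key to the solution is MultiMat, a multimodal program synthesis framework that uses large multimodal models to process both the visual representation of a node graph (such as layout and connections) and its textual representation (such as node types and parameters), capturing its semantics more completely. The framework is trained on a new dataset of production-quality procedural materials and paired with a constrained tree search inference algorithm that searches the program space efficiently while ensuring static correctness. Experiments show that the multimodal approach outperforms text-only baselines in both unconditional and conditional graph synthesis, with higher visual quality and fidelity.
Link: https://arxiv.org/abs/2509.22151
Authors: Jonas Belouadi,Tamy Boubekeur,Adrien Kaiser
Affiliations: University of Mannheim; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2026 (poster)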
Abstract:Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
Information Retrieval
[IR-0] Argus: Evidence Assembly for Scalable Deep Research Agents
Link: https://arxiv.org/abs/2605.16217
Authors: Zhen Zhang,Liangcai Su,Zhuo Chen,Xiang Lin,Haotian Xu,Simon Shaolei Du,Kaiyu Yang,Bo An,Lidong Bing,Xinyu Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model’s limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator’s reasoning context stays under 21.5K tokens.
[IR-1] MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos
Link: https://arxiv.org/abs/2605.16120
Authors: Anh-Tai Pham-Nguyen,Tung-Duong Le-Duc,Anh-Duy Le,Trung-Hieu Truong-Le
Categories: Information Retrieval (cs.IR)
Comments: Accepted to SOICT 2025
Abstract:The growth of online video platforms drives the need for effective, semantically grounded event retrieval. We present MERVIN, a unified multimodal framework for Vietnamese news videos that integrates keyframes, transcripts, and video summaries. Transcript quality is enhanced via Gemini 1.5 Flash, reducing noise from accents, background sounds, and recognition errors. Visual features are extracted with Perception Encoder, while a Vietnamese language model produces textual embeddings; both are indexed in Milvus for efficient similarity-based retrieval. In addition, a React-based interface enables iterative query refinement across modalities, improving semantic alignment. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and successfully retrieving all results for every query in the final round.
[IR-2] Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization
Link: https://arxiv.org/abs/2605.16007
Authors: Fujun He,Chuyue Ye,Huaxiang Cai,Zetao Lv,Baolong Cui,Wenru Yan,Chao Zhan,Zigang Zhang,Hao Yi,Jie Xiang,Xiabing Li,Yuhang Gai,Ziyang Zhang,Pengfei Zheng,Yunfei Du
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Vector similarity search is a critical component of modern AI systems, but traditional CPU-based implementations face fundamental scalability bottlenecks for billion-scale corpora due to prohibitive computational overhead and memory bandwidth limitations. While Neural Processing Units (NPUs) offer orders-of-magnitude higher compute density, existing CPU/GPU-optimized 1-bit RaBitQ quantization implementations cannot be directly ported to NPU architectures due to fundamental hardware mismatches, and homogeneous design paradigms struggle to simultaneously balance accuracy, memory footprint, and performance. This paper presents Ascend-RaBitQ, the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search, built on the core insight that decoupling coarse ranking (NPU) from fine ranking (CPU) allows each stage to leverage its optimal hardware, breaking the long-standing accuracy-memory-performance trade-off. We propose a three-stage heterogeneous pipeline comprising AI Core-accelerated coarse ranking on 1-bit quantized vectors, on-device AI CPU Top-k processing, and host CPU fine re-ranking on full-precision vectors. We introduce four NPU architecture-native optimizations: fused AIC-AIV operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing that breaks query boundaries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask Top-k latency. Evaluation on standard datasets shows that Ascend-RaBitQ achieves 3.0× to 62.8× faster index construction than the CPU baseline, up to 4.6× throughput improvement over the fastest CPU IVF-RaBitQ implementation, and over 100× over the mathematically equivalent CPU baseline, while demonstrating encouraging scalability on distributed multi-NPU systems.
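The coarse-then-fine idea in this abstract can be illustrated with a minimal CPU-only sketch. It is not the authors' pipeline: plain sign bits and Hamming distance stand in for RaBitQ's quantizer and distance estimator, there is no IVF index or NPU, and all names (`sign_code`, `two_stage_search`, `n_candidates`) are hypothetical.

```python
import random

def sign_code(vec):
    """Pack the sign bits of a float vector into one int (1 bit per dimension)."""
    code = 0
    for i, x in enumerate(vec):
        if x >= 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def squared_l2(a, b):
    """Full-precision squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_stage_search(query, vectors, codes, n_candidates, k):
    # Stage 1 (coarse): rank all items using only the 1-bit codes.
    q_code = sign_code(query)
    order = sorted(range(len(vectors)), key=lambda i: hamming(q_code, codes[i]))
    shortlist = order[:n_candidates]
    # Stage 2 (fine): re-rank the shortlist with full-precision distances.
    shortlist.sort(key=lambda i: squared_l2(query, vectors[i]))
    return shortlist[:k]

# Synthetic data: 500 random 64-dimensional vectors (an assumption for the demo).
rng = random.Random(0)
dim = 64
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]
codes = [sign_code(v) for v in vectors]

# The query is a lightly perturbed copy of item 42.
query = [x + rng.gauss(0, 0.05) for x in vectors[42]]
top = two_stage_search(query, vectors, codes, n_candidates=50, k=5)
print(top)
```

The design point the sketch tries to convey is that the cheap 1-bit pass only has to keep the true neighbor inside the shortlist; the exact ordering is left to the full-precision pass.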
[IR-3] Generative Long-term User Interest Modeling for Click-Through Rate Prediction
Link: https://arxiv.org/abs/2605.15905
Authors: Jiangli Shao,Kaifu Zheng,Hao Fang,Huimu Ye,Zhiwei Liu,Bo Zhang,Shu Han,Xingxing Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modeling long-term user interests with massive historical user behaviors enhances click-through rate (CTR) prediction performance in advertising and recommendation systems. Typically, a two-stage framework is widely adopted, where a general search unit (GSU) first retrieves top-$k$ relevant behaviors towards the target item, and an exact search unit (ESU) generates interest features via tailored attention. However, current target-centered GSU would ignore other latent user interests, leading to incomplete and biased interest features. Additionally, the matching-based retrieval process in GSUs depends on the pairwise similarity score between target item and each historical behavior, which not only becomes time-consuming for online services as user behaviors continue to grow, but also overlooks the interaction information among user behaviors. To combat these problems, we propose a Generative Long-term user Interest model named GenLI for CTR prediction. GenLI consists of an interest generation module (IGM), a behavior retrieval module (BRM), and an interest fusion module (IFM). The IGM generates multiple interest distributions to indicate different aspects of real-time user interests, which is target-independent and incorporates interaction information among behaviors, ensuring complete and diverse interest features. The BRM selects related behaviors via a simple lookup operation, reducing the time complexity for weighting each behavior to $O(1)$. Finally, the IFM uses delicate gating mechanisms to generate interest features. Based on the generation process, GenLI improves the diversity of user interests and avoids complex matching-based behavioral retrieval, achieving a better balance between accuracy and efficiency for CTR prediction.
[IR-4] Fairness-Aware Retrieval Optimization for Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2605.15790
Authors: Yingqi Zhao,Vasilis Efthymiou,Jyrki Nummenmaa,Kostas Stefanidis
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) improves reliability of large language models by incorporating external knowledge, but the retrieval process can introduce bias that propagates to generated outputs. This issue is particularly challenging in top-k settings, where multiple documents jointly influence generation. We propose a fairness-aware retrieval framework that models and controls this bias. Our approach combines controlled bias injection via reranking, a position-aware model of bias propagation, and an optimization formulation that balances relevance and fairness. We further introduce a scalable solution based on Quadratic Fairness via Dual Hyperplane Approximation (FARO), which enables efficient optimization through problem decomposition. Experimental results show that our method effectively mitigates generation bias while preserving relevance. This work provides a principled approach for fairness-aware retrieval in RAG systems.
[IR-5] X-SYNTH: Beyond Retrieval – Enterprise Context Synthesis from Observed Human Attention
Link: https://arxiv.org/abs/2605.15505
Authors: Guruprasad Raghavan,George Nychis,Rohan Narayana Murthy
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 5 tables
Abstract:In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened [2, 52]. The prevailing approach [17, 31, 34, 36] retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual [5, 57, 61], present in behavioral patterns, absent from any retrieval index. For complex agentic tasks, retrieval breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in human attention, the digitally observable interaction signatures of each worker, encoding not just what they did but the sequence in which they did it, along with implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that do not, without external labeling. X-SYNTH models each individual’s behavioral baseline as a Digital Twin Signature (DTS) and selects among seven qualitatively distinct attention filters: Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. On a sales lead identification task, a frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and human attention is its most reliable ground truth.
[IR-6] Jobs AI Exposure Should Be Measured from Evidence Not Model Priors
Link: https://arxiv.org/abs/2605.15474
Authors: Luca Mouchel,Pierre Bouquet,Yossi Sheffi
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:This position paper argues that job exposure to AI should be measured with grounded, evidence-based methods, not inferred from LLM priors alone. Current theoretical exposure measures use zero-shot prompting to classify task-level AI exposure, generating labels with no explicit evidence, no transparent chain of reasoning, and no external validation. The stakes of these measurements are too high to rely on such methods, as they influence policy making, where public and private funds are directed, and how workers understand their future prospects. We therefore argue that AI capability claims should meet three standards: reproducibility, external grounding, and inspectability. We propose a retrieval-augmented framework that assigns AI exposure labels to all 18,796 occupation–task pairs in O*NET 30.2, using open-weight reasoning and instruct models with retrieved news articles and academic paper abstracts as evidence of current AI capabilities. Relative to a zero-shot baseline, the grounded condition is preferred in over 72% of disagreement cases under both automatic and human evaluation, and yields scores that align more closely with observed real-world AI usage. Taken together, these findings suggest that evidence-grounded measurement better captures what current AI systems can plausibly do in practice, rather than what a model asserts without external evidence. Because AI capabilities continue to change, the measurements used to inform policy must evolve with them: theoretical AI exposure scores should be periodically reassessed, not inherited as immutable ground truth.
[IR-7] Differentially Private Motif-Preserving Multi-modal Hashing
Link: https://arxiv.org/abs/2605.15460
Authors: Zehua Cheng,Wei Dai,Jiahao Sun
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 Pages
Abstract:Cross-modal hashing enables efficient retrieval by encoding images and text into compact binary codes. State-of-the-art methods rely on semantic similarity graphs derived from user interactions for supervision, yet these graphs encode sensitive behavioral patterns vulnerable to link reconstruction attacks. Existing privacy-preserving approaches fail on graph-structured data: Differentially Private SGD destroys relational motifs by treating samples independently, while graph synthesis methods suffer from unbounded local sensitivity in scale-free networks: hub nodes cause single-edge modifications to alter triangle counts by $\mathcal{O}(N)$, necessitating prohibitive noise injection. We term this phenomenon Hubness Explosion. We propose DMP-MH, a Sanitize-then-Distill framework that decouples privacy from representation learning. Our approach first bounds sensitivity by deterministically clipping node degrees, capping the $L_2$-sensitivity of triangle motifs independently of dataset size. A sanitized synthetic graph is then generated via Noisy Mirror Descent under $(\epsilon,\delta)$-Edge Differential Privacy. Finally, dual-stream hashing networks distill this topology using a holistic structural loss that enforces cross-modal alignment. Evaluated on MIRFlickr-25K and NUS-WIDE under a strict inductive protocol, DMP-MH outperforms private baselines by up to 11.4 mAP points while retaining up to 92.5% of non-private performance.
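The hub problem and the effect of a degree cap can be seen on a toy graph. The sketch below is an illustration only: the greedy, sorted-order clipping rule and the function names are assumptions, not the paper's procedure, and it shows only the elementary fact that in a graph with maximum degree D an edge lies in at most D - 1 triangles. It does not reproduce the paper's sensitivity analysis or any privacy mechanism.

```python
def clip_degrees(edges, max_degree):
    """Greedily keep edges (in sorted order) while both endpoints stay under the cap."""
    degree = {}
    kept = []
    for u, v in sorted({(min(a, b), max(a, b)) for a, b in edges if a != b}):
        if degree.get(u, 0) < max_degree and degree.get(v, 0) < max_degree:
            kept.append((u, v))
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
    return kept

def triangles_per_edge(edges):
    """Map each edge to the number of triangles it closes (common neighbors)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return {(u, v): len(neighbors[u] & neighbors[v]) for u, v in edges}

def count_triangles(edges):
    # Every triangle is counted once for each of its three edges.
    return sum(triangles_per_edge(edges).values()) // 3

# Two hubs (nodes 0 and 1) share ten leaf neighbors (nodes 2..11).
leaves = range(2, 12)
spokes = [(0, x) for x in leaves] + [(1, x) for x in leaves]
with_hub_edge = spokes + [(0, 1)]

# Adding the single edge (0, 1) creates one triangle per shared leaf.
print(count_triangles(spokes), count_triangles(with_hub_edge))

# After capping degrees at 3, no edge can close more than 2 triangles.
clipped = clip_degrees(with_hub_edge, max_degree=3)
print(max(triangles_per_edge(clipped).values()))
```

In the unclipped graph the jump in triangle count grows with the number of shared neighbors, which is the O(N) behavior the abstract describes; under the cap the per-edge contribution is bounded by the cap rather than by graph size.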
[IR-8] Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering
Link: https://arxiv.org/abs/2605.15362
Authors: Volodymyr Ovcharov
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 15 pages, 7 figures, 2 tables, 21 references
Abstract:Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (α = 1.57 ± 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 to 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
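For readers unfamiliar with how a power-law exponent and its uncertainty are typically estimated, here is a minimal sketch of the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)), with standard error (alpha - 1) / sqrt(n). The abstract does not state which fitting method the paper uses, so this is an assumption for illustration; degree data are discrete, and the function name, `x_min` choice, and synthetic sample are all hypothetical.

```python
import math
import random

def power_law_alpha(values, x_min):
    """Continuous MLE of a power-law exponent over the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha = 1.0 + n / log_sum
    std_err = (alpha - 1.0) / math.sqrt(n)
    return alpha, std_err

# Synthetic sample drawn from a power law with a known exponent via
# inverse-transform sampling: x = x_min * (1 - u) ** (-1 / (alpha - 1)).
rng = random.Random(0)
true_alpha, x_min = 1.57, 1.0
sample = [x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
          for _ in range(20000)]

alpha_hat, std_err = power_law_alpha(sample, x_min)
print(round(alpha_hat, 3), round(std_err, 4))
```

The standard error shrinks as 1/sqrt(n), which is why a very large graph can support a tight interval around the estimated exponent.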
[IR-9] Fortress: A Case Study in Stabilizing Search Recommendations via Temporal Data Augmentation and Feature Pruning
Link: https://arxiv.org/abs/2605.15299
Authors: Milind Pandurang Jagre,Jia Huang,Dayvid V. R. Oliveira,Zhinan Cheng,Babak Seyed Aghazadeh,Puja Das,Chris Alvino,Jinda Han,Kailash Thiyagarajan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:In search and recommendation systems, predictive models often suffer from temporal instability when certain input features introduce volatility in output scores. This instability can degrade model reliability and user experience, especially in multi-stage systems where consistent predictions are critical for downstream decision making. We introduce Fortress, a general framework for enhancing model stability and accuracy by identifying and pruning features that contribute to inconsistent prediction scores over time. Fortress leverages historical snapshots (temporally partitioned datasets capturing score fluctuations for the same entity across periods) and follows a four-step process: (1) collect historical snapshots, (2) identify samples with unstable predictions, (3) isolate and remove instability-inducing features, and (4) retrain models using only stable features. While semantic features from LLMs and BERT-based models improve generalization, they often lack full query or entity coverage. Engagement-based features offer strong predictive power but tend to introduce temporal instability. Fortress mitigates this trade-off by suppressing the volatility of engagement signals while retaining their predictive value, leading to more stable and accurate models. We validate Fortress on a query-to-app relevance model in a large-scale app marketplace. Offline experiments demonstrate notable improvements in prediction stability (measured by Coefficient of Variation) and classification performance (measured by PR-AUC).
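Step (2) of the process above, together with the Coefficient of Variation metric, can be sketched as follows. This is a hedged illustration rather than the authors' code: the snapshot layout (one dict of entity scores per period), the threshold value, and the function names are assumptions, and steps (3) and (4), feature isolation and retraining, are not shown.

```python
from statistics import mean, pstdev

def coefficient_of_variation(scores):
    """Standard deviation divided by the mean; higher means less stable."""
    m = mean(scores)
    return pstdev(scores) / m if m else float("inf")

def unstable_entities(snapshots, threshold):
    """Return entities whose score CV across snapshots exceeds the threshold.

    snapshots: list of {entity_id: score} dicts, one per time period.
    """
    shared = set.intersection(*(set(s) for s in snapshots))
    cv = {e: coefficient_of_variation([s[e] for s in snapshots]) for e in shared}
    return {e: c for e, c in cv.items() if c > threshold}

# Three hypothetical weekly snapshots of relevance scores for two apps.
snapshots = [
    {"app_a": 0.80, "app_b": 0.20},
    {"app_a": 0.82, "app_b": 0.70},
    {"app_a": 0.79, "app_b": 0.45},
]
flagged = unstable_entities(snapshots, threshold=0.1)
print(sorted(flagged))
```

Samples flagged this way would then be the starting point for asking which input features drive the fluctuation.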
[IR-10] An LLM-RAG Approach for Healthy Eating Index-Informed Personalized Food Recommendations
Link: https://arxiv.org/abs/2605.15213
Authors: Yibin Wang,Yanjie Yang,Grace Melo Guerrero,Rodolfo M. Nayga Jr.,Azlan Zahid
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diet quality is a leading determinant of chronic disease risk. Advances in artificial intelligence (AI) have enabled food recommendation systems to adapt suggestions to user preferences and health goals. However, most current systems rely on loosely curated food databases and provide limited connection to a validated index. In this study, we propose a Healthy Eating Index (HEI) informed retrieval-augmented generation (RAG) framework that combines standardized nutrition databases with large language models (LLMs) for personalized food recommendations. Our proposed method anchors retrieval in the National Health and Nutrition Examination Survey (NHANES) and the Food Patterns Equivalents Database (FPED). A food-level embedding space is constructed from FPED-derived textual descriptions. For each entity, the system computes baseline HEI scores, retrieves candidate foods for intake recommendations, and estimates the HEI impact of simple substitutions or additions. A constrained RAG pipeline instantiated with a pretrained OpenAI LLM generates personalized recommendations and sources based on nutrient profiles and HEI contributions. The simulation results showed a mean HEI improvement of 6.45, with the proportion of users with an HEI over 50 increasing from 45.12 to 61.26. Quantile analysis revealed consistent improvements across the HEI distribution. Our findings suggest that the proposed LLM-RAG-based AI systems can support more precise, explainable, and personalized nutrition guidance to improve diet quality.
[IR-11] DeepSlide: From Artifacts to Presentation Delivery
Link: https://arxiv.org/abs/2605.15202
Authors: Ming Yang,Zhiwei Zhang,Jiahang Li,Haoseng Liu,Yuzheng Cai,Weiguo Zheng
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 37 pages,10 figures,9 tables
Abstract:Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under-optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human-in-the-loop multi-agent system that supports preparing the full presentation process, from requirement elicitation and time-budgeted narrative planning, to evidence-grounded slide–script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical-chain planner with per-node time budgets, (ii) a lightweight content-tree retriever for grounding, (iii) Markov-style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual-scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide–script synergy with clearer attention guidance.
Human-Computer Interaction
[HC-0] Inside Baseball: The Automated Ball-Strike System as an Object Lesson in Technological Rule Enforcement
Link: https://arxiv.org/abs/2605.16237
Authors: Andrea Wen-Yi Wang,Waki Kamino,David Mimno,Karen Levy,Malte F. Jung
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Clearly defined rules are often assumed to be straightforward to automate and evaluate. We challenge this assumption through an in-depth study of Major League Baseball’s (MLB) seven-year experimentation with the Automated Ball-Strike System (ABS). ABS is envisioned to call balls and strikes accurately: a seemingly straightforward use of technology to objectively determine the distance between a pitch and the strike zone. Although the strike zone is an area clearly defined in the rulebook, it took MLB seven years to figure out how to automate calling balls and strikes with ABS, showing how even seemingly straightforward rules require a complex translation process to operationalize via technological systems. In this paper, we trace the design decisions that led to the current implementation of ABS. Our case study reveals that “distance” exists even between a clear rule and its technological implementation. Using analytic frameworks from Science and Technology Studies (STS), we show that such distance exists because (1) historically, the “ground truth” of the strike zone is contested: the rule in practice has always reflected a hybrid between the rulebook definition and umpires’ enforcement decisions; and (2) the use of ABS is embedded in an existing ecosystem, where the implementation of a technological enforcement system needs to balance multiple stakeholder values. This perspective challenges conventional evaluation paradigms that center on the distance between a formalized rule and its technological implementation, and instead calls for evaluating how such systems are experienced in practice. Addressing this question requires in-depth social science approaches, contributing to ongoing conversations in FAccT about the implementation and evaluation of sociotechnical systems.
[HC-1] Position: AI as Part of Self – Extending the Mind Requires Cognitive Co-Regulation
Link: https://arxiv.org/abs/2605.16197
Authors: Alina Gutoreva,Fendi Tsim,Trisevgeni Papakonstantinou
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:This position paper argues that safety and alignment cannot be achieved by constraining an external system: they must emerge from the co-regulatory design of the human–AI cognitive system as a whole (“AI as Part of Self”). Contemporary AI increasingly participates in attention allocation, reasoning, synthesis, and decision-making, shaping the very cognitive processes through which humans form beliefs, make decisions, and constitute their sense of self. Humans and AI occupy complementary epistemic roles under mutual constraint, forming a symbiotic cognitive unit whose co-regulation – not the external control of either party alone – is the proper locus of alignment. We identify the risks of unstructured delegation: deskilling, automation bias, transfer of epistemic authority, and oracle-style centralization of knowledge. Drawing on System 0 cognition theory, we further show that AI operates prior to conscious deliberation, shaping the pre-attentive infrastructures through which agency and trust are negotiated – a level that conventional oversight cannot reach. We conclude with design principles for cognitive co-regulation addressed to ML engineers and governance bodies. The goal of this work is to guide human cognition toward resilience and epistemic agency at the foundation of human selfhood.
[HC-2] VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
Link: https://arxiv.org/abs/2605.16079
Authors: Yiming Zhao,Yu Zeng,Wenxuan Huang,Zhen Fang,Qing Miao,Qisheng Su,Jiawei Zhao,Jiayin Cai,Lin Chen,Zehui Chen,Yukun Qi,Yao Hu,Xiaolong Jiang,Feng Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Project Page: this https URL
Abstract:Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model’s ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.
[HC-3] Driving Through the Network: Performance and Workload Under Latency and Video Impairment
Link: https://arxiv.org/abs/2605.15952
Authors: Ines Trautmannsheimer,Ahmed Azab,Frank Diermeyer
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: Preprint of VEHITS 2026 : 12th International Conference on Vehicle Technology and Intelligent Transport Systems
Abstract:Teleoperation promises to extend the operational envelope of automated vehicles, yet it critically depends on network latency and video quality. We report a fixed-base driving-simulator study (N=25) with a 2×2 manipulation of added latency (100/300 ms) and bitrate (500/2000 kbit/s), plus a best-case baseline (0 ms added, 9000 kbit/s). We measured effective glass-to-glass (G2G) latency per condition (baseline approx. 413 ms; effective totals approx. 500-700 ms) and verified stable framerate and encoder settings. Multimodal measures covered performance (speed, steering reversals, crashes), oculomotor behavior (blink rate, fixation duration), physiology (RR interval, heart rate, skin conductance), and subjective workload. Latency and bitrate each increased operator load and modestly affected performance. Physiological measures (heart rate, RR interval) exhibited sub-additive interactions, whereas performance and oculomotor interactions were small or non-significant. Equivalence tests showed that 300 ms with 2000 kbit/s was velocity-equivalent to best-case (SESOI ±2 km/h), while 300 ms with 500 kbit/s was not. We argue that latency and video quality should be treated as largely independent design levers, and that physiology-aware adaptation can anticipate overload before safety is compromised.
[HC-4] GEMS – Guided Evolutionary Molecule Design for Sustainable Chemicals
Link: https://arxiv.org/abs/2605.15932
Authors: Coelina Robinson,Franziska Weissbach,Kjell Jorner,Mennatallah El-Assady,Christina Humer
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Designing safe and sustainable chemicals is critical to combat chemical pollution in our environment. Machine learning (ML) methods have been developed to aid with de novo molecule design. However, data on the environmental impacts of chemical compounds are sparse, resulting in low-fidelity ML oracles and unreliable candidate proposals. Furthermore, generative ML models rely on numerical scoring functions that cannot fully capture the nuanced chemical intuition of expert scientists required for real-world molecular design. We present GEMS, an interactive visual analytics tool that enables domain experts to directly collaborate with a genetic algorithm for molecule design. Users can integrate their expert knowledge to guide the evolutionary process by modifying the scoring function and molecule population without programming knowledge or ML developer support. A usage scenario demonstrates the system’s application in designing sustainable antioxidant alternatives. In an interview session with domain scientists, we collected feedback on the usefulness of GEMS.
[HC-5] Synchronized Realities: Towards Magic Mobile Experiences through Aligned AR
Link: https://arxiv.org/abs/2605.15924
Authors: Jan Henry Belz
Subjects: Human-Computer Interaction (cs.HC)
Comments: Part of the CHI 2026 Workshop on Next Steps for Augmented Reality On-the-Move: Challenges and Opportunities
Abstract:In virtual reality environments, the alignment of perceptual modalities is crucial for immersion and presence. In the AR domain, it is difficult to create such alignments because elements in the physical world are often beyond the user’s control. However, recent advances in generative AI enable on-demand content creation, enabling highly reactive AR experiences. Combined with contextual information about the physical world, it has become possible to design experiences that seamlessly align with the user’s environment. In this reflection paper, I emphasize the importance of “synchronized” realities for context-aware AR experiences, particularly in mobility scenarios. I present several examples of existing synchronized experiences and examine their commonalities and distinctions. Finally, I discuss opportunities and pitfalls of synchronizing AR experiences with the physical world.
[HC-6] SLIP ETHICS: Graduated Intervention for AI Emotional Companions
Link: https://arxiv.org/abs/2605.15915
Authors: Minseo Kim
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to PervasiveHealth 2026. 11 pages, 2 figures, 4 tables. Proc. of the 20th EAI International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth 2026)
Abstract:AI emotional companions face a safety-rapport paradox: restrictive safeguards can damage supportive alliance, while permissive systems risk user harm. We present SLIP (Staged Layers of Intervention Protocol), a four-stage graduated methodology deriving interventions (none, soft, hard) from structured qualitative indicators – affect intensity (a) and narrative dynamism (m) – alongside ETHICS (Emergent Taxonomy for Human-AI Interaction Context Signals), a “signals not labels” taxonomy. An evaluation combining a small-scale production deployment (N=68 entries, 10 users, 10 weeks) with a synthetic persona battery (N=91, 5 behavioral-risk profiles) achieved 0% false positives for the flow persona and showed expected escalation patterns in crisis-oriented personas. However, initial results showed that 8 consecutive days of high-energy elevation produced zero interventions (0/8), exposing a boundary where the “do not pathologize” principle conflicts with safety. A subsequent three-model stress test demonstrated that increased model capability improves detection from 0/8 to 6/8 while preserving 0/10 flow false positives in the largest model. Read as preliminary, these findings position graduated intervention as a design direction for navigating – not resolving – the safety-rapport tension in affective computing.
[HC-7] Designing for Robot Wranglers: A Synthesis of Literature and Practice
Link: https://arxiv.org/abs/2605.15892
Authors: David Porfirio,Ian McDermott,Hsin-Mei Chen,Satoru Satake,Takayuki Kanda,Thomas D. LaToza
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: Accepted for publication in the Proceedings of ACM Designing Interactive Systems (2026)
Abstract:Robots are increasingly present in human spaces, such as for conducting deliveries in hospitals, interacting with visitors at museums, and stocking items in warehouses. To ensure the seamless integration of robots into these spaces, a new role in human-robot interaction is emerging: the robot wrangler, an individual who is responsible for setting up, overseeing, and troubleshooting the robot. To understand the needs of this stakeholder, we conducted a scoping review that uncovered a typology of robot wrangling across the research literature, and discovered that wrangling is an umbrella term that collapses a highly complex and heterogeneous space of activities, often rendering this labor difficult to characterize and support. To further clarify and understand robot wrangling, we then reflected on our own firsthand and imagined experiences as robot wranglers within our own respective domains. Guided by the scoping review and our reflections, we devise a series of design implications for supporting wranglers directly as individuals and as members of a wider service ecology.
[HC-8] Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education
链接: https://arxiv.org/abs/2605.15850
作者: Janne Rotter,Pau Benazet i Montobbio,Davinia Hernández-Leo
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:In recent years, generative AI (GenAI) in educational settings has become ubiquitous in students’ daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used unrestrictedly. While most prior research has thus focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 participants, we compared the agent’s effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, all without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, the timing of GenAI access is therefore a tractable, theoretically grounded, and scalable pedagogical paradigm that improves on both completely unrestricted and fully withheld access, is compatible with off-the-shelf tools, and potentially has a low adoption barrier. This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.
[HC-9] Conversations in Space: Structuring Non-Linear LLM Interactions on a Canvas
链接: https://arxiv.org/abs/2605.15848
作者: Rifat Mehreen Amin,Alperen Adatepe,Daniela Fernandes,Daniel Buschek,Andreas Butz
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
Abstract:Conversational interfaces powered by large language models (LLMs) are widely used for ideation and analysis, yet their linear structure limits exploration of alternatives and management of long-running interactions. We present CanvasConvo, a conversational interface concept that transforms linear chat into a branching conversation tree embedded in a spatial canvas. CanvasConvo enables users to explore what-if scenarios by branching directly from conversational content, supporting parallel development of alternative directions. These branches are visualized on a canvas while remaining integrated with a familiar chat interface, allowing users to switch between linear and non-linear interaction. Features such as timeline-based navigation, automatic tagging and summarization, and context-aware controls (e.g., goals, reusable prompts) support structured interaction and continuity. We evaluated CanvasConvo in a 5-7 day field study with 24 participants. Our findings highlight how non-linear conversational structures support exploratory workflows and different interactions in LLM-based work.
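The branching conversation tree described above can be pictured with a small data structure. The following is a minimal sketch only: the `Turn` class, its fields, and the method names are hypothetical illustrations and are not taken from CanvasConvo itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Turn:
    """One message in a branching conversation (hypothetical structure)."""
    text: str
    parent: Optional["Turn"] = None
    children: List["Turn"] = field(default_factory=list)

    def branch(self, text: str) -> "Turn":
        """Attach a child turn; calling this twice on one turn creates a fork."""
        child = Turn(text=text, parent=self)
        self.children.append(child)
        return child

    def linear_history(self) -> List[str]:
        """Walk back to the root to recover the linear chat view of this branch."""
        path: List[str] = []
        node: Optional[Turn] = self
        while node is not None:
            path.append(node.text)
            node = node.parent
        return list(reversed(path))


root = Turn("Plan a product launch")
plan_a = root.branch("What if we launch in spring?")
plan_b = root.branch("What if we launch in autumn?")
follow_up = plan_a.branch("Draft a spring timeline")
```

Storing a parent pointer on every turn is one way to let an interface switch between the canvas view (the whole tree) and the familiar chat view (a single root-to-leaf path).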
[HC-10] Which Moments Matter? Heuristics of Remembered Travel Experience in Public Transport
链接: https://arxiv.org/abs/2605.15817
作者: Esther Bosch,Klas Ihme,Stefan Bohmann
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Understanding how travelers form overall evaluations of public transport journeys is critical for improving travel satisfaction and encouraging sustainable mode choice. While travel satisfaction is discussed to influence attitudes and future behavior, the cognitive rules by which moment-to-moment experiences are aggregated into retrospective evaluations remain poorly understood in transport research. Drawing on psychological theories of experienced and remembered utility, this study investigates which temporal aggregation heuristics best predict post-trip travel satisfaction. Using a smartphone-based experience sampling approach, we collected high-frequency on-trip experience ratings and post-trip evaluations for 2576 real-world public transport trips across three German cities. Travel experience was assessed every five minutes during trips using a multi-item scale, allowing direct comparison of competing aggregation rules, including mean experience, peak-end, minimum-end, final moment, and trip duration. Multilevel regression models were estimated to evaluate the explanatory power of each heuristic. Results show that retrospective travel satisfaction is best predicted by a Minimum-End heuristic, combining the most negative moment of the journey and the final experience. Models based on mean experience, peak-end rules, final moment alone, or trip duration performed substantially worse. This pattern indicates that both negative extremes and the final phase of a journey independently contribute to remembered evaluations, rather than overall satisfaction reflecting an average of momentary experiences. The results have important implications for theory and practice, suggesting that targeted interventions at critical negative moments and at trip endings may yield substantial improvements in remembered satisfaction and, ultimately, support shifts toward sustainable mobility. 
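To make the competing aggregation rules concrete, here is a minimal sketch that summarizes one trip's momentary ratings under each heuristic. It assumes higher ratings mean a better experience and that each two-part heuristic is the simple average of its two components; the study itself estimates multilevel regression models, so this equal weighting is an illustrative assumption rather than the authors' specification.

```python
from statistics import mean
from typing import Dict, Sequence


def aggregation_heuristics(ratings: Sequence[float]) -> Dict[str, float]:
    """Candidate summaries of one trip's momentary ratings (higher = better)."""
    if not ratings:
        raise ValueError("need at least one rating")
    end = ratings[-1]
    return {
        "mean": mean(ratings),
        # Equal weighting of the two components is an assumption.
        "peak_end": (max(ratings) + end) / 2,
        "minimum_end": (min(ratings) + end) / 2,
        "final_moment": end,
    }


# A trip with one bad moment (2) in the middle.
summary = aggregation_heuristics([4, 2, 5, 3])
```

For this example trip the mean is 3.5, while the Minimum-End summary is 2.5, showing how a single negative moment pulls the remembered evaluation below the average.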
[HC-11] Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling
链接: https://arxiv.org/abs/2605.15812
作者: Feier Qin,Xiao Li,Yi Zheng,Haibin Huang,Hanyao Wang,Xiaoyu Wang,Yan Lu,Yuan Zhang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 21 pages, published in CHI '26
Abstract:Recent advances in foundation models have enabled conversational agents that aim for sustained companionship rather than mere task completion. Yet most still remain unable to support natural, long-term companion-like interactions, resulting in experiences that feel episodic and inauthentic. We argue that current agents overlook cross-temporal modeling of agents’ social behaviors and internal emotions: generated behaviors rarely influence an agent’s emotional state, and emotional states seldom shape subsequent behaviors. We present Cross-Temporal Emotion Modeling (CTEM), a framework that links long-term behavioral history to moment-to-moment emotional expression. CTEM establishes a closed loop where past experiences update an evolving emotional state; this state conditions immediate interactions; and user feedback continually revises both memory and emotional state, enabling reflection and anticipation. We instantiate CTEM as Auri, a companion agent on an instant-messaging platform, and report a 21-day in-the-wild study in which CTEM shows improvements in perceived naturalness, coherence, and emotional harmony.
[HC-12] Separating Acute Psychological Stress from Physical Exertion in Biometric Signals
链接: https://arxiv.org/abs/2605.15756
作者: Esther Bosch
类目: Human-Computer Interaction (cs.HC); Applications (stat.AP)
备注:
Abstract:Acute psychological stress occurs in a wide range of everyday contexts, including transportation, occupational settings, and physical activity, where its reliable detection could enable adaptive system responses and support human well-being. A persistent challenge in automated stress recognition is disentangling the biometric signatures of acute psychological stress from those of concurrent physical exertion. This study examined how five physiological signals (tonic electrodermal activity, trapezius electromyography, heart rate, heart rate variability, and respiration rate) respond to cognitive stress and physical activity, independently and in combination. Nineteen participants completed a 2x3 within-subjects design in which acute psychological stress was induced via an n-back arithmetic task combined with social pressure and financial reward, across three activity conditions: idle sitting, walking, and stationary cycling. Multilevel linear mixed models and repeated-measures ANOVA were used to decompose main effects and interactions for each sensor. Tonic electrodermal activity showed a robust, additive response to both cognitive stress (r=0.48) and physical exertion (r=0.67), with no interaction, making it the most promising candidate for stress detection during physical activity. Heart rate and trapezius electromyography were driven almost exclusively by physical exertion, with no reliable sensitivity to the stress task. RMSSD was strongly suppressed by physical activity and showed only marginal sensitivity to cognitive load. Respiration rate was dominated by physical activity, with no reliable stress effect in the primary analysis. These findings provide a sensor-specific hierarchy for real-world stress detection and highlight tonic electrodermal activity as the most informative channel when cognitive stress must be identified in physically active populations.
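The abstract reports RMSSD without defining it. RMSSD is the standard time-domain heart rate variability measure: the root mean square of successive differences between adjacent RR intervals. A minimal sketch follows; the function name and the example intervals are illustrative and not drawn from the study's data.

```python
import math
from typing import Sequence


def rmssd(rr_intervals_ms: Sequence[float]) -> float:
    """Root mean square of successive differences between RR intervals (ms)."""
    if len(rr_intervals_ms) < 2:
        raise ValueError("need at least two RR intervals")
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Successive differences are 10, -20, 10 ms, so RMSSD = sqrt(600 / 3).
example = rmssd([800, 810, 790, 800])
```

Because the measure depends only on beat-to-beat changes, a perfectly regular rhythm yields an RMSSD of zero regardless of the heart rate itself.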
[HC-13] Handwriting decoding as a challenging motor task for EEG Foundation Models
链接: https://arxiv.org/abs/2605.15698
作者: Srinivas Ravishankar,Ishayu Ghosh,Nora Zajzon,Teng Fei,Virginia de Sa
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Recent attempts at creating Foundation Models (FMs) for Electroencephalography (EEG) have achieved state-of-the-art performance on multiple tasks including Motor Imagery (MI). These MI tasks have typically involved coarse classification between imagined limb movements. However, the development of foundation models necessitates diverse datasets, both for pretraining and evaluating the progress of these models. In this work, we propose handwriting decoding as a challenging motor task for FMs. We show that several existing datasets are potentially confounded, and introduce a dataset that more rigorously evaluates models. On this dataset, we find that current FMs, despite showing SOTA performance on multiple MI datasets, are outperformed by smaller task-specific models. We also highlight challenges specific to EEG-based handwriting decoding to inform future work. In our 4-letter classification task, we show that (a) Knowledge of movement-onset is crucial to reported decoding performance in prior works, with average performance across subjects dropping from 41.3% to 32.4%. (b) Increasing test-time signal quality provides significant performance improvements (45% to 78% in our best subject) compared to scaling training data with single-trial EEG. (c) While scaling training data steadily improves decoding performance, existing FMs do not outperform specialist models in handwriting decoding. We make our code available at this https URL
[HC-14] TopoClaw: A Human-Centric and Topology-Aware Agent Operating System
链接: https://arxiv.org/abs/2605.15556
作者: Heyuan Huang,Yeyi Guan,Jihong Wang,Mingzhi Wang,Jiamu Zhou,Xiangmou Qu,Jiaxin Yin,Xin Liao,Xingyu Lou,Jun Wang
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Large language models (LLMs) have evolved AI assistants into autonomous reasoning engines that maintain context, invoke tools, and pursue long-horizon tasks. This has spurred Agent Operating Systems (Agent OS) as kernel-like layers for lifecycle management, memory, scheduling, and access control. Yet most designs remain agent-centric, treating the OS as a single-host runtime for internal reasoning and tool use, leaving open how autonomous actions integrate with distributed, collaborative, permission-sensitive workflows. TopoClaw is an open-source, human-centric, topology-aware Agent OS modeling the user’s ecosystem as two coupled structures: a physical device topology of heterogeneous surfaces and a social relationship topology of shared spaces, teams, and delegated roles. It unifies device operation, messaging, and skills around accountable cross-boundary execution, with three core contributions: (1) cross-device action placement, decoupling intent from actuation and routing distributed actions across the device cluster based on hardware affordances and user context; (2) cross-user identity attribution, treating agents as socially situated “Digital Twins” that coordinate in multi-user spaces while preserving provenance, role-aware permissions, and human accountability; (3) cross-context authority governance, pairing broad capability with distributed, context-aware policy enforcement across physical and social trust boundaries to bound proactive autonomy at the OS layer. This report presents TopoClaw as an engineering-oriented reference architecture, covering its design principles, runtime, cross-device execution, collaboration mechanisms, security model, and deployment outlook.
[HC-15] GreenZ: A Sustainable UX Framework for Complex Digital Systems
链接: https://arxiv.org/abs/2605.15468
作者: Trisha Solanki
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 8 pages, 1 figure, 4 tables. Framework preprint. Expert review study underway. v1
Abstract:Digital systems have become simultaneously more powerful and more wasteful. Features accumulate that nobody uses. Data is collected that nobody analyzes. AI is deployed at significant energy and water costs for gains that a simpler approach could have achieved. And through all of it, the people who depend on these systems quietly absorb the consequences in cognitive load, lost time, and eroded trust. This paper introduces GreenZ, a three-layer Sustainable UX Framework for complex digital systems. Its three layers are a Philosophy Layer built around ten published principles, an Operational Frameworks Layer comprising five applied systems, and a Tools and Canvases Layer of practical audit instruments and decision models. Two contributions sit at the framework’s core: a Digital Waste Taxonomy classifying eight distinct waste types, and an AI Sufficiency Decision Model that asks whether AI should exist in a given flow before any question of how to implement it. GreenZ v1 is theoretically grounded but empirically unvalidated. A practitioner expert review study is underway at the time of submission. The paper presents the framework’s architecture, its conceptual foundations, its position relative to existing literature, and an honest account of what remains to be established.
[HC-16] Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift
链接: https://arxiv.org/abs/2605.15455
作者: Sheer Karny,Anthony Baez,Pat Pataranutaporn
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Chatbot behavior is often opaque to users, as responses can shift unpredictably across a conversation, drifting toward sycophancy, toxicity, or other unsafe responses. This can leave users vulnerable, either being misled by overly agreeable AI or manipulated by a harmful chatbot that no longer behaves as intended. To address this, we introduce multi-turn neural transparency, an interface that surfaces an LLM’s internal neural activations in real time to help users anticipate and recognize how behaviors change across turns. We construct behavioral vectors for six personality traits using methods from mechanistic interpretability, identifying directions in activation space that correlate with trait expression (R² ≥ 0.9) via contrastive system prompts, and visualize trait expression using a sunburst and drift panel that updates at each turn. In a randomized controlled study (N = 246), participants predicted trait expression from a system prompt alone, then rated observed behavior after interacting with the chatbot for both assistant and role-play personas. We find that participants without visualization struggled to accurately evaluate traits (RMSE ≈ 0.6-0.7), while the inclusion of neural transparency significantly improved both anticipation and evaluation compared to no visualization (d = -0.34 to -0.49). The multi-turn dynamic visualization additionally outperformed the static single-turn visualization on holistic evaluation of model behavior (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy. These findings suggest that surfacing internal model representations to everyday users is a meaningful step toward more transparent and informed human-AI interaction.
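One common way to obtain a behavioral direction from contrastive prompts is a difference of mean activations, followed by projecting new activations onto that direction. The sketch below illustrates that general idea only. The abstract does not specify the exact construction, so the difference-of-means choice, the function names, and the toy two-dimensional activations are all assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a set of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trait_direction(high: Sequence[Vector], low: Sequence[Vector]) -> List[float]:
    """Unit vector pointing from the mean low-trait to the mean high-trait activation."""
    diff = [h - l for h, l in zip(mean_vector(high), mean_vector(low))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def trait_score(activation: Vector, direction: Vector) -> float:
    """Scalar trait expression: projection of an activation onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations collected under "high trait" vs "low trait" system prompts.
direction = trait_direction(high=[[2.0, 1.0], [4.0, 1.0]], low=[[0.0, 1.0], [0.0, 1.0]])
score = trait_score([3.0, 5.0], direction)
```

A per-turn score of this kind is the sort of quantity a drift panel could plot as the conversation proceeds.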
[HC-17] Eskwai for Students: Generative AI Assistant for Legal Education in Ghana
链接: https://arxiv.org/abs/2605.15380
作者: George Boateng,Philemon Badu,Patrick Agyeman-Budu,Samuel Ansah,Evans Atompoya,Evan Igwilo,Lord Baah,Frederick Abu-Bonsrah,Victor Wumbor-Apin Kumbol
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 10 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Recent advances in generative AI have shown their potential to be leveraged for legal education. Yet, work on the development and deployment of such systems for legal education in the Global South is limited. In this work, we developed Eskwai for Students, a generative AI assistant to help law students with their legal education. Eskwai for Students is a retrieval augmented generation (RAG) system that provides answers to a wide range of legal questions for law students grounded in a curated database of over 12K case laws and 1.4K legislation in Ghana. We deployed Eskwai for Students in a longitudinal study of 30 months (2.5 years) used by 3.1K law students in Ghana who made 32K queries. We evaluated the helpfulness of our AI, and provided insight into the kinds of queries law students submit to this generative AI tool, which raises some ethical concerns. This work contributes to an understanding of how law students in the Global South are using generative AI for their studies and the ways it could be leveraged responsibly to advance legal education.
[HC-18] Overreliance in Writing Tasks: Exploring Similarity-Based Measures of AI Influence on Writing and Proposing a Reflective Writing Interface Intervention
链接: https://arxiv.org/abs/2605.15322
作者: Vitor H. A. Welzel,Nicholas Vincent
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to ACM FAccT 2026; 18 pages, 2 figures
Abstract:As generative AI (GenAI) systems become increasingly proficient at simulating human-like and well-reasoned text, users may attribute authority to AI outputs, shaping how they engage with writing and reasoning tasks. While prior work has raised concerns about AI overreliance, empirical approaches for observing this phenomenon during open-ended writing remain limited. In this paper, we examine how GenAI assistance influences users’ interactions with AI suggestions during writing. We report results from a mixed-methods study in which 47 participants completed analysis and synthesis writing tasks with or without AI assistance. We quantify the textual overlap between AI suggestions and participants’ writing and analyze participants’ reflections. Our results show that AI assistance is associated with patterns of suggestion reuse. Building on these findings, we design and evaluate an interactive writing interface that may support reflection on the usage of the AI suggestions during writing. Evidence from a small follow-up think-aloud study (n = 4) suggests that the interface can increase users’ awareness of how AI outputs are incorporated into their writing and may support more conscious engagement with AI assistance. Together, our findings contribute empirical methods for studying AI adoption in writing contexts and demonstrate how interface design can shape user-AI interaction.
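Textual overlap between an AI suggestion and a participant's text can be quantified in several ways. The sketch below uses word n-gram reuse as one simple similarity measure. The abstract does not name the exact metric, so the trigram choice, the whitespace tokenization, and the function names are assumptions for illustration.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def suggestion_reuse(suggestion: str, essay: str, n: int = 3) -> float:
    """Fraction of the suggestion's n-grams that reappear in the essay."""
    source = word_ngrams(suggestion, n)
    if not source:
        return 0.0
    return len(source & word_ngrams(essay, n)) / len(source)


# Two of the suggestion's three trigrams reappear in the essay.
reuse = suggestion_reuse(
    "the quick brown fox jumps",
    "I saw the quick brown fox today",
)
```

Normalizing by the suggestion's n-grams, rather than the essay's, keeps the score interpretable as "how much of what the AI offered was kept", independent of essay length.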
计算机视觉
[CV-0] IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation
链接: https://arxiv.org/abs/2605.16258
作者: Yuqi Wu,Tianyu Hu,Wenzhao Zheng,Yuanhui Huang,Haowen Sun,Jie Zhou,Jiwen Lu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Code: this https URL
Abstract:Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
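To illustrate how a signed distance function that can be queried at arbitrary 3D points yields depth along a camera ray, here is a minimal sphere-tracing sketch. An analytic sphere stands in for the learned SDF decoder, and sphere tracing is a generic technique chosen for illustration; the abstract does not state which rendering procedure IVGT uses.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def sphere_sdf(p: Sequence[float], center: Point = (0.0, 0.0, 0.0), radius: float = 1.0) -> float:
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(p, center) - radius


def sphere_trace(
    origin: Point,
    direction: Point,  # assumed to be unit length
    sdf: Callable[[Sequence[float]], float],
    max_steps: int = 64,
    eps: float = 1e-4,
    max_depth: float = 100.0,
) -> Optional[float]:
    """March along the ray by the queried distance; return hit depth or None."""
    depth = 0.0
    for _ in range(max_steps):
        point = tuple(o + depth * d for o, d in zip(origin, direction))
        distance = sdf(point)
        if distance < eps:
            return depth
        depth += distance
        if depth > max_depth:
            break
    return None


# A ray from z = -3 toward the unit sphere hits its surface at depth 2.
hit_depth = sphere_trace((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
```

Repeating this per pixel gives a depth map, and the gradient of the SDF at each hit point gives the surface normal.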
[CV-1] Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation
链接: https://arxiv.org/abs/2605.16241
作者: Jin Shi,Brady Zhang,Yishun Lu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different π0.5-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
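The reported efficiency figures can be cross-checked with simple arithmetic. The sketch below assumes the teacher has a nominal 7 billion parameters, as suggested by the "7B" in its name; the exact count is not given in the abstract. The implied teacher control rate is derived from the stated student rate and speedup, not reported directly.

```python
teacher_params = 7.0e9   # nominal size implied by "7B" (assumption)
student_params = 158e6   # 158M-parameter student, as stated
size_reduction = teacher_params / student_params

student_hz = 12.5        # student control rate on an RTX 4090, as stated
speedup = 3.28           # reported inference speedup over the teacher
implied_teacher_hz = student_hz / speedup
student_latency_ms = 1000 / student_hz

print(round(size_reduction, 1), round(implied_teacher_hz, 2), student_latency_ms)
```

Under these assumptions the ratio comes out at roughly 44, consistent with the stated reduction, and the student spends 80 ms per control step.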
[CV-2] Evaluating Design Video Generation: Metrics for Compositional Fidelity
链接: https://arxiv.org/abs/2605.16223
作者: Adrienne Deganutti,Dingning Cao,Jaejung Seol,Elad Hirsch,Purvanshi Mehta
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field.
[CV-3] MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models
链接: https://arxiv.org/abs/2605.16179
作者: Piyush Tiwary,Utkarsh Ahuja,Depanshu Sani,Aishwarya Jayagopal,Sagar Gubbi,Subhashini Venugopalan,Alok Talekar,Vaibhav Rajan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.
[CV-4] Res2CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment
链接: https://arxiv.org/abs/2605.16171
作者: Xinyue Liu,Jianyuan Wang,Biao Leng,Shuo Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP’s inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address these, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, Res2CLIP, the first residual-to-residual alignment framework that symmetrically bridges visual and text modalities within CLIP’s residual space, is designed. The framework is developed from a residual perspective into three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimizations are constrained within the residual domain, and the residual alignment optimization objectives are designed to force the model to focus on relative anomaly deviations rather than optimizing class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at this https URL.
[CV-5] Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
链接: https://arxiv.org/abs/2605.16165
作者: Yishun Lu,Wes Armour
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to 1.4× and accelerates wall-clock training by up to 1.5×, offering a robust optimizer for scaling multimodal foundation models.
[CV-6] Registers Matter for Pixel-Space Diffusion Transformers
链接: https://arxiv.org/abs/2605.16147
作者: Nikita Starodubcev,Ilia Sudakov,Ilya Drobyshevskiy,Artem Babenko,Dmitry Baranchuk
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.
[CV-7] STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System ICML2026
链接: https://arxiv.org/abs/2605.16137
作者: Zhen Luo,Yixuan Yang,Xudong Xu,Jinkun Hao,Zhaoyang Lyu,Feng Zheng,Jiangmiao Pang,Yanwei Fu
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICML 2026
Abstract:Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding object collisions or floating due to LLMs’ inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual-system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.
[CV-8] WeatherOcc3D: VLM-Assisted Adverse Weather Aware 3D Semantic Occupancy Prediction
链接: https://arxiv.org/abs/2605.16127
作者: A. Enes Doruk,Abdelaziz Hussein,Hasan F. Ates
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While multi-modal 3D semantic occupancy prediction typically enhances robustness by fusing camera and LiDAR inputs, its effectiveness is fundamentally constrained by environmental variability. Specifically, camera sensors suffer from severe low-light degradation, while LiDAR sensors encounter significant backscatter noise during heavy precipitation. These adverse conditions create a modality trust problem, as static fusion strategies fail to adaptively re-weight inputs when a specific sensor becomes unreliable. To address this, we propose a VLM-assisted framework leveraging the pre-trained CLIP latent space to guide multi-sensor integration via linguistic environmental cues. We utilize a parameter-efficient adapter to align weather-specific text embeddings with sensor features, coupled with a gating strategy that decomposes environmental uncertainty into two factors: visibility and illumination. This enables the model to dynamically modulate the fusion ratio, prioritizing semantic camera features in clear daylight and shifting to geometric LiDAR priors during rainy nights. Evaluations on the nuScenes dataset demonstrate the versatility of our approach: implementing our proposed framework on the OccMamba and M-CONet architectures achieves mIoU scores of 26.3 and 21.1, respectively, significantly outperforming their traditional baselines.
[CV-9] GenShield: Unified Detection and Artifact Correction for AI-Generated Images
链接: https://arxiv.org/abs/2605.16122
作者: Zhipei Xu,Xuanyu Zhang,Youmin Xu,Qing Huang,Shen Chen,Taiping Yao,Shouhong Ding,Jian Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite the substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore realistic appearance remains largely underexplored. Moreover, few existing works have established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step “diagnose-then-repair” correction with an explicit stopping criterion. A high-quality dataset with large-scale “artifact-restored” pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method. The code is available at this https URL.
[CV-10] A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
链接: https://arxiv.org/abs/2605.16090
作者: Hao Yang,Zhuo Ma,Yang Liu,Yilong Yang,Guancheng Wang,JianFeng Ma
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model’s interpretation of that single input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack, CrossMPI, which can steer the model’s interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only 10^5 parameters) to the model hidden state space (for multimodal information integration and with 10^7 parameters). Then, two strategies are adopted to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, contrary to prior experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than at the end. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that allocates budgets decrementally as the pixel distance to semantic-critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.
[CV-11] MIND: Decoupling Model-Induced Label Noise via Latent Manifold Disentanglement ICML2026
链接: https://arxiv.org/abs/2605.16081
作者: Dayong Ren
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted, to appear in ICML2026
Abstract:The paradigm of learning from automatic annotations driven by pre-trained experts and Foundation Models dominates data-hungry applications. However, it introduces a critical challenge: model-induced label noise. Unlike stochastic noise in classical robust learning, this noise stems from annotator inductive biases, manifesting as systematic errors tightly coupled with local feature manifolds. Existing methods relying on global transition matrices underfit these structural patterns, while learning instance-specific matrices remains mathematically intractable. We propose Model-Induced Noise Decoupling (MIND), a theoretically grounded framework addressing this dilemma. We demonstrate that the high-dimensional noise manifold can be decoupled into tractable, subspace-dependent components via Latent Manifold Disentanglement. Specifically, our Latent Decoupling Estimator (LDE) dynamically projects samples into latent structural clusters with consistent error modes, facilitating noise identifiability without ground-truth anchor points. To rigorously evaluate robustness, we adopt a hierarchical protocol: moving from controlled noise on CIFAR-100 to a structural stress test on large-scale real-world 3D datasets (S3DIS, ScanNet), where error patterns explicitly couple with geometric manifolds. Empirically, MIND significantly outperforms state-of-the-art methods on these complex benchmarks and effectively corrects zero-shot hallucinations from Vision-Language Models (e.g., OpenSeg), highlighting its potential as a robust distillation framework for Foundation Models.
[CV-12] ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation CVPR2026
链接: https://arxiv.org/abs/2605.16080
作者: Qing Huang,Zhipei Xu,Xuanyu Zhang,Xiangyu Yu,Jian Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, treating them as a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning-text representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
[CV-13] AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification
链接: https://arxiv.org/abs/2605.16076
作者: Salma Hoque Talukdar Koli,Fahima Haque Talukder Jely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Plant disease detection is still largely manual in Bangladesh, where extension workers eyeball leaf samples across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images across 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models hit 96–97% on the held-out test set, but averaging their softmax outputs pushes the ensemble to 99.23% – a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato classify perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization – work we have not yet completed.
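The ensembling step described in this abstract, averaging the softmax outputs of the individual backbones, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `average_ensemble` and the toy probability vectors are assumptions, and equal weights are used because the abstract reports that biasing the average toward the best validation model backfired.

```python
def average_ensemble(prob_vectors):
    """Average per-model class probabilities; return (mean_probs, predicted_class)."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    mean_probs = [
        sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return mean_probs, predicted


# Toy softmax outputs for one image from three hypothetical models (3 classes).
resnet = [0.50, 0.45, 0.05]
effnet = [0.30, 0.60, 0.10]
densenet = [0.35, 0.55, 0.10]

mean_probs, predicted = average_ensemble([resnet, effnet, densenet])
print(mean_probs, predicted)
```

In this toy case the first model alone would pick class 0, while the averaged distribution picks class 1, which is the kind of disagreement that averaging is meant to smooth out.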
[CV-14] Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting
链接: https://arxiv.org/abs/2605.16065
作者: Raushan Joshi,Jean-Yves Guillemaut
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE International Conference on Image Processing 2026, 6 pages
Abstract:3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that leverages the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in Virtual Reality (VR) and robotics, advancing 3D scene editing.
[CV-15] EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting MICCAI2026
链接: https://arxiv.org/abs/2605.16022
作者: Changjing Liu,Yiming Huang,Long Bai,Beilei Cui,Hongliang Ren
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early Accepted by MICCAI 2026
Abstract:In robot-assisted minimally invasive surgery, high-fidelity dynamic endoscopic scene reconstruction and simulation are crucial to enhancing downstream tasks and advancing surgical outcomes. However, existing methods primarily focus on visual reconstruction, lacking physics-based descriptions of the scene required for realistic simulation. We propose a unified framework that achieves physics-aware reconstruction and physical simulation of endoscopic scenes through Multi-modal Large Language Models (MLLMs)-guided Gaussian Splatting. Our approach utilizes 4D Gaussian Splatting (4DGS) integrated with pre-trained segmentation and depth estimation to represent deformable tissues and tools. To achieve automatic inference of physical properties, we introduce an object-wise material field that initializes material parameters via MLLM and refines them through a differentiable Material Point Method (MPM) under joint supervision from rendered images and optical flow. Validated on both open-source and in-house datasets, our framework achieves superior simulation fidelity and physical accuracy compared to state-of-the-art methods, underscoring its potential to advance robot-assisted surgical applications.
[CV-16] End-to-end plaque counting and virus titration from laboratory plate images with deep learning
链接: https://arxiv.org/abs/2605.16008
作者: Eugenia Moris,Alicia Costábile,Sebastián Rey,Irene Ferreiro,Joaquín Hurtado,Lizandra Lissette Luciano,Matías Villagrán,Aisha Espino Vázquez,Jomari Ramos,Isadora Monteiro,María Victoria de Santiago,Pilar Moreno,Gonzalo Moratorio,José Ignacio Orlando
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Plaque assays remain the gold standard readout of virus infectivity; however, plaque counting from plate images is labor-intensive and prone to inter-operator variability. We present an end-to-end, computer-aided workflow for cytopathic effect-based virus titration directly from laboratory plaque assay images. The proposed approach combines two models derived from the Segment Anything Model (SAM): a SAM2-based well-segmentation module that localizes assay wells across heterogeneous imaging conditions, and a SAM-based plaque-segmentation model that detects and enumerates plaques within each well. The method was evaluated on a mixed dataset comprising private plaque assay images of Mayaro virus and Coxsackievirus B3, together with public Vaccinia virus images from the VACVPlaque dataset. The pipeline outputs per-well plaque counts, automatically computes plaque-forming units per milliliter (PFU/mL), and is integrated into a web-based platform that allows users to review results and organize experiments. On held-out plates (17 from MAYV/CVB3 and 22 from VACV), the workflow generalized across two plate formats (6-well and 12-well) and showed strong agreement with manual annotations (Pearson correlation coefficients of 0.92 for MAYV/CVB3 and 0.88 for VACV). Automated plaque counts were further compared with annotations from four independent experts, demonstrating high concordance. The proposed system will be open sourced and publicly released upon acceptance of this manuscript to enable reproducible, scalable, and audit-ready plaque assay analysis while substantially reducing manual annotation effort.
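The abstract says the pipeline converts per-well plaque counts into PFU/mL but does not spell out the formula. The sketch below uses the conventional titer calculation (plaques divided by the product of dilution and inoculated volume); the function name, argument names, and example values are assumptions for illustration, and the authors' implementation may differ, for example by averaging replicate wells.

```python
def pfu_per_ml(plaque_count, dilution, inoculum_volume_ml):
    """Conventional titer formula: plaques / (dilution * inoculated volume in mL)."""
    if dilution <= 0 or inoculum_volume_ml <= 0:
        raise ValueError("dilution and inoculum volume must be positive")
    return plaque_count / (dilution * inoculum_volume_ml)


# Hypothetical well: 42 plaques counted at a 10^-5 dilution with 0.2 mL inoculum.
titer = pfu_per_ml(plaque_count=42, dilution=1e-5, inoculum_volume_ml=0.2)
print(f"{titer:.2e} PFU/mL")
```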
[CV-17] Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
链接: https://arxiv.org/abs/2605.16003
作者: Mingqiang Wu,Weilun Feng,Zhefeng Zhang,Haotong Qin,Yuqi Li,Guoxin Fan,Xiaokun Liu,Zhulin An,Libo Huang,Yongjun Xu,Chuanguang Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at this https URL.
[CV-18] Segmentation Detection and Explanation: A Unified Framework for CT Appearance Reasoning
链接: https://arxiv.org/abs/2605.15997
作者: Yuyuan Liu,Can Peng,Yingyu Yang,Qianye Yang,Cheng Ouyang,J. Alison Noble
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures, submitted to IEEE Transactions on Medical Imaging (TMI)
Abstract:Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, with most methods lacking explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches focus on a single task, which is insufficient for clinical workflow analysis that requires multiple fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g., masks and bounding boxes) and textual reasoning. To progressively enhance localisation accuracy and semantic clarity, we further design a “closer-look” mechanism that allows the model to perform progressive coarse-to-fine visits to regions of interest under refined fields of view. To support model training and evaluation, we curated a new multimodal CT dataset containing pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions for visual objects constructed through an AI-assisted annotation process with human verification. Experiments on public benchmarks demonstrate consistent improvements over the SoTA, with gains of up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, while additionally providing appearance reasoning outputs. The code and dataset will be made available.
[CV-19] Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
链接: https://arxiv.org/abs/2605.15980
作者: Xiaoxuan He,Siming Fu,Zeyue Xue,Weijie Wang,Ruizhe He,Yuming Li,Dacheng Yin,Shuai Dong,Haoyang Huang,Hongfa Wang,Nan Duan,Bohan Zhuang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Group Relative Policy Optimization (GRPO) has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs by subsampling training timesteps with a sliding window, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO’s effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
[CV-20] Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
链接: https://arxiv.org/abs/2605.15967
作者: Fabio Rovai
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
备注: 10 pages, 3 figures, 2 tables
Abstract:We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
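The core idea in this abstract, an append-only log of typed triples that answers counterfactual queries by forking, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the class names `Event` and `EventLog`, the predicate strings, and the plain-string triples (the paper uses typed RDF triples and a structured intervention vocabulary that the abstract does not detail). A real substrate would also re-derive downstream events after the fork; this sketch only shows the fork itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    subject: str
    predicate: str
    obj: str


@dataclass
class EventLog:
    events: list = field(default_factory=list)

    def append(self, event):
        # Append-only: the log exposes no update or delete operation.
        self.events.append(event)

    def fork(self, at_index, intervention):
        """Copy the prefix before at_index, then record the intervention instead."""
        branch = EventLog(list(self.events[:at_index]))
        branch.append(intervention)
        return branch

    def objects_of(self, subject, predicate):
        return [
            e.obj
            for e in self.events
            if e.subject == subject and e.predicate == predicate
        ]


factual = EventLog()
factual.append(Event("cube", "movesToward", "sphere"))
factual.append(Event("cube", "collidesWith", "sphere"))

# Counterfactual: what if the cube had been removed before the collision?
counterfactual = factual.fork(1, Event("cube", "removedFrom", "scene"))
print(counterfactual.objects_of("cube", "collidesWith"))
```

Because the fork copies the prefix rather than editing it, the factual log stays intact and both branches remain inspectable event by event.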
[CV-21] WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation
链接: https://arxiv.org/abs/2605.15964
作者: Baining Zhao,Jiacheng Xu,Weicheng Feng,Xin Zhang,Zhaolu Wang,Haoyang Wang,Shilong Ji,Ziyou Wang,Jianjie Fang,Zhiheng Zheng,Weichen Zhang,Yu Shang,Wei Wu,Chen Gao,Xinlei Chen,Yong Li
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at this https URL.
[CV-22] Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models
链接: https://arxiv.org/abs/2605.15961
作者: Fabian Morelli,Arnas Uselis,Ankit Sonthalia,Seong Joon Oh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model’s visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: this https URL.
[CV-23] From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding CVPR
链接: https://arxiv.org/abs/2605.15951
作者: Yuyuan Liu,Yiping Ji,Anjie Le,Jiayuan Zhu,Jiazhen Pan,Can Peng,Jiajun Deng,Fengbei Liu,Junde Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Abstract:Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse rewards, often criterion-induced, yield minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate’s improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at this https URL.
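The failure mode this abstract describes, no learning signal when every sampled response fails, follows from how group-relative advantages are commonly computed. The sketch below uses the usual mean and standard-deviation normalization within a group; the function name and the `eps` constant are assumptions, and the paper's own revision and shaping mechanism is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard case: every candidate fails the criterion, so all rewards are equal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Easier case: one candidate succeeds and stands out from the group.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

print(all_fail)
print(mixed)
```

When all rewards in the group are identical, every advantage is zero and the policy gradient vanishes, which is why a denser signal is needed on hard cases.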
[CV-24] Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation
链接: https://arxiv.org/abs/2605.15942
作者: Chenhao Wang,Yingrui Ji,Yu Meng,Yao Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
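Aggregating per-token similarities in log-space, as the abstract describes, amounts to summing logarithms, which equals the logarithm of the product and so behaves like a soft logical AND over the concept and attribute tokens. The sketch below assumes similarities have already been mapped into (0, 1]; the function name, the `eps` clamp, and the example values are illustrative assumptions rather than the paper's exact scoring rule.

```python
import math


def compositional_score(similarities, eps=1e-6):
    """Sum of log-similarities, i.e. the log of their product (clamped at eps)."""
    return sum(math.log(max(s, eps)) for s in similarities)


# One concept token and two attribute tokens, all matching well.
match = compositional_score([0.9, 0.8, 0.85])

# Same concept and first attribute, but the second attribute does not match.
mismatch = compositional_score([0.9, 0.8, 0.05])

print(match, mismatch)
```

A single poorly matched attribute pulls the total score down sharply, whereas a plain average would largely hide it; working with sums of logs also avoids underflow when many tokens are multiplied.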
[CV-25] Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction
链接: https://arxiv.org/abs/2605.15923
作者: Chun-Peng Chang,Shaoxiang Wang,Alain Pagani,Dariu Gavrila,Holger Caesar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern image encoders achieve high generalization by decoupling semantic meaning from resolution, an ability yet to be fully realized in the 3D domain. We investigate the failure of 3D point cloud encoders to achieve similar generalization and find that existing models are highly sensitive to sampling resolution and scale changes, leading to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. While our objective is not the explicit generation of high-resolution point clouds, we find that this training objective encourages the model to learn robust, structural invariants. The resulting encoder achieves significant performance gains during resolution shifts while maintaining high efficiency through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0% higher mIoU at 3× lower resolution and a 20% improvement when the object scale is reduced by a factor of 3. These gains are achieved with a 45% smaller model size and an average reduction of 40% in input tokens.
[CV-26] AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression ICML2026
链接: https://arxiv.org/abs/2605.15921
作者: Dingming Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026
Abstract:Object removal aims to eliminate specified objects from images while plausibly inpainting the affected regions with background content. Current training-free methods typically block attention to object regions within self-attention layers during the image generation process, leveraging surrounding background information to restore the image. However, indiscriminate suppression of self-attention in the vacated areas can degrade generation quality, as the model must simultaneously reconstruct background content in these regions. To solve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention based on the estimated presence of target object concepts. Through analysis of self-attention map evolution across denoising timesteps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout the denoising process, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments demonstrate that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.
[CV-27] LoCO: Low-rank Compositional Rotation Fine-tuning IJCAI2026
链接: https://arxiv.org/abs/2605.15916
作者: An Nguyen,Jaesik Choi,Anh Tong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: IJCAI 2026
Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptations achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method keeps computational complexity low while preserving orthogonality with controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to both existing orthogonal and non-orthogonal methods.
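To make the phrase "orthogonal transformations through low-rank skew-symmetric matrices" concrete, the sketch below builds a skew-symmetric matrix from two thin factors and maps it to an orthogonal matrix with the Cayley transform. The Cayley transform is one standard choice and is an assumption here; the abstract does not state which parameterization LoCO uses, and its compositional rotation chains and parallel approximation scheme are not shown. The name `low_rank_rotation` and the shapes are hypothetical.

```python
import numpy as np


def low_rank_rotation(U, V):
    """Orthogonal matrix from the skew-symmetric A = U V^T - V U^T (rank <= 2r)."""
    d = U.shape[0]
    A = U @ V.T - V @ U.T          # A^T = -A by construction
    I = np.eye(d)
    # Cayley transform Q = (I - A)^{-1} (I + A); I - A is always invertible
    # for a real skew-symmetric A.
    return np.linalg.solve(I - A, I + A)


rng = np.random.default_rng(0)
d, r = 8, 2                        # feature dimension and rank (toy values)
U = 0.1 * rng.standard_normal((d, r))
V = 0.1 * rng.standard_normal((d, r))

Q = low_rank_rotation(U, V)
print(np.allclose(Q.T @ Q, np.eye(d)))
```

Only the 2·d·r entries of `U` and `V` would be trainable in such a scheme, and multiplying pretrained weights by `Q` preserves norms and pairwise angles, which is the geometric property the abstract emphasizes.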
[CV-28] RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations
Link: https://arxiv.org/abs/2605.15908
Authors: Yanhao Ge,Shanyan Guan,Weihao Wang,Ying Tai,Mingyu You
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.
[CV-29] A Causally Grounded Taxonomy for Image Degradation Robustness Evaluation
Link: https://arxiv.org/abs/2605.15906
Authors: Stefan Becker,Simon Weiss,Wolfgang Hübner,Michael Arens
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image degradations can occur during acquisition, processing, and transmission, altering visual appearance and affecting downstream vision tasks. They are studied in several communities, including synthetic corruption benchmarks for robustness evaluation, perceptual image quality assessment, and physically grounded analyses of imaging systems or real camera failures. Although these areas address closely related phenomena, they often use incompatible grouping schemes and backend-specific severity definitions, making results difficult to compare across datasets, degradation sources, and tasks. We propose a causally grounded framework for organizing and interpreting image degradations across these settings. Instead of introducing new degradations or redefining existing benchmarks, we provide an interpretive representation and measurement layer that makes implicit assumptions explicit. Each degradation is described along two orthogonal axes: its dominant causal source in the imaging pipeline (environment, sensor/optics, ISP/renderer/codec, or transfer/system), and its resulting perceptual effect. This dual-axis abstraction yields a compact taxonomy spanning algorithmic corruptions, perceptual distortions, and physically motivated imaging artifacts. To address inconsistent severity semantics without changing existing implementations, we introduce a lightweight severity measurement layer. For every degradation and each native severity level of a given backend, we quantify degradation strength using full-reference image quality metrics: PSNR, SSIM, and LPIPS. This makes severity observable and comparable across sources while preserving native parameterizations. We demonstrate the framework through COCO Degradation, a taxonomy-aligned benchmark for evaluating object detector robustness under diverse imaging conditions.
[CV-30] Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning
Link: https://arxiv.org/abs/2605.15894
Authors: Ranjith Chodavarapu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.
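The abstract names vacuity and dissonance but gives no formulas. As a rough sketch, assuming the usual subjective-logic definitions (vacuity = K / Σα, belief mass b_k = (α_k − 1) / Σα, dissonance built from pairwise belief balance), the two quantities can be read off a single Dirichlet output as below. The function names and example concentration vectors are illustrative, not the paper's.

```python
import numpy as np

def vacuity(alpha):
    """Epistemic-style uncertainty: K / sum(alpha); equals 1 when there is no evidence."""
    alpha = np.asarray(alpha, dtype=float)
    return float(alpha.size / alpha.sum())

def dissonance(alpha):
    """Conflict between belief masses (assumed subjective-logic definition)."""
    alpha = np.asarray(alpha, dtype=float)
    belief = (alpha - 1.0) / alpha.sum()      # b_k = evidence_k / S
    total = 0.0
    for i, b_i in enumerate(belief):
        others = np.delete(belief, i)
        denom = others.sum()
        if b_i <= 0 or denom <= 0:
            continue                          # no mass to conflict with
        # Balance is 1 for equal masses and falls toward 0 as they diverge
        bal = np.where(others > 0, 1.0 - np.abs(others - b_i) / (others + b_i), 0.0)
        total += b_i * (others * bal).sum() / denom
    return float(total)

# Three classes, as in the abstract: Light, Moderate, Heavy
no_evidence = [1.0, 1.0, 1.0]     # vacuity 1.0, dissonance 0.0
confident   = [21.0, 1.0, 1.0]    # low vacuity, no conflict
conflicting = [11.0, 11.0, 1.0]   # low vacuity, high dissonance

for a in (no_evidence, confident, conflicting):
    print(a, round(vacuity(a), 3), round(dissonance(a), 3))
```

Under these definitions, a patch with little total evidence scores high vacuity, while a patch with strong but split evidence scores high dissonance, which matches the epistemic and aleatoric roles the abstract assigns to the two measures.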
[CV-31] FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization
Link: https://arxiv.org/abs/2605.15880
Authors: Tingting Liu,Yuan Liu,Guiping Chen,Xiubao Sui,Qian Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Thermal infrared imaging is robust to illumination variations and smoke interference, making it important for all-weather perception. However, the lack of natural color and fine texture limits target recognition, human visual interpretation, and the transfer of visible-light models. Existing infrared colorization methods mainly rely on single-band images, where insufficient spectral cues may lead to structural distortion and semantic confusion. Although infrared hyperspectral images provide rich spectral responses and material information, existing single-band frameworks remain limited in modeling spatial-spectral coupling and weak texture details. To address these issues, this paper presents FSCM, a spectral-information-guided GAN framework. Within FSCM, a frequency-enhanced spatial-spectral state-space generator composed of cascaded FSB units is constructed. Each FSB integrates three complementary components: state-space modeling captures global spatial-spectral dependencies; the frequency enhancement module (FEM) combines multi-level wavelet decomposition and Fourier gating to recover structural contours, directional high-frequency details, and global frequency responses; and the dual-stream hybrid gating module (DGM) integrates deformation-aware sampling with sparse attention to enhance effective local structures and suppress background interference. Additionally, an online semantic segmentation-guided loss is introduced to constrain the generated results, improving semantic consistency in complex road scenes. Experiments show that FSCM outperforms existing infrared colorization methods in visual quality and semantic fidelity.
[CV-32] Unlocking Dense Metric Depth Estimation in VLMs
Link: https://arxiv.org/abs/2605.15876
Authors: Hanxun Yu,Xuan Qu,Yuxin Wang,Jianke Zhu,Lei Ke
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
[CV-33] SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval ICML2026
Link: https://arxiv.org/abs/2605.15868
Authors: Wenjie Yang,Hang Yu,Yuyu Guo,Peng Di
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026
Abstract:In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets used. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn the intersection mask of each image-text pair, allowing us to align the intersection while preserving the semantics of the differences. In the second stage, the learned mask is further utilized to construct positive and hard-negative samples by masking different parts of the image or text, which enables self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.
[CV-34] On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry
Link: https://arxiv.org/abs/2605.15860
Authors: Michał Król,Michał Salamonowicz,Władysław Skarbek,Michał Tomaszewski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 12 figures, 3 tables
Abstract:Accurate geometric calibration of RGB-thermal infrared (TIR) stereo camera systems is essential for multimodal building envelope analysis, yet remains challenging when low-cost thermal sensors with very low spatial resolution are employed. This paper presents a practical stereo calibration framework for an RGB camera (2028 x 1520 px) paired with a TIR camera operating at only 80 x 62 px - a pixel-count ratio of approximately 1:625. An active OLED screen dynamically switches modality-specific patterns (checkerboard for TIR, ChArUco for RGB) on a single physical surface, providing controlled and repeatable thermal contrast. A dedicated corner detection algorithm combining perspective rectification, Hessian saddle-point analysis, and Mean Shift localisation achieves reliable checkerboard detection at 80 x 62 px without per-frame parameter tuning. A baseline-constrained bundle adjustment enforces physically consistent rig geometry under the planar-calibration-object degeneracy, yielding a stereo baseline of 32.7 mm (nominal 30 mm) with an overall reprojection error of 0.382 px. The system is validated on a thermally active building mock-up using constant-depth and per-pixel depth estimation, demonstrating consistent TIR-to-RGB projection suitable for building energy performance assessment.
[CV-35] Do Less Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
Link: https://arxiv.org/abs/2605.15855
Authors: Renye Yan,Jikang Cheng,Shikun Sun,Yi Sun,You Wu,Wei Peng,Zongwei Wang,Ling Liang,Junliang Xing,Yimao Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite strong image-generation performance, diffusion models’ reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare ‘dual benefit’: a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.
[CV-36] GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction
Link: https://arxiv.org/abs/2605.15852
Authors: Leyang Chen,Junyi Wu,Zhiteng Li,Yulun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model’s own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at this https URL.
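To make the idea of score-based eviction with a privilege mechanism concrete, here is a minimal sketch under stated assumptions: each cached token already has a scalar importance score (GHOST derives its scores from the model's 3D geometry outputs, which the abstract does not detail), some tokens are flagged as protected, and a fixed budget caps how many tokens survive. The function `evict_tokens` and the toy numbers are hypothetical.

```python
import numpy as np

def evict_tokens(scores, protected, budget):
    """Return sorted indices of cache entries to keep.

    Protected tokens are always kept; the remaining budget is filled with the
    highest-scoring unprotected tokens. If the protected tokens alone exceed
    the budget, all of them are still kept.
    """
    scores = np.asarray(scores, dtype=float)
    protected = np.asarray(protected, dtype=bool)

    keep = np.flatnonzero(protected)
    remaining = max(budget - keep.size, 0)

    candidates = np.flatnonzero(~protected)
    # Stable sort so ties keep the earlier (older) token first
    ranked = candidates[np.argsort(-scores[candidates], kind="stable")]
    return np.sort(np.concatenate([keep, ranked[:remaining]]))

scores = [0.1, 0.9, 0.5, 0.3, 0.8]               # placeholder importance scores
protected = [True, False, False, False, False]   # e.g. a special token at index 0
kept = evict_tokens(scores, protected, budget=3)
print(kept)
```

A layer-wise budget allocation, as the abstract describes, would amount to calling such a routine with a different `budget` per layer. How those budgets are derived from cosine similarity is not specified in the abstract and is left out here.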
[CV-37] WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes
Link: https://arxiv.org/abs/2605.15843
Authors: Jichen Hu,Jiawei Guo,Jiazhong Cen,Chen Yang,Sikuang Li,Wei Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.
[CV-38] Community-aware evaluation and threshold calibration for open-set plankton image recognition
Link: https://arxiv.org/abs/2605.15835
Authors: Xi Chen(1),Eryuan Huang(2),Yingjun Xiao(3),Gang Fang(4) ((1) School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China, (2) School of Environment, South China Normal University, Guangzhou, China, (3) School of Artificial Intelligence, Guangzhou University, Guangzhou, China, (4) Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript. 14 figures/tables in total
Abstract:Automated plankton image recognition is increasingly used in aquatic ecosystem monitoring, but deployed classifiers inevitably encounter unseen taxa and non-target particles. Open-set recognition methods are usually evaluated with sample-level metrics such as AUROC, AUPR, and FPR@95% unknown-recall operating points, whereas ecological monitoring depends on community-level estimates of taxon abundance and diversity. This study examines the mismatch between these objectives using controlled pseudo-communities and three datasets spanning marine zooplankton imaged by ZooScan, marine phytoplankton imaged by IFCB, and freshwater plankton imaged by an in-situ camera. We define Open-Set Community Distortion (OSCD), a Bray-Curtis-style error over known taxa plus an unknown bin, with directional components distinguishing known-taxon overestimation from underestimation. Closed-set classifiers achieved high known-class accuracy, but unknown samples were often absorbed with high confidence and in structured ways. Sample-level OOD metrics were not sufficient to select ecological operating points: for MSP, FPR@95% unknown-recall thresholds produced large test-community OSCD on all three datasets mainly because true known taxa were over-rejected into the unknown bin. Community-aware threshold calibration reduced MSP OSCD relative to fixed 95% known recall on SYKE-ZooScan 2024 and SYKE-IFCB 2022; on ZooLake the fixed-recall baseline was already close to the community-aware threshold, and the best community-level method was a prototype-distance variant rather than MSP. The benefit of community-aware calibration therefore depends on validation-community representativeness and the gap between fixed recall and the community optimum. These results show that open-set plankton recognition should be evaluated as an ecological measurement problem, not only as a sample-level detection task.
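The abstract describes OSCD only as a Bray-Curtis-style error over known taxa plus an unknown bin, with directional components. The sketch below is one plausible reading, not the paper's definition: it computes the Bray-Curtis dissimilarity between true and predicted community counts and splits the known-taxon error into overestimation and underestimation. The function name and the toy community are assumptions.

```python
import numpy as np

def community_distortion(true_counts, pred_counts):
    """Bray-Curtis-style distortion; the LAST entry of each vector is the unknown bin.

    Returns (bray_curtis, known_over, known_under), all normalised by the
    combined abundance of both communities.
    """
    t = np.asarray(true_counts, dtype=float)
    p = np.asarray(pred_counts, dtype=float)
    total = (t + p).sum()

    bray_curtis = np.abs(t - p).sum() / total

    diff_known = p[:-1] - t[:-1]                      # known taxa only
    known_over = np.clip(diff_known, 0, None).sum() / total
    known_under = np.clip(-diff_known, 0, None).sum() / total
    return float(bray_curtis), float(known_over), float(known_under)

# Two known taxa plus an unknown bin; the classifier over-rejects known samples
true_community = [50, 30, 20]
pred_community = [40, 20, 40]
print(community_distortion(true_community, pred_community))
```

In this toy case the whole known-taxon error is underestimation, the pattern the abstract reports for MSP at FPR@95% unknown-recall thresholds, where true known taxa are pushed into the unknown bin.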
[CV-39] Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer
Link: https://arxiv.org/abs/2605.15828
Authors: Yipu Zhang,Jintao Cheng,Weilun Feng,Jiehao Luo,Chuanguang Yang,Zhulin An,Yongjun Xu,Wei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under the 4-bit quantization.
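As a hedged illustration of how a diagonal Fisher information matrix can rank quantization sensitivity, the sketch below uses the common empirical estimate (mean of squared per-sample gradients) and a second-order proxy Σ F_ii · (Δw_i)² for the loss increase caused by a weight perturbation. How FGQ folds these sensitivities into the Learnable Affine Transformation is not described in the abstract and is not modelled here. All names and numbers are illustrative.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean over samples of squared gradients."""
    g = np.asarray(per_sample_grads, dtype=float)
    return (g ** 2).mean(axis=0)

def quantization_sensitivity(fisher_diag, weight, weight_q):
    """Second-order proxy for loss increase: sum_i F_ii * (w_i - q_i)^2."""
    delta = np.asarray(weight, dtype=float) - np.asarray(weight_q, dtype=float)
    return float(np.sum(fisher_diag * delta ** 2))

# Toy per-sample gradients of two task losses w.r.t. the same two shared weights
grads_depth = [[1.0, 0.0], [3.0, 0.0]]   # depth loss depends only on weight 0
grads_pose  = [[0.0, 0.5], [0.0, 0.5]]   # pose loss depends only on weight 1

w   = np.array([0.5, 0.5])
w_q = np.array([0.0, 0.0])               # a deliberately crude "quantized" copy

for name, grads in (("depth", grads_depth), ("pose", grads_pose)):
    fisher = diagonal_fisher(grads)
    print(name, fisher, quantization_sensitivity(fisher, w, w_q))
```

Computing the Fisher diagonal separately per task, as done here, is what exposes the task-dependent sensitivities the abstract refers to: the same weight error costs the toy depth task far more than the toy pose task.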
[CV-40] FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
Link: https://arxiv.org/abs/2605.15824
Authors: Quanjian Song,Yefeng Shen,Mengting Chen,Hao Sun,Jinsong Lan,Xiaoyong Zhu,Bo Zheng,Liujuan Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180× faster than existing baselines.
[CV-41] StippleDiffusion: Capacity-Constrained Stippling using Controlled Diffusion
Link: https://arxiv.org/abs/2605.15816
Authors: Ofir Gilad,Aleksander Plocharski,Przemyslaw Musialski,Andrei Sharf
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures
Abstract:Stipple patterns, point sets whose local density tracks a target image, are traditionally produced by per-density iterative optimizers, which are slow, non-differentiable, and must be re-run from scratch for each new target. Learned alternatives have so far addressed only unconditional point generation; capacity-constrained, image-conditioned stippling has remained out of reach. We present the first diffusion-based sampler that simultaneously satisfies a learned local point-distribution prior and a continuous, image-defined capacity constraint at inference. The method is a ControlNet branch built on top of an optimal-transport-grid point-set diffusion baseline, conditioned on the target density map and a high-resolution image. Two design choices make the combination tractable: training and inference are restricted to the late-stage denoising regime, initialized from a density-weighted rejection sample, and the standard zero-convolution injection is replaced with a sigmoid-gated 1x1 projection that preserves the base model’s blue-noise structure under hard density signals. A single trained checkpoint accepts arbitrary target densities at inference, generalizes to point budgets that were not seen during training, and produces stipples in time nearly independent of the output point count. On the Icons-50 benchmark, our learned sampler reaches parity with per-density-optimized baselines on every reported metric while remaining differentiable end-to-end.
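The abstract mentions initialising the late-stage denoising from a density-weighted rejection sample. A minimal sketch of plain rejection sampling against a density map follows; it assumes a non-negative 2D array over the unit square and is not the authors' implementation. The function name and the toy density are hypothetical.

```python
import numpy as np

def density_rejection_sample(density, n_points, rng):
    """Draw n_points in [0, 1)^2 with probability proportional to `density`."""
    density = np.asarray(density, dtype=float)
    d_max = density.max()
    if d_max <= 0:
        raise ValueError("density map must contain at least one positive value")

    h, w = density.shape
    points = []
    while len(points) < n_points:
        x, y = rng.random(2)                        # uniform proposal
        row = min(int(y * h), h - 1)
        col = min(int(x * w), w - 1)
        # Accept with probability density / d_max
        if rng.random() * d_max < density[row, col]:
            points.append((x, y))
    return np.array(points)

rng = np.random.default_rng(0)
density = np.zeros((4, 4))
density[:, 2:] = 1.0                                # only the right half may receive points
stipples = density_rejection_sample(density, 200, rng)
print(stipples.shape, stipples[:, 0].min())
```

Such a sample already follows the target density on average but lacks blue-noise spacing, which is presumably why it serves only as a starting point for the learned denoiser.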
[CV-42] Embedding-perturbed Exploration Preference Optimization for Flow Models ICML2026
Link: https://arxiv.org/abs/2605.15803
Authors: Sujie Hu,Chubin Chen,Jiashu Zhu,Jiahong Wu,Xiangxiang Chu,Xiu Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026
Abstract:Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization (E²PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
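The variance-decay problem described above can be seen directly in the group-normalised advantage commonly used by GRPO-style methods, A_i = (r_i − mean(r)) / (std(r) + ε). The sketch below assumes that standard form. The paper's exact objective and its embedding-level perturbation are not reproduced, and the reward values are made up.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalised advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

diverse_group   = [1.0, 2.0, 3.0]        # samples differ, so there is a clear learning signal
collapsed_group = [0.7, 0.7, 0.7, 0.7]   # samples indistinguishable, so the signal vanishes

print(group_advantages(diverse_group))
print(group_advantages(collapsed_group))
```

When every sample in a group earns the same reward, all advantages are zero and the policy gradient carries no information, which is the stagnation the abstract attributes to decaying intra-group variance.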
[CV-43] Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion
Link: https://arxiv.org/abs/2605.15796
Authors: Xiongjun Guan,Jianjiang Feng,Jie Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Three-dimensional (3D) fingerprints preserve global finger geometry and local ridge structure while avoiding contact-induced deformation, but they remain difficult to integrate with legacy two-dimensional (2D) fingerprint systems. This paper addresses the intermediate stage between 3D acquisition and cross-modal matching, and presents a unified framework for 3D fingerprint preprocessing and registration across contactless and contact-based 2D modalities. The framework combines four components: 1) a nonparametric visualization and unwrapping method that converts a 3D fingerprint point cloud into a rolled-equivalent 2D representation without relying on a global finger-shape model; 2) a point-cloud fusion pipeline that registers and mosaics multiple partial 3D captures into a more complete fingerprint model; 3) an ellipse-based pose normalization method for canonical finger alignment; and 4) a pose-aware cross-modal registration strategy that improves compatibility between 3D fingerprints and both contactless and contact-based 2D fingerprints. Experiments on a self-collected multimodal fingerprint database containing 150 fingers show that the proposed framework achieves ridge-level 3D registration accuracy, robust pose estimation, and consistent gains in 2D compatibility. In particular, the 3D fusion error is concentrated around 0.09 mm, contactless 2D–3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores relative to generic 3D unwrapping. These results support the use of 3D fingerprints as an effective geometric bridge across heterogeneous fingerprint modalities.
[CV-44] Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models CVPR2026
Link: https://arxiv.org/abs/2605.15792
Authors: Yujun Tong,Dongliang Chang,Zijin Yin,Xintong Liu,Yuanchen Fang,Zhanyu Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 Findings
Abstract:The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Although recent works such as BAGEL and BLIP3o have achieved remarkable progress, in practice this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.
[CV-45] GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
Link: https://arxiv.org/abs/2605.15764
Authors: Junho Kim,Xu Cao,Houze Yang,Bikram Boote,Ana Jojic,Fiona Ryan,Bolin Lai,Sangmin Lee,James M. Rehg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question–answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze–gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
[CV-46] Learn2Splat: Extending the Horizon of Learned 3DGS Optimization
Link: https://arxiv.org/abs/2605.15760
Authors: Naama Pearl,Stefano Esposito,Haofei Xu,Amit Peleg,Patricia Gschossmann,Lorenzo Porzi,Peter Kontschieder,Gerard Pons-Moll,Andreas Geiger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting (3DGS) optimization is most commonly performed using standard optimizers (Adam, SGD). While stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that do not capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent works introduced learned optimizers that predict correlated updates informed by inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of optimization iterations and rely on manually scheduled learning rates to avoid degradation. In this paper, we introduce a learned optimizer for 3DGS that avoids degradation over extended optimization horizons without auxiliary mechanisms. To enable this, we propose a meta-learning scheme that extends the optimization horizon via a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information in its latent states. Results show improved early novel view synthesis quality while remaining stable over long horizons, with zero-shot generalization to unseen reconstruction settings. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers across sparse and dense view settings. Code and models will be released publicly. Our project page is available at this https URL .
[CV-47] Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models
Link: https://arxiv.org/abs/2605.15755
Authors: Cheng Zhang,Yuer Liu,Zhiyu Zhou,Hongxia Xie,Wen-Huang Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at this https URL
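The Dice and Tversky agreement scores mentioned in the abstract can be sketched as follows, treating the predicted and human-marked salient attributes of one artwork as sets. This is a minimal illustration only: the function names, the attribute strings, and the `alpha`/`beta` weights are assumptions, since the abstract does not state the paper's exact configuration.

```python
def tversky(pred, gold, alpha=0.5, beta=0.5):
    """Tversky index between two attribute sets.

    alpha weights false positives (predicted but not marked by humans),
    beta weights false negatives (marked by humans but not predicted).
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    # Two empty sets agree perfectly by convention.
    return 1.0 if denom == 0 else tp / denom


def dice(pred, gold):
    """Dice coefficient, which is the Tversky index with alpha = beta = 0.5."""
    return tversky(pred, gold, alpha=0.5, beta=0.5)


# Hypothetical attribute names, for illustration only.
predicted = {"color", "line", "texture", "shape"}
human_marked = {"color", "line", "space"}
dice_score = dice(predicted, human_marked)
recall_weighted = tversky(predicted, human_marked, alpha=0.3, beta=0.7)
```

Setting `beta` above `alpha` penalises missed human-marked attributes more than extra predicted ones, which is one way such a metric can reward compact but complete selections.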
[CV-48] Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces
Link: https://arxiv.org/abs/2605.15753
Authors: Xinggang Hu,Chenyangguang Zhang,Alexandros Delitzas,Xiangkui Zhang,Marc Pollefeys,Francis Engelmann,Xiangyang Ji
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, including a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.
[CV-49] HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
Link: https://arxiv.org/abs/2605.15741
Authors: Yu He,Lichen Ma,Zipeng Guo,Xinyuan Shan,Jingling Fu,Dong Chen,Junshi Huang,Yan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental “granularity dilemma”: capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework establishing Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifolds. Rather than injecting semantics via AdaLN, HyperDiT uses Cross-Attention mechanisms, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual Foundation Model (VFM), effectively reducing generation hallucination and artifacts. Extensive experiments demonstrate that HyperDiT achieves a state-of-the-art (SoTA) FID of 1.56 on ImageNet 256×256 directly within the pixel space. By combining the fine-grained stream with semantic guidance, HyperDiT offers a superior paradigm for high-fidelity pixel generation.
[CV-50] BARRIER: Bounded Activation Regions for Robust Information Erasure
Link: https://arxiv.org/abs/2605.15737
Authors: Jan Miksa,Patryk Krukowski,Przemysław Spurek,Dawid Damian Rymarczyk,Marcin Sendera
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a paradigm-shifting framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding hypercube. By driving unlearning updates exclusively within this forget interval and mathematically bounding the model response on the complement, we ensure rigorous protection of the retain distribution. This geometric construction transforms the preservation of knowledge from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits highly aggressive unlearning updates within the forget region. Empirical evaluations demonstrate that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while safeguarding the integrity of all other representations. Our code is available at this https URL.
[CV-51] BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation CVPR2026
Link: https://arxiv.org/abs/2605.15736
Authors: Huanyang Tong,Kai Liu,Fangjun Kuang,Huiling Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2026 Workshop
Abstract:Biomedical Vision–Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal “Golden Prompts”. In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: this https URL.
Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning
[CV-52] UAM: A Dual-Stream Perspective on Forgetting in VLA Training
Link: https://arxiv.org/abs/2605.15735
Authors: Jianke Zhang,Yuanfei Luo,Yucheng Hu,Xiaoyu Chen,Yanjiang Guo,Ziyang Liu,Hongbin Xu,Tian Lan,Jianyu Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision–language–action (VLA) models are typically built by fine-tuning a pretrained vision–language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM’s multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain’s dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM’s multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object–target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.
[CV-53] Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model
Link: https://arxiv.org/abs/2605.15733
Authors: Tianqiu Zhang,Muyang Lyu,Xiao Liu,Si Wu
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model’s capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.
[CV-54] DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation
Link: https://arxiv.org/abs/2605.15728
Authors: Yifan Gao,Lu Zou,Zhangjin Huang,Guoping Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on these diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.
[CV-55] DiLA: Disentangled Latent Action World Models
Link: https://arxiv.org/abs/2605.15725
Authors: Tianqiu Zhang,Muyang Lyu,Yufan Zhang,Fang Fang,Si Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project Page: this http URL
Abstract:Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.
[CV-56] GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective
Link: https://arxiv.org/abs/2605.15723
Authors: Xu Wang,Xunkai Li,Yinlin Zhu,Rong-Hua Li,Guoren Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
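To make the over-smoothing concern in the abstract concrete, here is a minimal sketch of plain finite-step graph signal smoothing: each node's embedding is repeatedly replaced by the mean of itself and its neighbors. This is generic neighbor averaging, not GOMA's learned modality-aware operators; the function name `smooth` and the toy graph are assumptions made for illustration.

```python
def smooth(features, neighbors, steps):
    """Apply `steps` rounds of mean aggregation over each node and its neighbors.

    features:  list of per-node embedding vectors (lists of floats)
    neighbors: neighbors[i] is the list of node indices adjacent to node i
    """
    x = [list(row) for row in features]
    for _ in range(steps):
        new_x = []
        for i, row in enumerate(x):
            # Include the node itself (self-loop) plus its neighbors.
            group = [i] + [j for j in neighbors[i] if j != i]
            new_x.append(
                [sum(x[j][d] for j in group) / len(group) for d in range(len(row))]
            )
        x = new_x
    return x


# Toy path graph 0 - 1 - 2 with one-dimensional embeddings.
features = [[0.0], [3.0], [6.0]]
neighbors = [[1], [0, 2], [1]]
one_step = smooth(features, neighbors, steps=1)
many_steps = smooth(features, neighbors, steps=50)
spread = max(v[0] for v in many_steps) - min(v[0] for v in many_steps)
```

After one step the embeddings move toward their neighbors but stay distinct; after many steps the spread between nodes shrinks toward zero, which is the collapse that a finite, controlled smoothing depth is meant to avoid.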
[CV-57] Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation
Link: https://arxiv.org/abs/2605.15722
Authors: Jeonghwa Lim,Minje Park,Sunghoon Joo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 11 pages, 6 figures, 6 tables
Abstract:Accurate electrocardiogram (ECG) delineation, i.e., the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios.
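The bidirectional mixing idea can be sketched for one-dimensional signals as a segment exchange between a labeled and an unlabeled recording. This is a generic CutMix-style swap under stated assumptions, not the authors' implementation: the function name `swap_segment` is hypothetical, the cut points are supplied by the caller, and the abstract does not say how CardioMix derives them from cardiac patterns. For the unlabeled signal, `lab_b` would hold pseudo-labels.

```python
def swap_segment(sig_a, lab_a, sig_b, lab_b, start, end):
    """Exchange the window [start, end) between two equal-length signals.

    Returns two (signal, labels) pairs: A with B's window pasted in,
    and B with A's window pasted in. Labels are swapped with the samples.
    """
    if not (len(sig_a) == len(lab_a) == len(sig_b) == len(lab_b)):
        raise ValueError("signals and label sequences must share one length")
    if not 0 <= start <= end <= len(sig_a):
        raise ValueError("window must lie inside the signal")

    def mix(base, donor):
        return base[:start] + donor[start:end] + base[end:]

    mixed_a = (mix(sig_a, sig_b), mix(lab_a, lab_b))
    mixed_b = (mix(sig_b, sig_a), mix(lab_b, lab_a))
    return mixed_a, mixed_b


# Toy example: four samples per signal, integer class labels per sample.
labeled_sig, labeled_lab = [0.1, 0.2, 0.3, 0.4], [0, 1, 1, 0]
unlabeled_sig, pseudo_lab = [0.9, 0.8, 0.7, 0.6], [2, 2, 0, 0]
mixed_a, mixed_b = swap_segment(
    labeled_sig, labeled_lab, unlabeled_sig, pseudo_lab, start=1, end=3
)
```

Choosing `start` and `end` at beat or wave boundaries, rather than at random positions, is what would keep the mixed signals physiologically plausible.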
[CV-58] Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment
Link: https://arxiv.org/abs/2605.15720
Authors: Yuchen Li,Zhen Zhao,Yi Liu,Luping Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.
[CV-59] EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy
Link: https://arxiv.org/abs/2605.15711
Authors: Xuanyu Ge,Zhongqi Wang,Jie Zhang,Shiguang Shan,Xilin Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 6 figures, 8 tables
Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.
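The two statistical ingredients named in the abstract, Tsallis entropy of an attention distribution and a Z-score against reference values, can be sketched as below. This is an illustration under assumptions rather than the EntropyScan pipeline: the entropic index `q = 2`, the function names, and the toy numbers are hypothetical, and the abstract does not specify which layers, heads, or decision thresholds are used.

```python
import math
from statistics import mean, stdev


def tsallis_entropy(weights, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i ** q)) / (q - 1) of a normalized distribution.

    `weights` are non-negative attention weights; they are normalized to sum to 1.
    As q approaches 1 the measure reduces to Shannon entropy.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def reference_z_score(value, reference_values):
    """Z-score of `value` relative to entropies measured on reference models."""
    return (value - mean(reference_values)) / stdev(reference_values)


# Toy attention over four visual tokens.
uniform_entropy = tsallis_entropy([0.25, 0.25, 0.25, 0.25])  # spread-out attention
peaked_entropy = tsallis_entropy([1.0, 0.0, 0.0, 0.0])       # fully concentrated

# Hypothetical reference entropies and a suspect model's entropy.
z = reference_z_score(0.60, [0.70, 0.72, 0.74])
```

A large absolute Z-score would indicate that a model's attention structure on benign samples deviates strongly from the reference, which is the kind of signal the abstract describes using for detection.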
[CV-60] 3D Segmentation Using Viewpoint-Dependent Spatial Relationships
Link: https://arxiv.org/abs/2605.15708
Authors: Ayaka Nanri(1),Klara Reichard(2,3),Mert Kiray(2,4,5),Federico Tombari(2,6),Benjamin Busam(2,4,5),Asako Kanezaki(1,7,8) ((1) Institute of Science Tokyo, (2) Technical University of Munich, (3) BMW Group, (4) Munich Center for Machine Learning (MCML), (5) Obsphera, (6) Google, (7) Tohoku University, (8) RIKEN)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as “left,” “right,” “front,” and “behind” ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.
[CV-61] How to Choose Your Teacher for Fine Grained Image Recognition CVPR2026
Link: https://arxiv.org/abs/2605.15689
Authors: Oswin Gosal,Edwin Arkel Rios,Augusto Christian Surya,Fernando Mikael,Bo-Cheng Lai,Min-Chun Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 3 figures, 4 tables
Abstract:Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: this https URL.
[CV-62] ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
Link: https://arxiv.org/abs/2605.15684
Authors: Kunpeng Du,Haizhen Xie,Sen Lu,Lei Yu,Binglei Bao,Huaao Tang,Chuntao Liu,Hao Wu,Yang Zhao,Zhicai Huang,Heyuan Gao,Zhijun Tu,Jie Hu,Xinghao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16 percent average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8x the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.
[CV-63] DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer
Link: https://arxiv.org/abs/2605.15682
Authors: Qingji Dong,Hang Dong,Mingqin Chen,Rui Zhang,Yitong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large-scale pre-trained diffusion models have been extensively adopted for real-world image super-resolution (SR) because of the powerful generative priors they provide through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt derived from the LR image and the incomplete semantic information of local patches at each inference step. In addition, most existing methods fail to generate detailed textures in local patches because their network designs and training strategies overemphasize global generation capabilities. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features with patch-level prompts while the pre-trained DiT provides global textual features with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model’s capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at this https URL.
[CV-64] DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer
Link: https://arxiv.org/abs/2605.15681
Authors: Nisha Huang,Yizhou Lin,Jie Guo,Xiu Li,Tong-Yee Lee,Zitong Yu
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent diffusion-based material transfer methods rely on image fine-tuning or complex architectures with auxiliary networks, but face challenges such as text dependency, additional computational costs, and feature misalignment. To address these limitations, we propose DealMaTe, which uses depth, normal, and lighting images for material transfer. DealMaTe is a simplified diffusion framework that eliminates text guidance and reference networks. We design a lightweight 3D information injection method, Multi-Dim 3D Shader LoRA, which, without modifying the base model weights, enables compatible control conditions and achieves harmonious and stable results. Additionally, we optimize the attention mechanism with Shader Causal Mutual Attention and key-value (KV) caching to reduce the inference latency caused by multiple conditions, improve computational efficiency, and achieve high-quality material transfer results with low architectural complexity. Extensive experiments covering a wide variety of objects and lighting conditions consistently demonstrate that DealMaTe achieves high-fidelity material transfer under arbitrary input materials. The code is available at this https URL.
[CV-65] VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
Link: https://arxiv.org/abs/2605.15672
Authors: Hyesoo Hong,Minsoo Kim,Wonje Jeung,Sangyeon Yoon,Dongjae Jeon,Albert No
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.
[CV-66] ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark
Link: https://arxiv.org/abs/2605.15666
Authors: Haozhe Si,Yuxuan Wan,Yuqing Wang,Minh Do,Han Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Hyperspectral imaging (HSI) provides dense spectral information for the Earth’s surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA’s EO-1 Hyperion mission, the world’s longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences (≥ 3 observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the ChronoEarth-Benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.
[CV-67] VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation
Link: https://arxiv.org/abs/2605.15661
Authors: Yan Luo,Ahmadou Aidara,Jingyi Lu,Jeremy Moebel,Kai Han,Mengyu Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model’s current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at this https URL.
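The scaling rule described in this abstract can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not give the formula, so the linear time ramp, the way the cosine term is mapped to [0, 1], the clipping bounds `lo` and `hi`, and all function names are assumptions made for illustration only.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two flattened velocity fields."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def adaptive_scale(w, t, v_a, v_b, lo=0.5, hi=1.5):
    """Return the nominal scale w times a bounded factor.

    t is assumed to run from 0 (pure noise) to 1 (final image).
    The factor form below is a hypothetical choice, not the paper's.
    """
    temporal = 0.5 + t                         # weaker early, stronger late
    align = 0.5 * (1.0 + cosine(v_a, v_b))     # map [-1, 1] to [0, 1]
    factor = float(np.clip(temporal * (0.5 + align), lo, hi))
    return w * factor

def guided_velocity(v_uncond, v_cond, w, t):
    """Standard CFG combination, with the adaptive scale in place of w."""
    s = adaptive_scale(w, t, v_uncond, v_cond)
    return v_uncond + s * (v_cond - v_uncond)
```

Setting `lo = hi = 1.0` pins the factor to 1, so the sketch reduces to fixed-scale CFG, mirroring the special case mentioned in the abstract.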
[CV-68] MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
Link: https://arxiv.org/abs/2605.15660
Authors: Nisha Huang,Henglin Liu,Yizhou Lin,Kaer Huang,Chubin Chen,Jie Guo,Tong-Yee Lee,Xiu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.
[CV-69] Learning Disentangled Representations for Generalized Multi-view Clustering
Link: https://arxiv.org/abs/2605.15640
Authors: Xin Zou,Ruimeng Liu,Chang Tang,Zhenglai Li,Xinwang Liu,Kunlun He,Wanqing Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by IEEE TPAMI 2026 (IEEE Transactions on Pattern Analysis and Machine Intelligence)
Abstract:Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal clustering results. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: this https URL.
[CV-70] LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs
Link: https://arxiv.org/abs/2605.15621
Authors: Hongyu Lu,Feng Zhang,Wenwei Jin,Huanling Hu,Tianjun Shi,Shikai Jiang,Yao Hu,Jiawei Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper includes 11 figures, multiple tables, comprehensive experimental results on 11 image understanding benchmarks and 3 video benchmarks, with extensive ablation studies and qualitative visualizations
Abstract:Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
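The two steps named in this abstract (estimate a dominant subspace via PCA, then score tokens by projection residual) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name, the `rank` and `keep` parameters, and the choice to keep the top-`keep` residuals are assumptions.

```python
import numpy as np

def lowrank_residual_prune(tokens, rank=8, keep=64):
    """Keep the tokens least explained by the dominant low-rank subspace.

    tokens: (n, d) array of visual token features.
    Returns the sorted indices of the kept tokens.
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of vt are principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                                # (rank, d)
    projected = centered @ basis.T @ basis           # part explained by the subspace
    residual = np.linalg.norm(centered - projected, axis=1)
    kept = np.argsort(residual)[-keep:]              # largest residuals
    return np.sort(kept)
```

For example, with 16 tokens lying in a 2D plane plus 2 tokens that leave the plane, calling the function with `rank=2, keep=2` returns the indices of the two off-plane tokens, since the in-plane tokens are almost fully explained by the subspace.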
[CV-71] Latent Video Prediction Learns Better World Models
Link: https://arxiv.org/abs/2605.15618
Authors: Ali J Alrasheed,Aryan Yazdan Parast,Basim Azam,James Bailey,Naveed Akhtar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.
[CV-72] Neutral-Reference Prompting for Vision-Language Models ICML2026
Link: https://arxiv.org/abs/2605.15615
Authors: Senmao Tian,Xiang Wei,Shunli Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026
Abstract:Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model’s surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.
[CV-73] Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study CVPR2026
Link: https://arxiv.org/abs/2605.15599
Authors: Alexander Hackett,Srikanth Thudumu,Ginny Fisher,Mahule Roy,Aisha Sartaj,Jason Fisher
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Presented at the 13th Workshop on Fine-Grained Visual Categorization (FGVC13) at CVPR 2026
Abstract:Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.
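The statistical check mentioned in this abstract, a permutation test on macro one-vs-rest AUC, can be sketched as follows. The abstract does not describe the exact permutation scheme, so this sketch makes an assumption: labels are shuffled against fixed held-out probability scores (for instance, leave-one-out predictions), without retraining the probe. All function names are hypothetical, and class labels are assumed to be integer column indices into `probs`.

```python
import numpy as np

def binary_auc(y, s):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def macro_ovr_auc(labels, probs):
    """Unweighted mean of one-vs-rest AUCs over the classes present in labels."""
    classes = np.unique(labels)
    return float(np.mean([binary_auc((labels == c).astype(int), probs[:, c])
                          for c in classes]))

def permutation_pvalue(labels, probs, n_perm=1000, seed=0):
    """Compare the observed AUC with AUCs obtained under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = macro_ovr_auc(labels, probs)
    null = np.array([macro_ovr_auc(rng.permutation(labels), probs)
                     for _ in range(n_perm)])
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p)
```

The `+1` in the numerator and denominator keeps the estimated p-value strictly positive, which is a common convention for Monte Carlo permutation tests with a finite number of shuffles.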
[CV-74] CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
Link: https://arxiv.org/abs/2605.15597
Authors: Jiale Liu,Jungang Li,Jieming Yu,Xinglin Yu,Zihao Dongfang,Zongjian Ding,Kaifeng Ding,Yi Yang,Lidong Chen,Yang Zou,Shunwen Bai,Jiahuan Zhang,Haoran Huang,Shan Huang,Yudong Gao,Mingjun Cheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 35 pages including appendix. Code and dataset: this https URL
Abstract:Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, calibrated pose; COVER-produced indoor frames include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
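The selection loop described in this abstract (score incremental coverage, penalize depth conflicts, pick greedily) resembles penalized greedy set cover. The sketch below illustrates only that generic pattern. It is not the COVER algorithm itself: the ERP projection and range-depth warping are omitted, and the candidate sets, conflict values, weight `lam`, and stopping rule are hypothetical inputs chosen for illustration.

```python
def greedy_select(candidates, conflict, lam=1.0, budget=None):
    """Greedily pick viewpoints by incremental coverage minus a conflict penalty.

    candidates: dict mapping viewpoint name -> set of surface cell ids it observes.
    conflict:   dict mapping viewpoint name -> depth-conflict penalty (default 0).
    Returns (chosen viewpoint names in pick order, set of covered cells).
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and (budget is None or len(chosen) < budget):
        def score(name):
            gain = len(remaining[name] - covered)      # newly covered cells
            return gain - lam * conflict.get(name, 0.0)
        best = max(sorted(remaining), key=score)       # sorted for stable ties
        if score(best) <= 0:
            break                                      # nothing useful left
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


views = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 2, 3, 4, 5, 6}}
chosen, covered = greedy_select(views, conflict={"d": 10.0})
print(chosen, sorted(covered))  # ['a', 'c'] [1, 2, 3, 4, 5, 6]
```

In this toy run, viewpoint `d` sees everything but is rejected because of its large conflict penalty, so two cheaper views cover the scene instead.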
[CV-75] Efficient Image Synthesis with Sphere Latent Encoder
Link: https://arxiv.org/abs/2605.15592
Authors: Tung Do,Thuan Hoang Nguyen,Hao Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report
Abstract:Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
[CV-76] Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes
Link: https://arxiv.org/abs/2605.15586
Authors: Tan-Ha Mai,Chao-Kai Chiang,Han-Hwa Shih,Gang Niu,Masashi Sugiyama,Hsuan-Tien Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 16 figures, 18 tables
Abstract:Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.
[CV-77] See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation
Link: https://arxiv.org/abs/2605.15585
Authors: Yuejia Li,Ke He,Junheng Li,Shutong Chen,Jingkang Xia,Zhiyue Su,Junchi Zhang,Mang Ye
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 4 figures
Abstract:Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.
[CV-78] AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models
Link: https://arxiv.org/abs/2605.15584
Authors: Zhiwei Li,Jiacheng Xue,Weining Wang,Ajian Liu,Xingyu Gao,Zhenan Sun,Qi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
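Moving a feature toward an anchor along the hypersphere, as this abstract describes, can be illustrated with spherical interpolation. The sketch below is an assumption-laden illustration rather than the AGC method: the abstract does not say how the anchor augmentation is chosen or how the step size adapts, so the angle-proportional rule in `adaptive_correct` and the `max_step` parameter are hypothetical.

```python
import numpy as np

def unit(v):
    """Project a vector onto the unit hypersphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def geodesic_step(x, anchor, step):
    """Move x toward anchor along their great circle by a fraction step in [0, 1]."""
    x, anchor = unit(x), unit(anchor)
    theta = np.arccos(np.clip(x @ anchor, -1.0, 1.0))
    if theta < 1e-6 or np.pi - theta < 1e-6:
        return x  # identical or antipodal: no unique geodesic direction
    w_x = np.sin((1.0 - step) * theta)
    w_a = np.sin(step * theta)
    return (w_x * x + w_a * anchor) / np.sin(theta)

def adaptive_correct(x, anchor, max_step=0.8):
    """Hypothetical adaptive rule: a larger angular gap gives a larger step."""
    theta = np.arccos(np.clip(unit(x) @ unit(anchor), -1.0, 1.0))
    step = max_step * theta / np.pi
    return geodesic_step(x, anchor, step)
```

Because the update follows the great circle, the corrected feature stays unit-norm, so cosine-similarity classification against text embeddings can be applied to it unchanged.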
[CV-79] Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling
Link: https://arxiv.org/abs/2605.15583
Authors: Ryohei Goto,Takuya Fujihashi,Shunsuke Saruwatari,Fumio Okura
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Automatic Face and Gesture Recognition (FG 2026), Oral
Abstract:We propose a method of estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we newly propose a conditional multi-view ancestral sampling (cMAS) that optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is available at: this https URL.
[CV-80] LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance
Link: https://arxiv.org/abs/2605.15582
Authors: Jiaxuan Zhao,Ali Bereyhi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IGARSS 2026. Code is available at: this https URL
Abstract:Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework that explicitly learns and injects semantic differences into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on the LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly large gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid to incorporate domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.
[CV-81] MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays
Link: https://arxiv.org/abs/2605.15574
Authors: Sunghwan Steve Cho,Yunseok Han,Jaeyoung Do
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages
Abstract:Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at this https URL
[CV-82] RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding
Link: https://arxiv.org/abs/2605.15561
Authors: Jiayan Yang,Zhuoyu Wu,Wenqi Fang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: under revision
Abstract:Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.
[CV-83] 3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds
Link: https://arxiv.org/abs/2605.15546
Authors: Bingwen Qiu, Yuan Liu, Junqi Bai, Tong Jiang, Ben Liang, Fangzhou Chen, Xiubao Sui, Qian Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to use the SSM's linear complexity and strength in long-sequence modeling to capture global interactions between sparse and distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, alleviating the tension between receptive field enlargement and geometric preservation when detecting distant objects. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure in occluded and distant areas. Extensive experiments on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at this https URL.
[CV-84] SkiP: When to Skip and When to Refine for Efficient Robot Manipulation
Link: https://arxiv.org/abs/2605.15536
Authors: Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across various policy backbones. Project page: this https URL.
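The relabeling rule in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `relabel_skip_targets`, the boolean `is_key` mask (which the paper obtains from MSK), and the choice to leave trailing skip steps with no later key segment unchanged are all hypothetical.

```python
from typing import List, Sequence, TypeVar

A = TypeVar("A")

def relabel_skip_targets(actions: Sequence[A], is_key: Sequence[bool]) -> List[A]:
    """Return behavior-cloning targets after skip-segment relabeling.

    Key steps keep their own action. Each skip step is retargeted to the
    action at the first step (entrance) of the next key segment.
    Assumption: skip steps after the last key segment are left unchanged.
    """
    targets = list(actions)
    next_entrance = None  # index of the entrance of the nearest key segment to the right
    # Scan right to left: while inside a key run, the entrance index keeps
    # moving left, so on leaving the run it points at the run's first step.
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            next_entrance = t
        elif next_entrance is not None:
            targets[t] = actions[next_entrance]
    return targets

# Hypothetical 7-step demonstration with two key segments: steps 2-3 and step 6.
demo_actions = [0, 1, 2, 3, 4, 5, 6]
demo_is_key = [False, False, True, True, False, False, True]
relabeled = relabel_skip_targets(demo_actions, demo_is_key)
```

With these targets, a policy queried anywhere inside a skip segment is trained to output the action that starts the next key segment, which is what lets it jump over the free-space steps in one decision.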
[CV-85] Learning Dynamic Structural Specialization for Underwater Salient Object Detection
Link: https://arxiv.org/abs/2605.15535
Authors: Lin Hong, Chenhui Wang, Linan Deng, Yuning Cui, Yu Zhang, Xin Wang, Bojian Zhang, Wenqi Ren, Xingchen Yang, Fumin Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages
Abstract:Underwater salient object detection (USOD) has attracted increasing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradations, which often lead to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, a novel RGB-based USOD method built upon dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the extracted shared base representation is decomposed into a boundary-sensitive branch for modeling fine-grained boundary details and a region-coherent branch for capturing region-level structural consistency. A spatial coordination module is then introduced to adaptively regulate the relative contributions of the two branches according to local structural context. Moreover, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, enabling DSS-USOD to better balance boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.
[CV-86] Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance ICIP2026
Link: https://arxiv.org/abs/2605.15533
Authors: Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi, Junlan Feng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICIP 2026
Abstract:Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We also introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and integrates the rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.
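The region-dependent noise idea behind SNIS can be illustrated with a small NumPy sketch. The abstract only states that edited regions receive higher noise levels than unedited ones; everything else here is an assumption made for illustration, including the function name `structural_noise_init`, the two noise levels, the latent shape, and the variance-preserving blend.

```python
import numpy as np

def structural_noise_init(latent, edit_mask, sigma_edit=0.9, sigma_keep=0.2, seed=0):
    """Blend a clean latent with Gaussian noise using a per-position noise level.

    edit_mask is 1 where content should change and 0 where it should be kept.
    The blend sqrt(1 - sigma^2) * latent + sigma * noise is an assumed form,
    not the formula used in the paper.
    """
    rng = np.random.default_rng(seed)
    sigma = np.where(edit_mask > 0, sigma_edit, sigma_keep)
    noise = rng.standard_normal(latent.shape)
    noisy = np.sqrt(1.0 - sigma**2) * latent + sigma * noise
    return noisy, sigma

# Hypothetical (frames, height, width) latent; the right half is the edit region.
latent = np.ones((4, 8, 8))
edit_mask = np.zeros((4, 8, 8))
edit_mask[:, :, 4:] = 1
noisy, sigma = structural_noise_init(latent, edit_mask)
```

In this sketch the unedited half stays close to the source latent, while the edited half is mostly noise and so leaves the denoiser free to change its content.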
[CV-87] Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning ICML2026
Link: https://arxiv.org/abs/2605.15523
Authors: Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao, Xiaochao Qu, Xinxiao Wu, Luoqi Liu, Ting Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026
Abstract:Scene text editing aims to modify text in a target region of an image while preserving the surrounding background style and texture. Existing methods rely solely on image background information while neglecting the visual details of target regions, which discards stylistic features of the original text and essentially degrades the task to text rendering. Moreover, the conditions imposed by a pre-trained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), the method achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Our project page: this https URL.
[CV-88] DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments AAMAS2026
Link: https://arxiv.org/abs/2605.15519
Authors: Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 Pages, 12 figures, Accepted to AAMAS 2026
Abstract:Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.
[CV-89] AnyAct: Towards Human Reenactment of Character Motion From Video
Link: https://arxiv.org/abs/2605.15497
Authors: Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan, Leidong Fan, Qing Li, Kanglin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 12 pages
Abstract:We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.
[CV-90] LAPS: Improving Incremental LiDAR Mapping using Active Pooling and Sampling for Neural Distance Fields
Link: https://arxiv.org/abs/2605.15496
Authors: Dongjae Lee, Wooseong Yang, Yifu Tao, Maurice Fallon, Ayoung Kim
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at RA-L 2026
Abstract:Neural distance fields offer a compact and continuous representation of 3D geometry, making them attractive for incremental LiDAR mapping. However, their online optimization is vulnerable to catastrophic forgetting, where new observations can degrade previously reconstructed geometry. Replay-based training is commonly used to address this issue, but existing methods typically rely on passive replay buffers and uniform sampling, which can waste memory on redundant observations and under-train poorly constrained regions. We propose LAPS, a replay management framework for incremental neural mapping that improves both replay retention and replay allocation during online updates. LAPS combines reliability-based active pooling to retain reliable historical samples under limited memory with uncertainty-guided active sampling to focus optimization on under-constrained regions. Experiments on synthetic and real-world benchmarks show that LAPS consistently improves reconstruction completeness while maintaining competitive geometric accuracy. On Oxford Spires, it improves recall by 4.66 pp and F1-score by 3.79 pp over PIN-SLAM on the Blenheim Palace 05 sequence. We release our open source implementation at: this https URL.
[CV-91] FLASH: Efficient Visuomotor Policy via Sparse Sampling
Link: https://arxiv.org/abs/2605.15492
Authors: Jiaqi Bai, Jindou Jia, Yuxuan Hu, Gen Li, Xiangyu Chen, Tuo An, Kuangji Zuo, Jianfei Yang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures
Abstract:Generative models such as diffusion and flow matching have become dominant paradigms for visuomotor policy learning, yet their reliance on iterative denoising incurs high inference latency incompatible with real-time robotic control. We present Fast Legendre-polynomial Action policy via Sparse History-anchored flow (FLASH Policy), which replaces discrete action-chunk generation with a continuous Legendre polynomial trajectory representation. Specifically, by fitting expert demonstrations under sparse temporal sampling, FLASH enables a single inference to cover a significantly extended action horizon. To further accelerate generation, FLASH initiates the flow matching process from history polynomial coefficients rather than uninformative Gaussian noise, shortening the transport distance and enabling accurate single-step inference. Moreover, analytic polynomial differentiation directly provides desired velocity feed-forward signals to the torque controller without numerical approximation. Extensive experiments on five simulated and two real-world manipulation tasks demonstrate that FLASH achieves state-of-the-art success rates (≥ 92% across all tasks), a per-episode inference time of 31.40 ms (up to 175× faster than diffusion policies and 18× faster than prior flow matching policies), up to 4× faster training convergence than ACT, and a 5× to 7× reduction in controller tracking error compared to discrete-action baselines.
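The trajectory representation described here, a Legendre fit to sparsely sampled demonstrations followed by analytic differentiation for velocity feed-forward, can be sketched with NumPy's `numpy.polynomial.legendre` module. This is only an illustration of the representation, not the FLASH policy itself: the cubic test trajectory, the sample counts, and the degree are assumptions, and time is normalized to [-1, 1].

```python
import numpy as np
from numpy.polynomial import legendre as L

# Sparse samples of a hypothetical 1-D joint trajectory on normalized time [-1, 1].
t_sparse = np.linspace(-1.0, 1.0, 8)
q_sparse = t_sparse**3 - 0.5 * t_sparse

# Least-squares fit of Legendre coefficients (degree 3 is an assumed choice).
coef = L.legfit(t_sparse, q_sparse, 3)

# Analytic derivative of the series: coefficients of dq/dt in the Legendre basis.
vel_coef = L.legder(coef)

# Evaluate position and velocity on a dense grid from the same few coefficients.
t_dense = np.linspace(-1.0, 1.0, 50)
q_dense = L.legval(t_dense, coef)
v_dense = L.legval(t_dense, vel_coef)
```

Because the velocity comes from differentiating the series rather than from finite differences, it carries no discretization error. For a real horizon of T seconds mapped onto [-1, 1], the velocity would additionally be scaled by 2/T.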
[CV-92] Learning Normalized Energy Models for Linear Inverse Problems ICML2026
Link: https://arxiv.org/abs/2605.15487
Authors: Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: ICML 2026
Abstract:Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: (i) the prior density is represented implicitly, and (ii) they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on-the-fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.
[CV-93] When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing
Link: https://arxiv.org/abs/2605.15484
Authors: Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages (main + appendix), 8 figures, 18 tables. Under review at TMLR. Code and aggregate results: this https URL
Abstract:Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-k routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction ρ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing (k ≥ 2) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-k, holding architecture, initialization, and ρ fixed, reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: this https URL.
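The difference between softmaxing routing logits over the batch and over the experts comes down to the normalization axis. The NumPy sketch below is one reading of the abstract, not the paper's code: the `(batch, experts)` logit layout, the helper `softmax`, and the random logits are assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))  # hypothetical routing logits: (batch, experts)

# Batch-axis dispatch: each expert's weights are normalized across samples,
# so one sample's weight depends on which other samples share the batch.
batch_axis = softmax(logits, axis=0)

# Per-sample variant: each sample's weights are normalized across experts,
# so its routing is independent of the rest of the batch.
expert_axis = softmax(logits, axis=1)

# Re-route only the first two samples to expose the batch dependence.
sub_batch_axis = softmax(logits[:2], axis=0)
sub_expert_axis = softmax(logits[:2], axis=1)
```

Under expert-axis normalization the first two samples get the same weights whether or not the other two are present. Under batch-axis normalization their weights change with batch composition.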
[CV-94] EgoExo-WM: Unlocking Exo Video for Ego World Models
Link: https://arxiv.org/abs/2605.15477
Authors: Danny Tran, Roberto Martín-Martín, Kristen Grauman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans’ physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent’s action space – and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.
[CV-95] A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation
Link: https://arxiv.org/abs/2605.15475
Authors: Haijian Lai, Bowen Liu, Man Xu, Chan-Tong Lam, João Macedo, Benjamin Ng, Sio-Kei Im
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted for publication in IEEE Transactions on Multimedia
Abstract:We introduce an empowered transposed Fully Connected Weighted (t-FCW) graph representation to embed point clouds into a metric space. While the original t-FCW has shown promising results for point cloud classification, the reasons behind its effectiveness and its broader applicability remained unclear. In this work, we analyze the properties that make the empowered and original t-FCW effective and design a network that uses the empowered t-FCW as its only feature extractor. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation using the empowered t-FCW. Our analysis reveals that the empowered t-FCW inherits robustness from surface descriptors and provides interpretability through dimension-wise relations. These properties enable a highly efficient and interpretable network, which processes the ModelNet40 classification problem in approximately 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t-FCW can function both as a lightweight standalone baseline and as a complementary plug-in to existing deep models.
[CV-96] Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction
Link: https://arxiv.org/abs/2605.15466
Authors: Santosh Kumar Paidi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures
Abstract:Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy (R² = 0.43). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.
[CV-97] Video Models Can Reason with Verifiable Rewards
Link: https://arxiv.org/abs/2605.15458
Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL
Abstract:Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
[CV-98] RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects
Link: https://arxiv.org/abs/2605.15450
Authors: Chunming He, Rihan Zhang, Dingming Zhang, Chengyu Fang, Longxiang Tang, Jingjia Feng, Fengyang Xiao, Sina Farsiu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE, comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
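For reference, the classical Retinex model that this decomposition builds on writes the observed image $I$ as the pixel-wise product of an illumination map $L$ and a reflectance map $R$. All three live on the same pixel grid, which is what makes the decomposition homogeneous in the abstract's sense. The notation below is the standard textbook form and is not necessarily the one used in the paper.

```latex
I(x) = L(x)\, R(x), \qquad \log I(x) = \log L(x) + \log R(x)
```

In the log domain the two factors become additive, so a foreground-background difference in $\log I$ splits into an illumination difference plus a reflectance difference at the same pixel $x$.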
[CV-99] Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones
Link: https://arxiv.org/abs/2605.15430
Authors: Alex Dunnett, Leonie Bottomley, Mirko Kovac, Basaran Bahadir Kocer
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress version accepted to the Recent Advances in Robotic Perception for Forestry
Abstract:This study demonstrates a method to locate an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree whose branch diameters are sufficiently thick and whose available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.
[CV-100] Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models
Link: https://arxiv.org/abs/2605.15424
Authors: Po-Chien Luan, Wuyang Li, Yang Gao, Alexandre Alahi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human trajectory forecasting is crucial for safe navigation in crowded environments, requiring models that balance accuracy with computational efficiency. Efficiently modeling social interactions is key to performance in dense crowds. Yet, most recent methods rely on attention mechanisms, which are effective at capturing complex dependencies, but incur quadratic computational costs that scale poorly with the growing number of neighbors. Recently, Selective State-Space Models have provided a linear-time alternative; however, their inherently sequential design is misaligned with the unstructured and dynamic nature of social interactions. To address this challenge, we propose Social-Mamba, a forecasting architecture that reformulates social interactions as structured sequential processes. At its core is the Cycle Mamba block, a novel module that enables continuous bidirectional information flow. Social-Mamba organizes agents on an egocentric grid and introduces social triplet factorization, which decomposes interactions into temporal, egocentric, and goal-centric scans. These are dynamically integrated through a learnable social gate and global scan to generate accurate and efficient trajectory predictions. Extensive experiments on five trajectory forecasting benchmarks show that Social-Mamba achieves state-of-the-art accuracy while offering superior parameter efficiency and computational scalability. Furthermore, embedding Social-Mamba into a flow-matching framework further enhances both accuracy and efficiency, establishing it as a flexible and robust foundation for future trajectory forecasting research. The code is publicly available: this https URL
[CV-101] MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes
Link: https://arxiv.org/abs/2605.15423
Authors: Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:
Abstract:Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, unfeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at this https URL
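The "probability union" aggregation mentioned for the Rescore step can be illustrated with the textbook union rule for independent events, P(A ∪ B) = P(A) + P(B) − P(A)P(B). The abstract does not give the exact formula, so the sketch below is an assumption: the function name `union_score` is hypothetical, and per-frame confidences along one track are treated as independent.

```python
from functools import reduce
from typing import Iterable

def union_score(scores: Iterable[float]) -> float:
    """Combine per-frame confidences for one class along a track.

    Uses 1 - prod(1 - p_i), the probability that at least one detection
    is correct if the frames are independent (an assumed simplification).
    """
    miss_all = reduce(lambda acc, p: acc * (1.0 - p), scores, 1.0)
    return 1.0 - miss_all

# Hypothetical track: three weak detections of the same class across frames.
track_scores = [0.5, 0.5, 0.2]
combined = union_score(track_scores)
```

Repeated moderate-confidence detections of the same class reinforce one another, so a class seen consistently along a track can outscore a single-frame misclassification.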
[CV-102] U-SEG: Uncertainty in SEGmentation – A systematic multi-variable exploration CVPR
Link: https://arxiv.org/abs/2605.15421
Authors: Michael Smith, Frank P. Ferrie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR Findings Track 2026
Abstract:In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.
[CV-103] 3DEditSafe: Defending 3D Editing Pipelines from Unsafe Generation
Link: https://arxiv.org/abs/2605.15398
Authors: Nicole Meng, Zheyuan Liu, Meng Jiang, Yingjie Lao
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in 3D generative editing, particularly pipelines based on 3D Gaussian Splatting (3DGS), have achieved high-fidelity, multi-view-consistent scene manipulation from text prompts. However, we find that these pipelines also introduce new safety risks when unsafe prompts produce edits that are propagated and optimized across views. In this work, we study unsafe generation in 3D editing pipelines and show that such behavior can lead to coherent, undesirable Not-Safe-For-Work (NSFW) content in the final 3D representation. To address this, we propose 3DEditSafe, a safety-regularized 3D editing framework that constrains unsafe semantic propagation during optimization. 3DEditSafe combines generation-stage safety guidance with rendered-view 3D safety regularization, safe semantic projection, residue suppression, and mask-aware preservation to steer optimization away from unsafe editing directions. We evaluate our approach on EditSplat scenes using an object-compatible unsafe prompt benchmark and show that 2D safety guidance alone is not consistently sufficient to prevent unsafe 3D edits. 3DEditSafe reduces unsafe semantic alignment and view-level attack success rates, while revealing a safety-quality tradeoff in which stronger unsafe suppression can introduce artifacts or reduce unsafe-prompt fidelity. To our knowledge, this work is the first attempt to study and defend against unsafe generation in text-driven 3D editing pipelines, highlighting the need for safety mechanisms that operate directly on optimized 3D representations.
[CV-104] ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest
Link: https://arxiv.org/abs/2605.15397
Authors: Kangning Cui, Surendra Bohara, Suraj Prasai, Zishan Shao, Wei Tang, Martin Pillaca, Edwin Flores, Zhen Yang, Gregory Larsen, Evan Dethier, David Lutz, Jean-Michel Morel, Miles Silman, Victor Pauca, Fan Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 70 pages, 35 figures, 28 tables
Abstract:Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.
[CV-105] PanoWorld: Geometry-Consistent Panoramic Video World Modeling
Link: https://arxiv.org/abs/2605.15391
Authors: Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at this https URL.
[CV-106] MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays
Link: https://arxiv.org/abs/2605.15383
Authors: Emre Hayir, Lorin Crawford, Alex X. Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented, with each proposed model evaluated on different tasks and datasets, using custom pipelines and metrics, making it difficult to fairly compare models. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely-used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates on the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at this https URL.
[CV-107] ChangeFlow – Latent Rectified Flow for Change Detection in Remote Sensing
Link: https://arxiv.org/abs/2605.15375
Authors: Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimation that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: this https URL
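The abstract states that aggregating several sampled change masks improves robustness and that sample agreement serves as a confidence estimate, without specifying how either is computed. The following is a minimal sketch under assumptions of our own: per-pixel averaging with a 0.5 threshold as the aggregation rule, and |2p - 1| as the agreement score. The function name `ensemble_masks` and the toy 2x2 masks are hypothetical.

```python
def ensemble_masks(samples, threshold=0.5):
    """Combine several sampled binary masks into one mask plus an agreement map.

    samples: list of equally sized masks, each a list of rows of 0/1 values.
    Assumption: aggregation is a per-pixel mean followed by thresholding.
    """
    n = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    mean = [
        [sum(s[r][c] for s in samples) / n for c in range(width)]
        for r in range(height)
    ]
    mask = [[1 if p >= threshold else 0 for p in row] for row in mean]
    # Agreement is 1.0 when all samples vote the same way and 0.0 on an even split.
    agreement = [[abs(2.0 * p - 1.0) for p in row] for row in mean]
    return mask, agreement


# Four hypothetical samples of a 2x2 change mask.
samples = [
    [[1, 0], [1, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [1, 0]],
]
mask, agreement = ensemble_masks(samples)
print(mask)
print(agreement)
```

Pixels with low agreement are the ones on which the samples split, which corresponds to the ambiguous regions mentioned in the abstract.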
[CV-108] Discretizing Group-Convolutional Neural Networks for 3D Geometry in Feature Space
Link: https://arxiv.org/abs/2605.15368
Authors: Daniel Franzen, Jean Philip Filling, Michael Wand
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 2 tables
Abstract:Group-convolutional neural networks (GCNNs) are among the most important methods for introducing symmetry as an inductive bias in deep learning: in each linear layer, GCNNs sample a transformation group G densely and correlate data and filters in different poses (with suitable anti-aliasing for steerable GCNNs) to maintain equivariance with respect to G. Unfortunately, applying filters to many data items resulting from this sampling is expensive (even for translations alone, i.e., in ordinary CNNs), and costs grow exponentially with increasing degrees of freedom (such as translations and rotations in 3D), which often hinders practical applications. In this paper, we propose sampling in feature space, i.e., replacing geometrically dense samples with representative samples selected by feature similarity. This decouples geometric resolution from memory and processing costs during training and inference, providing a novel way to trade off computational effort and accuracy. Our main empirical finding is that a coarse feature-space sampling already preserves classification accuracy remarkably well, which permits precomputation based on geometric similarity, accelerating the training of equivariant 3D classifiers substantially.
[CV-109] Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Link: https://arxiv.org/abs/2605.15342
Authors: Arsha Nagrani, Jasper Uijilings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of ‘where’ and ‘when’ to look yields substantial improvements in performance. Minerva-Ego can be downloaded at this https URL.
[CV-110] Multimodal Object Detection Under Sparse Forest-Canopy Occlusion
Link: https://arxiv.org/abs/2605.15326
Authors: Nitik Jain, Mangal Kothari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible–thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible–thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of approximately 0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.
[CV-111] COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection
Link: https://arxiv.org/abs/2605.15325
Authors: Darryl Cherian Jacob, Xinyu Liu, Kai Wang, Pan He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript currently under review for publication
Abstract:Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at this https URL
[CV-112] FFAvatar: Few-Shot Feed-Forward and Generalizable Avatar Reconstruction
Link: https://arxiv.org/abs/2605.15320
Authors: Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian, Jian Wang
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Page: this https URL
Abstract:Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.
[CV-113] Beyond Performance Disparities: A Three-Level Audit of Representational Harm in CelebA
Link: https://arxiv.org/abs/2605.15312
Authors: Sieun Park, Yuanmo He
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures
Abstract:Large-scale facial datasets like CelebA are widely used in computer vision, yet the cultural biases embedded in their labels remain underexplored. Fairness research has distinguished representational from allocational harms, but audits of computer vision datasets have mostly examined categorical labels, leaving open how such harms appear in learned features and model attention. This paper examines CelebA at three levels: dataset structure, learned feature weights, and spatial attention, focusing on how gendered double standards of ageing and beauty are encoded in the data and reproduced in model behaviour. First, hierarchical clustering of 202,599 images shows that the 39 attributes organise into latent trait bundles aligned with cultural archetypes: performative femininity (youth, makeup, adornment) and professional masculinity (ageing, facial hair, formal attire). Female faces, though more often rated attractive overall, incur steep penalties when assigned to ageing or masculine-coded clusters. Second, XGBoost with SHAP analysis reveals gender-specific effects, such as adiposity reducing attractiveness only for females. Third, Grad-CAM finds that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older males drift toward peripheral cues such as hair and clothing. Older males attain the highest accuracy but the lowest average precision, indicating categorical exclusion of groups outside the dataset’s evaluative templates. Cultural double standards thus pass from media representation into dataset labels, feature weights, and model attention, producing two representational harms: hyper-scrutiny of women under a narrow evaluative template, and exclusion of older men from the scheme entirely. Fairness metrics focused on performance disparities mask both, underscoring the need to address representational harm in fairness research.
[CV-114] One Pass Is Not Enough: Recursive Latent Refinement for Generative Models
Link: https://arxiv.org/abs/2605.15309
Authors: Mehdi Esmaeilzadeh, Alexia Jolicoeur-Martineau, Chirag Vashist, Ke Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite remarkable progress, image generation is far from solved. The dominant metric, FID, conflates sample fidelity with mode coverage and is close to being saturated. Yet a model can still exhibit mode collapse while achieving a low FID, since a handful of sharp, near-duplicate images can outscore a model that faithfully covers the full data distribution. We argue that precision and recall are essential complements to FID, and that because FID is already saturated, the more meaningful goal is to improve diversity and coverage. Achieving high recall requires a model that explicitly prioritizes mode coverage, unlike most generative models, which optimize sample fidelity. We introduce RTM, which replaces the single-pass latent mapping in style-based generators with an iterative refinement process, and show that this consistently improves both quality and diversity. Integrated with Implicit Maximum Likelihood Estimation (IMLE), which optimizes mode coverage by design, RTM achieves the highest precision and recall among current state-of-the-art approaches while maintaining competitive FID, with improvements across CIFAR-10, CelebA-HQ at 256x256, and nine few-shot benchmarks. RTM also improves StyleGAN2 and StyleGAN2-ADA on CIFAR-10 and AFHQ-v1 at 512x512, demonstrating that the benefit is not specific to IMLE. Unlike flow-matching baselines that achieve competitive FID at the expense of coverage, recursive refinement improves both quality and diversity simultaneously.
[CV-115] Sound Sparks Motion: Audio and Text Tuning for Video Editing
Link: https://arxiv.org/abs/2605.15307
Authors: AmirHossein Naghi Razlighi, Aryan Mikaeili, Ali Mahdavi-Amiri, Daniel Cohen-Or, Yiorgos Chrysanthou
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments: Project Page: this https URL
Abstract:Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model’s multimodal conditioning. 
Code and data are available via our project page: this https URL
[CV-116] Deep Pre-Alignment for VLMs ICML2026
Link: https://arxiv.org/abs/2605.15300
Authors: Tianyu Yu, Kechen Fang, Zihao Wan, Kaidong Zhang, Yicheng Zhang, Jun Song, Bo Zheng, Yuan Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026. Project Website: this https URL
Abstract:Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth [zhang-etal-2024-investigating, artzy-schwartz-2024-attend] on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.
[CV-117] ReactiveGWM: Steering NPC in Reactive Game World Models
Link: https://arxiv.org/abs/2605.15256
Authors: Zeqing Wang, Danze Chen, Zhaohu Xing, Zizhao Tong, Yinhan Zhang, Xingyi Yang, Yeying Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL
Abstract:Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action-induced NPC reactions. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses (e.g., Offense, Control, Defense) are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: our learned modules can be plugged directly into off-the-shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain-specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned NPC strategy adherence, paving the way for scalable, strategy-rich interaction with the NPC.
[CV-118] Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation
Link: https://arxiv.org/abs/2605.15231
Authors: Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Nan Li
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 48 pages, 15 figures, journal paper to be submitted
Abstract:Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.
[CV-119] Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features
Link: https://arxiv.org/abs/2605.06475
Authors: Ranjith Chodavarapu
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes, as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model achieves a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves PICP = 92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP = 88.2%, 50 passes) and Deep Ensembles (PICP = 79.7%, 5 models) at $5\times$ lower inference cost. Uncertainty decomposition shows that aleatoric uncertainty is a strong predictor of dating error (Spearman $\rho = 0.729$), and selective prediction on the most certain 20% of patches yields an MAE of 0.5 years. We show that predicted uncertainty increases as image degradation worsens, that spatial decomposition maps explain which script regions cause aleatoric uncertainty, and that page-level aggregation reduces MAE to 4.5 years with $\rho = 0.905$ between uncertainty and page-level error.
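The uncertainty decomposition mentioned in this abstract can be illustrated with the closed-form moments of a Normal-Inverse-Gamma distribution, as commonly used in evidential deep regression. The sketch below is only an assumption about how such a head might be read out; the class name `NIGOutput`, the function `decompose`, and the example numbers are hypothetical, and the paper may define its decomposition differently.

```python
from dataclasses import dataclass


@dataclass
class NIGOutput:
    """Hypothetical container for one NIG head output (not from the paper)."""
    gamma: float  # predicted mean, here a year
    nu: float     # virtual evidence for the mean, must be > 0
    alpha: float  # shape parameter, must be > 1 for finite variance
    beta: float   # scale parameter, must be > 0


def decompose(out: NIGOutput):
    """Return (prediction, aleatoric variance, epistemic variance)."""
    if out.nu <= 0 or out.alpha <= 1 or out.beta <= 0:
        raise ValueError("need nu > 0, alpha > 1, beta > 0")
    aleatoric = out.beta / (out.alpha - 1.0)             # E[sigma^2]
    epistemic = out.beta / (out.nu * (out.alpha - 1.0))  # Var[mu]
    return out.gamma, aleatoric, epistemic


# Illustrative values only; they are not results from the paper.
year, aleatoric_var, epistemic_var = decompose(
    NIGOutput(gamma=1250.0, nu=2.0, alpha=3.0, beta=8.0)
)
```

A larger `nu` (more evidence) shrinks the epistemic term while leaving the aleatoric term unchanged, which is what makes the two separable in a single forward pass.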
[CV-120] DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration CVPR2026
Link: https://arxiv.org/abs/2508.17034
Authors: Jiayi Li,Yuxin Yao,Qiuhang Lu,Juyong Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026, Project page: this https URL
Abstract:Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism consisting of a computationally lightweight one-point RANSAC algorithm and a subsequent refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat the filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method’s effectiveness, as demonstrated by a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. Project page: this https URL.
[CV-121] Layer Selection in Feature-Based Losses Affects Image Quality and Microstructural Consistency in Deep Learning Super-Resolution of Brain Diffusion MRI
Link: https://arxiv.org/abs/2605.15895
Authors: David Lohr,Rene Werner
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Clinical application of high-resolution diffusion MRI is hindered by hardware limitations and prohibitive scan times, motivating computational super-resolution. This study investigates the efficacy of a feature-based loss function in preserving diffusion signal consistency in deep learning super-resolution. Using 7T data from the Human Connectome Project to generate pairs of low- and high-resolution diffusion-weighted images (DWI), we trained UNets for 2D super-resolution. Ablation and isolation studies evaluated different VGG16 layers for feature-based losses against an image-based L1 baseline. Deeper layers and combinations thereof resulted in grid-like artifacts in super-resolution DWIs, which persisted in diffusion parameters like quantitative and fractional anisotropy. No such artifacts were present when using the shallowest layer. Downstream analysis for this layer showed great consistency with the ground truth, even for 9-fold super-resolution. Image SNR and the depth of the VGG16 layers used modulated artifact appearance and severity, mandating careful selection of contributing layers for application in and beyond diffusion MRI.
[CV-122] Evaluation of Anatomical Shape Priors in Deep Learning-Based Cardiac Multi-Compartment Segmentation
Link: https://arxiv.org/abs/2605.15707
Authors: Michael Hudler,Franz Thaler,Martin Urschler
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in the Proceedings of the Third Austrian Symposium on AI, Robotics, and Vision (AIRoV 2026), pp. 23-27
Abstract:Whole-heart multi-compartment CT segmentation is clinically important, but standard CNNs do not explicitly enforce anatomical plausibility. Based on statistics derived from the training data, we evaluate whether lightweight explicit shape priors, implemented as shape-aware losses and spatial label distribution heatmap-guided U-Net variants, improve 3D cardiac segmentation on MM-WHS CT and WHS++. Across all experiments, a standard 3D U-Net surprisingly remained a very strong baseline, with handcrafted priors yielding at best marginal and inconsistent changes and often degrading performance. These results suggest that the baseline already captures substantial implicit anatomical regularities and that future gains will likely require more expressive learned priors rather than simple handcrafted anatomical shape constraints.
[CV-123] Highly Detailed and Generalizable Broadleaf Tree Crown Instance Segmentation from UAV Imagery
Link: https://arxiv.org/abs/2605.15673
Authors: Mitsutaka Nakada(1),Takahiko Ikebata(1),Kengo Ikebata(1),Yuji Mizuno(2),Yusuke Onoda(3),Ryuichi Takeshige(3 and 4),Kyaw Kyaw Htoo(3),Kanehiro Kitayama(3 and 5),Robert Ong(6),Masanori Onishi(1 and 3) ((1) DeepForest Technologies Co., Ltd., (2) YM Lab., (3) Graduate School of Agriculture, Kyoto University, (4) Graduate School of Science, Osaka Metropolitan University, (5) Faculty of Tropical Forestry, Universiti Malaysia Sabah, (6) Forest Research Centre)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 3 Tables
Abstract: We present a highly detailed instance segmentation model for delineating individual tree crowns in natural broadleaf forests using aerial imagery acquired by unmanned aerial vehicles (UAVs). Tree crown delineation in broadleaf forests is more challenging than in other forest types due to the diversity of crown shapes and the lack of clearly defined treetops. To address this issue, we developed a deep-learning-based crown segmentation model trained on high-quality annotated crown outlines. Skilled annotators manually delineated 18,507 crown polygons from orthomosaic images collected across seven forests in Japan, and we developed a model based on Mask2Former with multiple backbone architectures. The best model achieved high segmentation performance in structurally complex broadleaf forests using only RGB imagery. This performance was maintained when applied to geographically distinct forests within Japan, as well as to biologically distinct tropical rainforests in Borneo. These results demonstrate that using a large number of high-quality annotated datasets is critical for achieving detailed and generalizable crown segmentation across diverse forest ecosystems. The developed model has been integrated into DF Scanner Pro, a software package that supports practical forest monitoring using UAVs, and this implementation is expected to enable a wide range of users to analyze tree-level information in broadleaf forests from UAV imagery.
[CV-124] Degradation-Aware Blur-Segmentation of Brain Tumor
Link: https://arxiv.org/abs/2605.15671
Authors: Yuchun Wang,Xiaosong Li,Gefei Liang,Yang Liu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Multimodal 3D MRI brain tumor segmentation is a pivotal step in radiotherapy target delineation, surgical planning and post-treatment assessment. Existing methods often assume artifact-free MRI images. However, inevitable patient motion during scanning introduces artifacts and blur that degrade boundary and texture features, leading to poor segmentation performance. To bridge this gap, we introduce Degradation-Aware Blur-Segmentation Net (DABSeg), a synchronous deblurring 3D multimodal MRI segmentation network that unifies blur removal and accurate segmentation. Specifically, we propose a feature-domain motion-deblurring stem to compensate for blur and rebalance intensity. Concurrently, the backbone network embeds a blur-aware cross-modal cross-attention module and multi-scale residual aggregation to yield effective modality complementarity. Notably, we optimize a joint loss that combines weighted Dice with a clear-reference reconstruction term, where imbalanced weights are applied to small targets to boost learning intensity and predictive stability for small lesions and border regions. Systematic comparisons and ablation experiments on the BraTS2020 dataset under both clear and degraded conditions consistently demonstrate that DABSeg surpasses state-of-the-art methods in tumor Dice score and boundary precision. These results validate the effectiveness of degradation-aware cross-task collaborative learning in improving the robustness and clinical utility of multimodal 3D brain tumor segmentation under realistic degraded conditions. The source code is available at this https URL
[CV-125] TVRN: Invertible Neural Networks for Compression-Aware Temporal Video Rescaling
Link: https://arxiv.org/abs/2605.15579
Authors: Xinmin Feng,Li Li,Dong Liu,Feng Wu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Image Processing
Abstract:To fit diverse display and bandwidth constraints, high-frame-rate videos are temporally downscaled to low-frame-rate (LFR) and later upscaled, requiring joint optimization for effective frame-rate rescaling. However, existing methods typically link the two operations via training objectives, without fully exploiting their reciprocal nature, which may cause high-frequency information loss. Moreover, they overlook the impact of lossy codecs on LFR videos, limiting real-world applicability. In this work, we propose an end-to-end framework for compression-aware frame-rate rescaling, named TVRN. To regularize high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module. To enable end-to-end training through non-differentiable lossy codecs, we design a surrogate network that approximates their gradients. Finally, to improve robustness under various compression levels, we extend TVRN to an asymmetric architecture by incorporating compression-aware features learned via a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods in reconstruction quality under industrial video compression settings. Source code is publicly available at this https URL.
[CV-126] xt-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction
Link: https://arxiv.org/abs/2605.15558
Authors: Hao Yang,Xianping Ma,Peifeng Ma,Man-On Pun
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, submitted to ISPRS JPRS
Abstract: High-resolution remote sensing imagery is critical for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hindered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, resulting in redundant and inefficient delivery. This paper proposes a text-guided remote sensing image transmission system that replaces complete high-resolution data with low-resolution images accompanied by compact textual descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to approximately 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model is introduced, which leverages cross-modal learning to recover fine spatial details and maintain semantic coherence. Experimental results on the Alsat-2B, UC Merced Land Use, and Aerial Image datasets demonstrate that the proposed framework achieves reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient and information-preserving image transfer for remote sensing applications. The implementation will be made publicly available on GitHub (this https URL).
[CV-127] DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems
Link: https://arxiv.org/abs/2605.15456
Authors: Romario Gualdrón-Hurtado,Roman Jacome,Leon Suarez,Henry Arguello
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: 17 pages, 8 figures, 8 tables
Abstract:Solving imaging inverse problems has usually been addressed by designing proper prior models of the underlying signal. However, minimizing the data fidelity term poses significant challenges due to the ill-conditioned sensing matrix caused by physical constraints in the acquisition system. Thus, preconditioning techniques have been adopted in classical optimization theory to address ill-conditioned data-fidelity minimization by transforming the algorithm gradient step to achieve faster convergence and better numerical stability. We extend the preconditioning concept beyond convergence acceleration and use it to improve reconstruction quality. We introduce DIPA: Distilled Preconditioned Algorithms, where a preconditioning operator (PO) is optimized using teacher-guided distillation criteria. Unlike standard model-compression KD, the teacher and student differ by the sensing operators available during reconstruction: the teacher uses a simulated, better-conditioned, and more informative sensing matrix, whereas the student uses the physically feasible sensing matrix. We design different distillation loss functions to transfer different properties of the teacher algorithm to the preconditioned student. The PO can be linear (L-DIPA), allowing interpretability, or non-linear (N-DIPA), parametrized by a neural network, offering better scalability. We validate the proposed PO design across several imaging modalities, including magnetic resonance imaging, compressed sensing, and super-resolution imaging.
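To make the classical idea behind this abstract concrete, the sketch below runs gradient descent on a least-squares data-fidelity term with and without a preconditioner. It is only a toy illustration of textbook preconditioning on a hand-picked 2x2 ill-conditioned matrix; it does not implement DIPA's distilled operator, and the helper names (`matvec`, `preconditioned_gd`) and all numbers are assumptions made for this example.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def preconditioned_gd(A, y, P, step, iters):
    """Minimise 0.5 * ||A x - y||^2 with updates x <- x - step * P A^T (A x - y)."""
    x = [0.0] * len(A[0])
    At = transpose(A)
    for _ in range(iters):
        residual = [a - b for a, b in zip(matvec(A, x), y)]
        grad = matvec(At, residual)
        direction = matvec(P, grad)  # the preconditioner reshapes the gradient step
        x = [xi - step * di for xi, di in zip(x, direction)]
    return x


# Ill-conditioned toy sensing matrix; y is generated by x_true = [1, 1].
A = [[1.0, 0.0], [0.0, 0.1]]
y = [1.0, 0.1]
identity = [[1.0, 0.0], [0.0, 1.0]]   # no preconditioning
inv_gram = [[1.0, 0.0], [0.0, 100.0]]  # (A^T A)^{-1}, the ideal linear preconditioner

x_plain = preconditioned_gd(A, y, identity, step=1.0, iters=50)
x_pre = preconditioned_gd(A, y, inv_gram, step=1.0, iters=1)
```

With the identity, the poorly sensed second coordinate is still far from 1 after 50 iterations, while the ideal preconditioner recovers it in a single step. In practice the ideal operator is rarely available, which is the gap a learned operator is meant to fill.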
[CV-128] Frequency-domain Event-based Imaging for Selective Surveillance
Link: https://arxiv.org/abs/2605.15392
Authors: Megan Birch,James Rick,Adrish Kar,Jason Zutty,Joseph L. Greene
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 11 figures
Abstract: Event-based cameras (EBCs) are an attractive sensing modality for surveillance due to their reporting of pixel-level radiance changes with microsecond resolution and high dynamic range, enabling motion extraction while suppressing background. Their asynchronous, sparse output, however, necessitates algorithms that identify targets in event-space without processing full frames. We introduce Frequency Rate Information for Event Space (FRIES), a neuromorphic processing framework that detects periodicity in events, such as rotor rotation and mechanical vibrations, to discriminate and monitor man-made objects. FRIES first applies a time gate to suppress background and noise, then aggregates events into a pixel-wise activity (e.g., density) map and clusters pixels into regions-of-interest (ROIs). A localized spectral analysis is applied to each ROI to extract dominant frequencies used to distinguish structured object signatures from unstructured background and noise. Discriminated targets are visualized using a Resonant Time Surface (RTS), a frequency-selective method that weights events by their phase coherence with the extracted frequencies, rewarding in-sync content and suppressing out-of-sync clutter. We demonstrate FRIES and RTS in a controlled indoor experiment to recover the rotational frequency of a mechanical chopper and drone rotors against a moving background. We further test these methods on outdoor data to detect a hovering drone against a realistic treeline. These preliminary results establish frequency-domain event processing as a promising front-end for selective surveillance in neuromorphic pipelines and a complementary surveillance modality, leveraging the high temporal resolution to enable spectral discrimination.
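The "localized spectral analysis" step described in this abstract can be pictured with a very small example: bin the event timestamps from one region of interest into a rate signal and pick the strongest non-DC frequency. This is a simplified sketch under stated assumptions, not the FRIES algorithm; the function name `dominant_frequency_hz`, the integer-microsecond timestamps, the bin width, and the synthetic 50 Hz burst pattern are all hypothetical.

```python
import cmath
import math


def dominant_frequency_hz(timestamps_us, duration_us, bin_us):
    """Bin event timestamps (integer microseconds) and return the strongest non-DC frequency."""
    n_bins = duration_us // bin_us
    counts = [0.0] * n_bins
    for t in timestamps_us:
        idx = t // bin_us
        if 0 <= idx < n_bins:
            counts[idx] += 1.0
    mean = sum(counts) / n_bins
    centered = [c - mean for c in counts]  # remove the DC component

    best_k, best_power = 0, 0.0
    for k in range(1, n_bins // 2 + 1):  # naive DFT up to the Nyquist bin
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * n / n_bins)
            for n, c in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k / (n_bins * bin_us * 1e-6)


# Synthetic ROI: a burst of 5 events every 20 ms for one second (a 50 Hz source).
events_us = [i * 20_000 + j * 500 for i in range(50) for j in range(5)]
freq = dominant_frequency_hz(events_us, duration_us=1_000_000, bin_us=1_000)
```

On this synthetic input the estimate comes out at about 50 Hz. A real pipeline would likely use an FFT library and handle noise and multiple peaks, which this sketch ignores.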
[CV-129] From Full and Partial Intraoral Scans to Crown Proposal: A Classification-Guided Restoration Assistance Pipeline
Link: https://arxiv.org/abs/2605.15241
Authors: Rabin Kunwar,Dikshya Parajuli,Rujal Acharya,Romik Gosai,Prince Panta,Kundan Siwakoti,Shuvangi Adhikari,Saugat Kafley,Louis Digiorgio,Amit Regmi,Akio Tanaka,Masahiko Inada,Yuriko Komagamine,Kennta Kashiwazaki,Manabu Kanazawa
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Single-unit crown restoration is among the most common procedures in clinical dentistry, with CAD/CAM workflows now designing crowns directly from intraoral scans. Partial scans are often preferred over full-arch scans for single-unit cases due to fewer stitching errors, yet most segmentation networks trained on full arches fail on partial scans, while end-to-end generative crown methods often produce over-smoothed surfaces that lose occlusal detail. We propose an end-to-end pipeline that takes a raw intraoral scan and target FDI tooth number as input and outputs an initial, patient-specific crown proposal for clinician refinement. The pipeline has three phases: (I) data preparation and pose standardization; (II) segmentation routed by scan type; and (III) crown proposal generation via context-aware retrieval and Blender-based fitting. We address partial-scan segmentation through a classify-then-align strategy: a DGCNN classifier categorizes the scan into one of five anatomical types, then coarse-to-fine RANSAC+ICP registration standardizes the jaw coordinate frame, followed by graph-cut optimization to refine tooth-gingival boundaries. Trained on 1,958 partial scans, the pipeline achieves macro-average DSC 0.9249, Recall 0.8919, and Precision 0.9615 across 17 semantic classes; a fine-tuned full-arch model reaches DSC 0.9347. The prepared tooth and its mesial and distal neighbors achieve DSC 0.9468-0.9569 with sub-millimeter Centroid Errors (0.2666-0.2774 mm). These centroids anchor a retrieval module using DGCNN embeddings and cosine similarity over neighboring and opposing teeth, followed by spline-guided alignment and Blender Python API refinement. The pipeline produces a preliminary crown shell in 2.5-3.5 minutes, offering a practical alternative to end-to-end generative approaches.
Artificial Intelligence
[AI-0] Designing Datacenter Power Delivery Hierarchies for the AI Era
Link: https://arxiv.org/abs/2605.16255
Authors: Grant Wilkins,Fiodar Kazhamiaka,Alok Gautam Kumbhare,Chaojie Zhang,Ricardo Bianchini
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract: Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1MW per deployment by 2027. This poses a major challenge for datacenter power delivery designers. As power densities increase, a datacenter designed for a different target density may strand power, i.e., may be unable to use all the power that its delivery hierarchy has provisioned. Designs must remain efficient over long datacenter lifetimes and multiple hardware generations. Power utilization is particularly important as grid power capacity is a scarce resource in the AI era. Designing an efficient power delivery hierarchy for the long run is difficult because rack placement feasibility, workload impact, and cost depend jointly on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix. Moreover, each of these factors evolves over time, has inter-dependencies across multiple resource dimensions, and generally does not lend itself to closed-form analysis. To address this challenge, we develop a framework for evaluating datacenter power delivery designs using throughput, power, and cost metrics over realistic arrival, oversubscription, and decommissioning sequences. The framework combines projection models for GPU, compute, and storage deployments with operational factors grounded in production data from Microsoft Azure. Our results show that multi-resource stranding materially changes deployable capacity, effective capital expenditure, and delivered performance, and quantify how rising density from rack- and pod-scale AI systems shapes these outcomes. For AI datacenter design, the relevant planning objective is not installed megawatts, but deployable capacity over time.
[AI-1] Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
Link: https://arxiv.org/abs/2605.16238
Authors: Sarah Martinson,Michael P. Brenner,Martyna Plomecka,Brian P. Williams,Nicholas G. Reich,Zahra Shamsi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce “cold start” scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
[AI-2] Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
Link: https://arxiv.org/abs/2605.16198
Authors: Parand A. Alamdari,Toryn Q. Klassen,Sheila A. McIlraith
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:
Abstract:We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs’ temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.
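As a rough illustration of what offline auditing against a temporally extended constraint can look like, the sketch below evaluates a small LTL fragment over a finite, non-empty trace of labelled steps. It is a toy evaluator under finite-trace semantics, not the paper's technique; the tuple encoding of formulas, the function `holds`, and the proposition names `delete_requested` and `confirmed` are assumptions made for this example.

```python
def holds(formula, trace, i=0):
    """Evaluate a formula on a finite, non-empty trace (list of sets of propositions) from step i."""
    op = formula[0]
    if op == "ap":          # atomic proposition labelled on the current step
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "implies":
        return (not holds(formula[1], trace, i)) or holds(formula[2], trace, i)
    if op == "always":      # G phi: phi holds at every remaining step
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "eventually":  # F phi: phi holds at some remaining step
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")


# G(delete_requested -> F confirmed): every delete request is eventually confirmed.
spec = ("always", ("implies", ("ap", "delete_requested"),
                   ("eventually", ("ap", "confirmed"))))

compliant_trace = [{"delete_requested"}, set(), {"confirmed"}]
violating_trace = [{"confirmed"}, {"delete_requested"}, set()]

compliant = holds(spec, compliant_trace)
violating = holds(spec, violating_trace)
```

Here `compliant` is true and `violating` is false, because in the second trace the confirmation precedes the request. In a setting like the one the abstract describes, the per-step proposition labels would come from a labeler applied to the system's outputs, which this sketch takes as given.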
[AI-3] An Algebraic Exposition of the Theory of Dyadic Morality
Link: https://arxiv.org/abs/2605.16153
Authors: Kush R. Varshney
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment grounded in a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (typecasting operator, completion operator, and valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address scalability challenges arising from TDM’s dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Drawing on this algebraic framework, we demonstrate concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies to preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception over universal averaging to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is both mathematically rigorous and faithful to human moral cognition.
[AI-4] Property-Guided LLM Program Synthesis for Planning
Link: https://arxiv.org/abs/2605.16142
Authors: Augusto B. Corrêa,André G. Pereira,Jendrik Seipp
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract: LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads a hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
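The property check with early stopping can be sketched on a tiny explicit state space. The code below assumes that lower heuristic values are better and that goal states are exempt from the requirement; both are this example's reading of the abstract, not details confirmed by it. The function `find_violation` and the line-world task are hypothetical.

```python
from collections import deque


def find_violation(initial, successors, h, is_goal):
    """Return the first reachable non-goal state with no strictly improving successor, else None."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_goal(state):
            continue
        improving = [t for t in successors(state) if h(t) < h(state)]
        if not improving:
            return state  # counterexample: stop early and report it
        for t in improving:  # only follow strictly improving transitions
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return None


# Toy task: positions 0..5 on a line, the goal is position 5.
def successors(s):
    return [t for t in (s - 1, s + 1) if 0 <= t <= 5]


def is_goal(s):
    return s == 5


def good_h(s):
    return 5 - s       # distance to the goal


def bad_h(s):
    return abs(3 - s)  # wrongly treats position 3 as the target


ok = find_violation(0, successors, good_h, is_goal)
counterexample = find_violation(0, successors, bad_h, is_goal)
```

For `good_h` no violation is found, while `bad_h` gets stuck at state 3, which is the kind of concrete counterexample that could be handed back to the LLM in a repair loop.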
[AI-5] Surrogate Neural Architecture Codesign Package (SNAC-Pack)
Link: https://arxiv.org/abs/2605.16138
Authors: Jason Weitz,Dmitri Demler,Benjamin Hawks,Aaron Wang,Nhan Tran,Javier Duarte
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
Comments: 15 Pages, 3 Figures, AutoML (International Conference on Automated Machine Learning) 2026
Abstract:Neural architecture search (NAS) is a powerful approach for automating model design, but existing methods often optimize for accuracy alone or rely on proxy metrics such as bit operations (BOPs) that correlate poorly with hardware cost. This gap is particularly large for FPGA deployment, where cost is dominated by a multi-dimensional budget of lookup tables, DSPs, flip-flops, BRAM, and latency. We present the Surrogate Neural Architecture Codesign Package (SNAC-Pack), an open-source AutoML framework for hardware-aware neural architecture codesign and end-to-end FPGA deployment. SNAC-Pack runs a multi-objective global search with Optuna and NSGA-II, loading trials to a shared SQLite store that enables parallel workers across compute nodes. A hardware surrogate model outputs per-trial resource and latency estimates, avoiding the synthesis cost that would otherwise dominate the search loop. A local search stage then applies quantization-aware training (QAT) together with iterative magnitude pruning in a combined compression loop, after which the final model is synthesized to FPGA firmware via the hls4ml Python library. A YAML configuration and an optional agentic frontend let users run the pipeline on new datasets without modifying the framework. We demonstrate SNAC-Pack on jet classification at the Large Hadron Collider and superconducting qubit readout, discovering compact architectures that match or exceed strong baselines on the task metric while reducing FPGA resource utilization and, in the qubit readout case, reducing the design space exploration process from months of manual fine-tuning to hours of automated search.
[AI-6] Navigating Potholes with Geometry-Aware Sharpness Minimization
Link: https://arxiv.org/abs/2605.16134
Authors: Simon Dufort-Labbé,Mehrab Hamidi,Razvan Pascanu,Ioannis Mitliagkas,Damien Scieur,Aristide Baratin
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but treats all parameter directions uniformly, ignoring the underlying loss geometry. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation then operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is not merely a computational convenience: theoretically, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Empirically, LLQR+SAM gives consistent gains over both SAM and LLQR alone across standard vision and sequence modeling benchmarks, supporting the view that slow learned geometry and fast sharpness correction are genuinely complementary.
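For readers unfamiliar with the base method, the sketch below shows one step of the standard SAM update: perturb the weights along the normalized gradient with radius `rho`, then descend using the gradient at the perturbed point. It is plain SAM on a toy quadratic and does not include the LLQR preconditioner; the function names, the toy loss, and the hyperparameter values are assumptions made for illustration.

```python
import math


def sam_step(w, grad_fn, lr, rho):
    """One standard SAM update on a list of parameters."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step: move to the approximate worst case inside a ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quad_grad(w):
    """Gradient of the toy loss L(w) = w0^2 + 0.25 * w1^2."""
    return [2.0 * w[0], 0.5 * w[1]]


w_next = sam_step([1.0, 0.0], quad_grad, lr=0.1, rho=0.5)
```

Starting from `[1.0, 0.0]`, the perturbed point is `[1.5, 0.0]`, so the step uses a gradient of 3.0 instead of 2.0 and moves the first coordinate to about 0.7. A geometry-aware variant would presumably reshape the perturbation or the step with a learned matrix, which is outside this sketch.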
[AI-7] Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schrödinger Samplers
链接: https://arxiv.org/abs/2605.16126
作者: Bruno Trentini,Dejan Stancevic,Michael M. Bronstein,Alexander Tong,Luca Ambrogioni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Statistics Theory (math.ST); Other Statistics (stat.OT)
备注:
Abstract:For a fixed flow-based generative model under a small inference budget, sample quality can depend strongly on where the sampler spends its few function evaluations. Flow matching and Schrödinger bridges define probability paths, yet their inference grids are usually heuristic or inherited from one-endpoint diffusion. We derive a conditional-marginal entropy-rate objective for bridge-aware discretization, separating endpoint-conditioned bridge geometry from marginal flow evolution, and use it to build a training-free entropic inference-time scheduler from first principles. For Gaussian Brownian bridges this rate is closed-form and U-shaped, motivating boundary-heavy nonuniform grids. On trained two-dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10-step ODE-Heun MMD over linear by 18.1%, with a paired 22.7% SDE-Heun improvement in the same low-NFE sweep. On EDM/CIFAR-10, the entropic time-discretization gives the best tested five-step FID (186.3 ± 4.0 versus 200.5 ± 2.9 for linear and 238.0 ± 5.3 for cosine). On AlphaFlow protein generation, entropic conditional-marginal (cond-marg) scheduling shows an advantage in low-NFE regimes on both the CAMEO22 and ATLAS benchmarks. These results support entropy-rate scheduling as a practical low-budget allocation signal for high-dimensional bridge and flow samplers.
[AI-8] ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
链接: https://arxiv.org/abs/2605.16116
作者: Chinmay Savadikar,Mingyu Zhao,Yuanzheng Zhu,Han Li,Shuang Xie,Alberto Castelo,Tianfu Wu,Lingyun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 10 figures
Abstract:Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents, which constructs e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop’s catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.
[AI-9] Attention Dispersion in Dynamic Graph Transformers: Diagnosis and a Transferable Fix
链接: https://arxiv.org/abs/2605.16112
作者: Jinhao Zhang,Kangfei Zhao,Qiuhao Zeng,Long-Kai Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer-based architectures have become the dominant paradigm for Continuous-Time Dynamic Graph (CTDG) learning, yet their performance remains limited on temporally shifted datasets. In this work, we identify attention dispersion as a shared failure mode of dynamic graph Transformers under temporal distribution shift. Through controlled ablation contrasting structurally and temporally distinguished historical neighbors against random ones, we show that prediction depends on a class of critical nodes that carry consistently more predictive signal than arbitrary neighbors. However, existing Transformers fail to focus on these nodes even when they are present in the input, as temporal shift weakens attention contrast and produces overly dispersed attention distributions. This diagnosis suggests a simple and transferable fix: replace standard attention with differential attention, which suppresses common-mode attention and amplifies distinctive token-level signals. When added to three representative CTDG Transformer baselines, differential attention consistently improves performance, with gains concentrated on high-shift datasets. Attention-level measurements further confirm the mechanism, showing reduced attention entropy and increased attention mass on critical nodes. Building on these findings, we introduce DiffDyG, a reference implementation combining differential attention with standard input encodings. Across 9 benchmarks and three negative sampling protocols, DiffDyG achieves SOTA performance, with especially large gains on the most shifted datasets.
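As an illustration of how differential attention suppresses common-mode attention, the sketch below computes the weights for a single query as the difference of two softmax maps. It is a simplified, assumed form: the fixed `lam` value, the single-query setup, and the toy keys are illustrative, and the normalization and learned scaling used in full implementations are omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diff_attention_weights(q1, q2, keys1, keys2, lam=0.5):
    """Weights for one query: softmax(q1 . K1) - lam * softmax(q2 . K2)."""
    d = len(q1)
    a1 = softmax([dot(q1, k) / math.sqrt(d) for k in keys1])
    a2 = softmax([dot(q2, k) / math.sqrt(d) for k in keys2])
    return [x - lam * y for x, y in zip(a1, a2)]

def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Toy case: neighbor 0 is distinctive under the first map,
# while the second map is uniform (pure common-mode attention).
keys = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
plain = softmax([dot([1.0, 0.0], k) / math.sqrt(2) for k in keys])
diff = diff_attention_weights([1.0, 0.0], [0.0, 0.0], keys, keys, lam=0.5)
share_plain = plain[0]
share_diff = diff[0] / sum(diff)
```

In this toy case the share of attention mass on the distinctive neighbor is larger after subtraction than under plain softmax, which matches the concentration effect the abstract describes. Individual weights can become slightly negative, as they do here for the two uninformative neighbors.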
[AI-10] Sign-Separated Finite-Time Error Analysis of Q-Learning
链接: https://arxiv.org/abs/2605.16103
作者: Donghwan Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation, the error is decomposed into its componentwise negative and positive parts. The negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system. The resulting bounds show that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope. The analysis identifies a max-induced asymmetry in Q-learning error dynamics. This asymmetry is connected to overestimation: positive action-wise errors can be selected and propagated by the Bellman maximum, whereas negative errors admit an optimal-policy lower comparison. Finite-time bounds are provided for both deterministic and stochastic constant-step-size recursions.
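The objects in this analysis can be sketched in a few lines: the constant step-size tabular update and the componentwise split of the error into positive and negative parts. The table layout and the `Q_star` values below are illustrative assumptions; the sketch shows only the decomposition, not the paper's comparison systems or bounds.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One constant step-size Q-learning update on a dict-of-lists table."""
    target = r + gamma * max(Q[s_next])  # Bellman maximum over next actions
    Q[s][a] += alpha * (target - Q[s][a])

def sign_split(Q, Q_star):
    """Componentwise positive and negative parts of the error Q - Q_star."""
    pos = {s: [max(q - qs, 0.0) for q, qs in zip(Q[s], Q_star[s])] for s in Q}
    neg = {s: [min(q - qs, 0.0) for q, qs in zip(Q[s], Q_star[s])] for s in Q}
    return pos, neg

Q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
q_update(Q, s=0, a=1, r=1.0, s_next=1)

# Hypothetical optimal values, chosen only to exercise both signs.
Q_star = {0: [0.5, 0.0], 1: [0.0, -1.0]}
pos, neg = sign_split(Q, Q_star)
```

By construction `pos[s][a] + neg[s][a]` equals the error `Q[s][a] - Q_star[s][a]` in every entry, so the two parts can be bounded separately and recombined.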
[AI-11] Federated Imputation under Heterogeneous Feature Spaces
链接: https://arxiv.org/abs/2605.16099
作者: Imane Hocine,Chaimaa Medjadji,Sylvain Kubler,Gregoire Danoy,Yves Le Traon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) enables collaborative training across decentralized clients, but most methods assume aligned feature schemas, an assumption that rarely holds in tabular settings where clients observe only partially overlapping feature subsets. In these heterogeneous feature spaces, parameter-averaging methods (e.g., FedAvg) transfer little information across weakly overlapping or disjoint feature groups, limiting their effectiveness for federated imputation. To overcome this, we propose FedHF-Impute, a federated imputation framework that separates structural feature unavailability from conventional missingness and uses a shared global feature graph to propagate information across statistically related features through message passing. This enables indirect cross-client knowledge transfer, even when features are never jointly observed locally, while preserving standard federated communication. Under simulated partial schema overlap on the SECOM and AirQuality datasets, FedHF-Impute improves imputation accuracy (RMSE) over FL baselines by 26.9% and 8.4%, respectively, while achieving comparable performance on PhysioNET, with only a 0.3% difference relative to the best baseline.
[AI-12] GeoGS-CE: Learning Delay–Beam Channel Priors with 3D Gaussians for High-Mobility Scenarios
链接: https://arxiv.org/abs/2605.16094
作者: Yumeng Zhang,Jiajia Guo,Chaozheng Wen,Chenghong Bian,Jun Zhang
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:
Abstract:Wideband channel estimation (CE) in high-mobility scenarios remains challenging because channel responses vary rapidly, while practical systems can allocate only sparse pilots to accommodate dense users. Fortunately, many high-mobility environments, such as high-speed railways, exhibit scheduled trajectories, predictable velocities, and a limited number of dominant propagation paths. These properties induce a delay–beam power spectrum that is more stable than the instantaneous complex channel frequency response (CFR), less sensitive to the random phase coherence, and rich in geometric information. To exploit such environmental properties, we propose GeoGS-CE, a two-stage channel estimation framework for sparse-pilot high-mobility scenarios. In the offline stage, GeoGS-CE jointly models: 1) a scene-level 3D Gaussian representation that captures the non-line-of-sight (NLoS) geometric scattering support, and 2) a leakage-aware differentiable wireless rendering process that maps the NLoS Gaussians, together with an explicit virtual line-of-sight (LoS) component, to the measured delay–beam power spectrum, while accounting for practical OFDM delay and array leakage effects. In the online stage, the delay–beam power spectrum is predicted for each user location and used as a strong covariance prior, enabling accurate full-band and full-array CFR reconstruction and tracking through a linear MMSE estimator. Simulations based on channels generated from a segment of the Guangshen high-speed railway show that the proposed geometric prior substantially improves CFR reconstruction over pilot-only and non-geometric baselines.
[AI-13] Centralized vs Decentralized Federated Learning: A trade-off performance analysis
链接: https://arxiv.org/abs/2605.16089
作者: Chaimaa Medjadji,Guilain Leduc,Sylvain Kubler,Yves Le Traon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) has emerged as a promising paradigm for collaborative model training across distributed edge devices while preserving data privacy, especially given the rapid growth in data driven by the growing number of IoT devices. Storing this amount of data centrally is challenging due to issues like limited communication, privacy, and regulations. FL can be Centralized (CFL), Decentralized (DFL), or Semi-decentralized (SDFL). Choosing the right FL architecture depends on the application’s needs. However, very few studies have experimentally compared these three architectures to understand not only their respective strengths and limitations but also the trade-offs between different performance indicators. This paper addresses this gap through experimental analyses using the Fedstellar simulator, the MNIST dataset, and an MLP classifier.
[AI-14] Multi-level Self-supervised Pretraining on Compositional Hierarchical Graph for Molecular Property Prediction
链接: https://arxiv.org/abs/2605.16088
作者: Xiayu Liu,Zhengyi Lu,Hou-biao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures
Abstract:Self-supervised pretraining on molecular graphs has emerged as a promising approach for molecular property prediction, yet most existing methods operate at a single structural granularity and treat bond information as auxiliary edge attributes rather than as an independent semantic layer. In this work, we propose MolCHG, a multi-level self-supervised pretraining framework built upon a novel Compositional Hierarchical Graph that organizes molecular structure into four types of nodes across three semantic levels. By introducing a bond graph that operates in parallel with the atom graph, our architecture elevates bond-level information to independently evolving node representations, enabling fragment nodes to aggregate atom-level and bond-level semantics on an equal footing. We design three level-specific pretraining objectives: an atom-bond cross-view contrastive task that aligns the atom-view and bond-view representations within each fragment, a fragment-level functional group prediction task to inject domain-relevant chemical knowledge, and graph-level structure prediction tasks to encode global molecular topology. Experiments on nine MoleculeNet benchmarks demonstrate that MolCHG achieves the best performance on seven datasets across both classification and regression tasks, remaining competitive with the strongest baselines on the rest. Ablation studies further confirm that the multi-level supervision signals are complementary and that each component contributes to the overall performance.
[AI-15] Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment ITSC2026
链接: https://arxiv.org/abs/2605.16087
作者: Till Beemelmanns,Shayan Sharifi,Manas Mehrotra,Ayushman Choudhuri,Lutz Eckstein
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted for publication at IEEE ITSC 2026
Abstract:Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust, offers faithful explainability, and provides calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at this https URL .
[AI-16] Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks ESWC2026
链接: https://arxiv.org/abs/2605.16085
作者: Jingcheng Wu,Ratan Bahadur Thapa,Mojtaba Nayyeri,Lucas Etteldorf,Max Finkenbeiner,Fabian Leeske,Steffen Staab
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures, 4 tables. Preprint of a paper accepted at the 1st Workshop on Extraction from Triplet Text-Table-Knowledge Graph and associated Challenge (TRIPLET), co-located with ESWC 2026
Abstract:Relational databases store much of the world’s structured information, and they are essential for driving complex predictive applications. However, deep learning progress on relational data remains limited, as conventional approaches flatten databases into single tables via manual feature engineering, discarding relational context. Relational deep learning (RDL) addresses this by modeling databases as relational entity graphs (REGs) for graph neural networks (GNNs), but remains task- and database-specific. To combine the strengths of both paradigms, we propose a hybrid architecture combining a fine-tuned BART encoder to capture intra-row semantics with a GraphSAGE-based GNN over REGs to inject relational context. Experiments on RelBench show that the GNN substantially enriches BART’s row embeddings, achieving a ROC-AUC of 67.40 on the driver-dnf task from the rel-f1 dataset. This performance is competitive with supervised baselines such as LightGBM (68.86) and narrows the gap to RDL (72.62) to within 5.22 points, though a substantial gap remains to state-of-the-art foundation models such as KumoRFM (82.63). These results suggest that lightweight hybrid LM-GNN architectures offer a promising and resource-efficient path towards foundation models for relational databases.
[AI-17] Misspecified Explore-then-Exploit Leads to Supra-Competitive Prices
链接: https://arxiv.org/abs/2605.16064
作者: Jackie Baek,Vivek F. Farias,Farrell Wu
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注:
Abstract:We study whether simple algorithmic pricing systems can systematically produce collusive-like prices in multi-firm markets. We consider firms using an explore-then-exploit pipeline: they randomize prices during an initial exploration phase, then estimate demand from their own historical data and set prices myopically thereafter. The estimation step relies on a misspecified, monopoly-style model that omits competitors’ prices. We characterize when this pipeline converges to supra-competitive prices above the Nash equilibrium, via a fluid-limit ordinary differential equation analysis. We show that supra-competitive prices arise when firms explore within similar price ranges on the same side of the Nash price. Moreover, prices can be substantially above the Nash price; we show that prices can reach monopoly levels under symmetric exploration. Simulations calibrated to a real multifamily rental market confirm that supra-competitive outcomes arise robustly beyond our theoretical assumptions, including under finite horizons, heterogeneous products, and nonlinear logit demand.
[AI-18] Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making ICLR2026
链接: https://arxiv.org/abs/2605.16054
作者: Fan Feng,Selena Ge,Minghao Fu,Zijian Li,Yujia Zheng,Zeyu Tang,Yingyao Hu,Biwei Huang,Kun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:Recent work has framed decision-making as a sequence modeling problem using generative models such as diffusion models. Although promising, these approaches often overlook latent factors that exhibit evolving dynamics, elements that are fundamental to environment transitions, reward structures, and high-level agent behavior. Explicitly modeling these hidden processes is essential for both precise dynamics modeling and effective decision-making. In this paper, we propose a unified framework that explicitly incorporates latent dynamic inference into generative decision-making from minimal yet sufficient observations. We theoretically show that under mild conditions, the latent process can be identified from small temporal blocks of observations. Building on this insight, we introduce Ada-Diffuser, a causal diffusion model that learns the temporal structure of observed interactions and the underlying latent dynamics simultaneously, and furthermore, leverages them for planning and control. With a modular design, Ada-Diffuser supports both planning and policy learning tasks, enabling adaptation to latent variations in dynamics, rewards, and latent actions. Experiments on simulated control and robotic benchmarks demonstrate its effectiveness in accurate latent inference and adaptive policy learning.
[AI-19] Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification
链接: https://arxiv.org/abs/2605.16048
作者: Mónika Farsang,Ramin Hasani,Daniela Rus,Radu Grosu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:State Space Models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence - reusing the same block repeatedly across layers, as recently applied in looped transformers - has not been explored in this model family. We show that a looped SSM with k parameters iterated L times consistently closely matches or outperforms a standard SSM with k · L independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Since the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results demonstrate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.
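The two ideas in this abstract, reusing one block across depth and concatenating timesteps, can be sketched generically. The helper names are hypothetical, the block is an arbitrary callable rather than an actual SSM layer, and dropping a ragged tail is an assumption made here for simplicity.

```python
def concat_timesteps(seq, c):
    """Group c consecutive timesteps into one wider step.

    seq is a list of per-timestep feature lists. A trailing remainder shorter
    than c is dropped, which is an assumption of this sketch.
    """
    usable = len(seq) - len(seq) % c
    return [sum(seq[i:i + c], []) for i in range(0, usable, c)]

def looped(block, x, loops):
    """Apply one shared block repeatedly instead of stacking distinct blocks."""
    for _ in range(loops):
        x = block(x)
    return x

# A length-5 sequence with 1 feature becomes length 2 with 2 features.
reshaped = concat_timesteps([[1], [2], [3], [4], [5]], c=2)

# One block with fixed parameters, iterated three times.
y = looped(lambda v: 0.5 * v + 1.0, 0.0, loops=3)
```

The looped form has the parameter count of a single block regardless of `loops`, which is the sense in which it is a restricted special case of a stack of independent blocks.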
[AI-20] XSearch: Explainable Code Search via Concept-to-Code Alignment ISSTA2026
链接: https://arxiv.org/abs/2605.16046
作者: Yiming Liu,Ruofan Liu,Yun Lin,Zicong Zhang,Weiyu Kong,Pengnian Qi,Xiao Cheng,Weinan Zhang,Qianxiang Wang,Linpeng Huang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to ISSTA 2026
Abstract:Semantic code search has been widely adopted in both academia and industry. These approaches embed natural-language queries and code snippets into a shared embedding space and retrieve results based on vector similarity. Despite strong performance on benchmark datasets, they often suffer from poor explainability and generalization. Retrieved code may appear semantically similar yet miss critical functional requirements of the query, while providing no explanation of why the result was retrieved. Moreover, such failures become more severe under distribution shift, where models struggle to generalize to unseen benchmarks. In this work, we propose XSearch, an intrinsically explainable code search framework. Our key insight is that by relying on global embedding similarity, existing retrievers inherently take an inductive view. They learn statistical patterns rather than truly understanding the query’s functional requirements. We address this problem by reformulating code search as a deductive concept alignment problem. XSearch (i) identifies functional concepts in the query and (ii) explicitly aligns them with corresponding code statements. This explain-then-predict design produces inherent concept-level explanations and mitigates shortcut learning that harms out-of-distribution generalization. We train an encoder with explicit concept-alignment objectives and perform retrieval through explicit matching between query concepts and code statements. Experiments show that, trained on CodeSearchNet using GraphCodeBERT (125M parameters), XSearch improves performance on out-of-distribution benchmarks from 0.02 to 0.33 (15x) over eight state-of-the-art retrievers, and consistently outperforms both encoder- and decoder-based baselines with up to 7B parameters. A user study demonstrates that concept-alignment explanations enable users to evaluate retrieved results faster and more accurately.
[AI-21] Learning Sim-Grounded Policies for Bimanual Rope Manipulation from Human Teleoperation Data ICRA2026
链接: https://arxiv.org/abs/2605.16043
作者: Gina Wigginghaus,Tim Missal,Berk Guler,Simon Manschitz,Jan Peters
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to the Beyond Teleoperation Workshop at ICRA 2026, 5 pages, 2 figures
Abstract:Deformable Linear Objects (DLOs) such as ropes and cables are widely encountered in both household and industrial applications, yet remain challenging to manipulate due to their infinite-dimensional configuration space and frequent self-occlusion. Imitation learning from teleoperation offers a practical path to bimanual DLO manipulation, but its scalability is limited by human effort, making the choice of observation space critical for generalization from small datasets. In this study, we investigate whether the lack of generalization in egocentric visual policies for the knot-untangling task stems from the observation space itself, rather than from the policy architecture or data scale. We compare two Action Chunking with Transformers policies trained on the same bimanual teleoperation data: a vision-based policy conditioned on two egocentric RGB streams from wrist-mounted cameras, and a state-based policy conditioned on the DLO’s 3D particle state, extracted from an initial observation via multi-view fusion and evolved in a particle-based eXtended Position-Based Dynamics simulation. Evaluated open-loop on an unseen rope configuration, the state-based policy outperforms its visual counterpart with a 30.8% reduction in L1 error when predicting the initial grasp-and-pull action, quantifying the observability gap between pixels and physics-consistent state, and pointing toward more data-efficient robot learning for the DLO manipulation task from limited human demonstrations.
[AI-22] ScreenSearch: Uncertainty-Aware OS Exploration
链接: https://arxiv.org/abs/2605.16024
作者: Michael Solodko,Justin Wagle
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 9 figures, 4 tables
Abstract:Desktop GUI agents operate under partial observability: visually similar screens can correspond to different underlying workflow states, so locally plausible actions can lead to sharply different outcomes. We frame this as a problem of computer/OS state exploration, where effective behavior requires both expanding the reachable frontier and reducing ambiguity before committing. We present ScreenSearch, a system that combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit for large-scale desktop exploration. The retrieval layer converts UIA trees into location-aware structural features, indexes related screens through sparse token search and metadata filters, and maintains a shared deduplicated state graph across VM workers. On top of this graph, we define a scalable ambiguity signal based on matched-action outcome dispersion. If similar screens produce different next states under the same action signature, the state should be probed further rather than treated as resolved. We use this signal together with frontier rewards to drive large-scale exploration and replay-start policy evaluation over the shared graph. Across 11 desktop applications, ScreenSearch collects over 1M screenshots and over 30K deduplicated states, yielding large exploration corpora with substantial cross-application and within-application diversity. On a fixed replay-start slice, we observe a clear novelty–ambiguity trade-off: some policies reduce ambiguity quickly while discovering little frontier. Ambiguity reduction alone is therefore not a sufficient exploration objective. Appendix ablations show that stronger proposal priors can materially improve unique-state discovery during corpus building. These results suggest that state identity, proposal quality, and ambiguity-aware search all matter when deciding when to probe and when to commit.
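One plausible reading of "matched-action outcome dispersion" is an entropy over the next states observed for the same screen and action signature, fed into a PUCT-style selection score. The sketch below follows that reading as an assumption: the entropy-based signal, the additive `bonus` term, and all names are hypothetical and are not taken from the ScreenSearch system. Only the PUCT exploration term has the standard form.

```python
import math
from collections import Counter

def outcome_dispersion(next_states):
    """Shannon entropy (bits) of next-state ids seen for one (screen, action) pair."""
    counts = Counter(next_states)
    n = len(next_states)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5, bonus=0.0):
    """Standard PUCT value plus an optional additive bonus (hypothetical)."""
    explore = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + explore + bonus

# The same action always led to one state: no ambiguity.
resolved = outcome_dispersion(["A", "A", "A"])
# The same action led to two different states equally often: one bit of ambiguity.
ambiguous = outcome_dispersion(["A", "B"])

score = puct_score(q=0.0, prior=0.5, parent_visits=4, child_visits=1,
                   c_puct=1.0, bonus=ambiguous)
```

Under this reading, a screen whose matched actions keep producing different next states earns a larger bonus and is probed again before the search commits.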
[AI-23] CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity
链接: https://arxiv.org/abs/2605.16000
作者: Gowrika Mahesh,Budanur Madappa Darshan Gowda,Kavana Gopladevarahalli Papegowda,Prajwal Basavaraj,Binh Vu,Swati Chandna,Mehrdad Jalali
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: 30 pages, 5 main figures, 3 tables, appendices with interface screenshots and implementation details; pilot-stage framework and single-manuscript validation study
Abstract:Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet manuscript-level citation auditing remains largely manual, fragmented, and difficult to scale. Citation context, metadata quality, self-citation patterns, and bibliographic integrity all affect whether a reference appropriately supports a local claim. We present CitePrism, a transparent hybrid decision-support framework for editorial citation auditing that combines LLM-assisted contextual reasoning, embedding-based semantic similarity, metadata verification, integrity-oriented flags, and human-in-the-loop analyst review. CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores, surfaces metadata and self-citation review prompts, and supports configurable threshold-based triage. In a preliminary validation on a single case-study manuscript with 104 references from pavement engineering, agreement with human binary relevance labels reached Cohen’s kappa = 0.429. At operating threshold tau = 17, CitePrism flagged all human-labeled irrelevant citations, while also producing false positives requiring analyst review. These results suggest that CitePrism may support conservative editorial screening and citation-quality triage, but they do not establish general editorial performance. CitePrism is intended as pilot-stage decision support, not as an autonomous misconduct detector or automated editorial decision system. Broader validation across manuscripts, domains, annotators, baselines, and deployment settings is required before operational use.
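For readers unfamiliar with the agreement statistic reported here, Cohen's kappa corrects the observed agreement rate for agreement expected by chance. The sketch below computes it from two label lists. The example labels are invented for illustration and are unrelated to the study's data.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label rates.
    Undefined (division by zero) when p_e == 1.
    """
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Invented binary relevance labels (1 = relevant, 0 = irrelevant).
human = [1, 1, 0, 0, 1, 0, 1, 1]
tool = [1, 0, 0, 0, 1, 1, 1, 1]
kappa = cohens_kappa(human, tool)
```

In this invented example the raters agree on 6 of 8 items, but because both label most items as relevant, chance agreement is already high and kappa comes out well below the raw agreement rate.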
[AI-24] Constrained latent state modeling: A unifying perspective on representation learning under competing constraints
链接: https://arxiv.org/abs/2605.15995
作者: Gwenolé Quellec
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Resources and model cards: this https URL
Abstract:Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, rather than as mere compressed summaries of observations. Yet current approaches remain fragmented, relying on distinct – and often implicit – assumptions about what these states should represent. We argue that this fragmentation reflects a more fundamental limitation: latent representations are typically learned from underconstrained objectives that fail to specify the properties that meaningful latent states should satisfy. As a result, multiple representations can satisfy the same objective, leading to ambiguity in their structure and interpretation. While many of the underlying principles have been explored in isolation, their interactions have not been explicitly formalized. In this work, we propose constrained latent state modeling (CLSM) as a unifying perspective. We identify a set of core properties – predictive sufficiency, minimality, temporal coherence, observation compatibility, invariance to nuisance factors, and structural constraints – and show that they are intrinsically coupled through fundamental trade-offs. Revisiting major modeling families through this lens, we show that existing approaches can be interpreted as enforcing different subsets of constraints, thereby occupying distinct regions of a common design space. This perspective reframes persistent challenges such as lack of identifiability as consequences of underconstrained formulations, rather than isolated technical limitations. More broadly, CLSM provides a principled framework to make design choices explicit, to analyze trade-offs, and to guide the development of more interpretable, robust, and task-aligned latent state models.
[AI-25] Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues
链接: https://arxiv.org/abs/2605.15984
作者: Zhongjie Ba,Liang Yi,Peng Cheng,Qingcao Li,Qinglong Wang,Li Lu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources – distinguishing between textual content and paralinguistic origins – for comprehensive toxic speech detection. In addition, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification heads: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate class imbalance, we incorporate class-balanced sampling and a weighted loss. Experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
[AI-26] Petri Net Induced Heuristic Search for Resource Constrained Scheduling
链接: https://arxiv.org/abs/2605.15983
作者: Ido Lublin,Dor Atzmon,Izack Cohen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the International Symposium on Combinatorial Search (SoCS 2026)
Abstract:We formulate the Resource-Constrained Project Scheduling Problem (RCPSP) as optimal search over the reachability graph of a Timed Transition Petri Net with Resources, using relative-delay tokens so that scheduling decisions correspond to transition firings in the induced state space. We solve the resulting problem with A* guided by a heuristic that combines Critical Path and resource-based lower bounds, and prove that it is consistent under our token-based time semantics. Experiments on the PSPLIB benchmarks show that the approach outperforms strong exact Mixed-Integer Linear Programming (MIP) baselines (SCIP, CBC) in both success rate and solve time. Per-instance analysis shows that heuristic search and MIP degrade along independent axes, resource tightness for A* and formulation size for MIP, with resource strength mediating which solver benefits from scale.
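The abstract describes a heuristic that combines a critical-path bound with a resource-based bound. The Python below is a minimal sketch of that general idea, not the authors' implementation: it computes both bounds for a toy instance and takes the larger. The task data, the single renewable resource, and the use of `max` to combine the bounds are all assumptions made for illustration; the paper's heuristic is defined over Petri net markings, which this sketch does not model.

```python
import math
from functools import lru_cache


def critical_path_bound(durations, successors):
    """Longest duration-weighted precedence chain: a lower bound on makespan."""
    @lru_cache(maxsize=None)
    def tail(task):
        # Duration of `task` plus the longest chain among its successors.
        return durations[task] + max(
            (tail(s) for s in successors.get(task, ())), default=0
        )
    return max(tail(t) for t in durations)


def resource_bound(durations, demands, capacity):
    """Total work divided by capacity, rounded up (single renewable resource)."""
    work = sum(durations[t] * demands[t] for t in durations)
    return math.ceil(work / capacity)


# Hypothetical toy instance: four tasks, one resource with capacity 2.
durations = {"A": 3, "B": 2, "C": 4, "D": 1}
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
demands = {"A": 2, "B": 1, "C": 2, "D": 1}
capacity = 2

cp = critical_path_bound(durations, successors)   # chain A -> C -> D
rb = resource_bound(durations, demands, capacity)
lower_bound = max(cp, rb)  # the max of two lower bounds is still a lower bound
print(cp, rb, lower_bound)
```

In this toy instance the resource bound is the tighter of the two, which illustrates why a search heuristic may want both: neither bound dominates the other across instances.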
[AI-27] Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
链接: https://arxiv.org/abs/2605.15975
作者: Dillon Z. Chen,Till Hofmann,Toryn Q. Klassen,Sheila A. McIlraith
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $\pi^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON’s HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: this https URL
[AI-28] PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
链接: https://arxiv.org/abs/2605.15963
作者: Jingxuan Wei,Xi Bai,Shan Liu,Caijun Jia,Zheng Sun,Xinglong Xu,Siyuan Li,Linzhuang Sun,Bihui Yu,Conghui He,Cheng Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 11 figures, 3 tables
Abstract:Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
[AI-29] Imperfect World Models are Exploitable
链接: https://arxiv.org/abs/2605.15960
作者: Logan Mondal Bhamidipaty(University of Edinburgh),Esmeralda S. Whitammer(University of Edinburgh),David Abel(University of Edinburgh),Mykel J. Kochenderfer(Stanford University),Subramanian Ramamoorthy(University of Edinburgh)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 3 figures, 2 tables
Abstract:We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment’s true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
[AI-30] When and Why Adversarial Training Improves PINNs: A Neural Tangent Kernel Perspective
链接: https://arxiv.org/abs/2605.15959
作者: Yuan-dong Cao,Chi Chiu SO,Jun-Min Wang,He Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Physics-informed neural networks (PINNs) are powerful surrogates for differential equations but are notoriously difficult to train due to spectral bias, stiffness, and poor accuracy on high-frequency or multiscale solutions. Adversarial training based on generative adversarial networks (GANs) has recently shown surprisingly strong empirical results in improving training, but the underlying mechanisms remain elusive. To this end, we propose a new analysis framework for adversarially trained PINNs, based on the key observation of how the discriminator in GANs can influence the training dynamics of PINNs. The framework first provides a much-needed theoretical grounding for why and when adversarial training is effective in PINNs, then presents a unified analysis of GAN variants in such training, and finally leads to a new, practical, efficient training algorithm for PINNs. Empirical results demonstrate that our method can significantly reduce the pathology of PINN training, thereby providing better models with superior performance, often several orders of magnitude more accurate than alternative methods.
[AI-31] CHoE: Cross-Domain Heterogeneous Graph Prompt Learning via Structure-Conditioned Experts IJCAI2026
链接: https://arxiv.org/abs/2605.15888
作者: Peiyuan Li,Yongqi Huang,Jitao Zhao,Dongxiao He,Di Jin,Weixiong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted by IJCAI 2026, 9 pages, 4 figures
Abstract:Heterogeneous Graph Prompt Learning (HGPL) has emerged as a promising paradigm for bridging the gap between the objectives of pre-training foundation models and their downstream applications in heterogeneous graph settings. However, existing HGPL methods are primarily designed for in-domain scenarios, whereas real-world deployments often span multiple domains, and the data used for pre-training and downstream tasks may originate from different distributions. Consequently, the applicability of current HGPL approaches is limited to in-domain settings, and their performance typically degrades when application domains shift. To address this serious limitation, we develop CHoE, a cross-domain HGPL method built upon an expert network. During pre-training, we introduce and train structure-conditioned experts, and during prompt tuning, we adopt a structure-aware expert routing and load balancing mechanism to select structurally compatible experts for each meta-path view. In addition, we design a prompt-based semantic fusion module to integrate representations across multiple views for downstream prediction. Extensive experiments show that CHoE consistently improves performance in few-shot cross-domain applications, outperforming all baseline approaches.
[AI-32] Shapley Neuron Values for Continual Learning: Which Neurons Matter Most? ICML2026
链接: https://arxiv.org/abs/2605.15877
作者: Mohammad Ali Vahedifar,Abhisek Ray,Qi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to ICML 2026
Abstract:Continual learning enables neural networks to learn tasks sequentially without forgetting previously acquired knowledge. However, neural networks suffer from catastrophic forgetting, where learning new tasks degrades performance on earlier ones. We address this problem with Shapley Neuron Valuation (SNV), a principled framework that quantifies neuron importance in continual learning, grounded in cooperative game theory. SNV selectively freezes important neurons while keeping others plastic, enabling buffer-free continual learning without expanding the architecture. Experiments on ImageNet-1k show that SNV consistently outperforms existing buffer-free methods. In particular, SNV improves accuracy by +2.88% in the class-incremental learning scenario and by +6.46% in the task-incremental learning scenario compared with the second-best baseline.
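Shapley values, the cooperative-game quantity that SNV builds on, can be approximated by averaging each player's marginal contribution over random orderings. The sketch below shows that generic Monte Carlo estimator in Python; it is not the SNV procedure from the paper. The three "neurons", their additive contributions, and the interaction bonus in `toy_value` are hypothetical stand-ins for a real model-performance measurement.

```python
import random


def shapley_monte_carlo(players, value_fn, num_permutations=2000, seed=0):
    """Estimate Shapley values by sampling random orderings of the players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for player in order:
            coalition.add(player)
            current = value_fn(frozenset(coalition))
            # Marginal contribution of `player` given those placed before it.
            totals[player] += current - previous
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


# Hypothetical value function: additive contributions plus a bonus
# when neurons n0 and n1 are both active.
CONTRIBUTION = {"n0": 0.5, "n1": 0.3, "n2": 0.0}


def toy_value(coalition):
    bonus = 0.2 if {"n0", "n1"} <= coalition else 0.0
    return sum(CONTRIBUTION[p] for p in coalition) + bonus


values = shapley_monte_carlo(["n0", "n1", "n2"], toy_value)
print(values)
```

For this toy game the exact Shapley values are 0.6, 0.4, and 0.0, because the interaction bonus is split evenly between the two neurons that create it. A neuron with a value near zero would be a natural candidate to leave plastic under a freezing scheme, although the paper's actual selection rule is not specified in the abstract.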
[AI-33] Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
链接: https://arxiv.org/abs/2605.15871
作者: Alberto Pepe,Chien-Yu Lin,Despoina Magka,Bilge Acun,Yannan Nellie Wu,Anton Protopopov,Carole-Jean Wu,Yoram Bachrach
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 55 pages, 28 figures, 21 tables
Abstract:Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer’s best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer’s best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.
[AI-34] RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades
链接: https://arxiv.org/abs/2605.15846
作者: Xinbo Xu,Ruihan Yang,Haiyang Shen,Wendong Xu,Bofei Gao,Ruoyu Wu,Kean Shi,Weichu Xie,Xuanzhong Chen,Ming Wu,Jason Zeng,Michael Heinrich,Elvis Zhang,Liang Chen,Kuan Li,Baobao Chang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 30 pages, 15 figures
Abstract:Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.
[AI-35] GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks
链接: https://arxiv.org/abs/2605.15836
作者: Davide Buoso,Andrea Protopapa,Stefano Di Carlo,Francesca Pistilli,Giuseppe Averta
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project webpage at this https URL
Abstract:Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.
[AI-36] Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation
链接: https://arxiv.org/abs/2605.15831
作者: Yuqing Cheng,Xingyu Ma,Guochen Yu,Xiaotao Gu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
[AI-37] Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets
链接: https://arxiv.org/abs/2605.15787
作者: Kai Hidajat,Solden Stoll,Joseph An
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.
[AI-38] A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking
链接: https://arxiv.org/abs/2605.15779
作者: Jianlin Ye,Christos Kyrkou,Panayiotis Kolios
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Unmanned Aerial Vehicles (UAVs) into Intelligent Transportation Systems (ITS) offers synoptic visibility for traffic monitoring, yet scalable deployment is hindered by trajectory fragmentation, where vehicle identity persistence is lost across multi-UAV Fields of View (FOV). While state-of-the-art frameworks excel in optimizing local trajectory extraction and stability for single-drone imagery, they often function as isolated data silos that generate disjointed trajectories, thereby precluding network-level analysis such as Origin-Destination estimation. This paper presents a real-time Multi-Camera Multi-Vehicle Tracking (MCMT) system designed to handle global identity persistence. Addressing the visual ambiguity and computational cost of appearance-based Re-Identification (Re-ID) in nadir views, we introduce a lightweight Topology-Based Spatiotemporal Handover mechanism. We implement a high-throughput parallel pipeline leveraging YOLO11 and ByteTrack to process concurrent 4K streams. Our core contribution is a deterministic queue-based matching algorithm that utilizes geometric overlaps and virtual lane discretization to predictively manage identity handover via FIFO queues. Experimental results on complex urban environments, including intersections and merging traffic, demonstrate a Handover Success Rate (HOSR) of 99.8% in continuous traffic flows, significantly outperforming Re-ID baselines (74.1%) while validating edge deployment feasibility. The source code is available at this https URL.
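The queue-based handover described above can be pictured as one FIFO queue per virtual lane: a vehicle leaving one camera's view is enqueued, and the next vehicle appearing in the same lane of the neighbouring view inherits the identity at the head of the queue. The Python below is a hedged sketch of that idea only. The class name `LaneHandover`, its methods, and the assumption that vehicles do not overtake within a lane inside the overlap zone are illustrative and are not taken from the paper's code.

```python
from collections import defaultdict, deque


class LaneHandover:
    """Per-lane FIFO queues that carry global IDs across two camera views."""

    def __init__(self):
        self.queues = defaultdict(deque)  # lane index -> pending global IDs
        self.new_count = 0

    def on_exit(self, lane, global_id):
        # A tracked vehicle leaves the upstream view in `lane`.
        self.queues[lane].append(global_id)

    def on_enter(self, lane):
        # A vehicle appears in the downstream view in `lane`.
        queue = self.queues[lane]
        if queue:
            return queue.popleft()  # oldest pending identity in this lane
        # Nothing pending: treat it as a vehicle not seen upstream.
        self.new_count += 1
        return f"new-{self.new_count}"


handover = LaneHandover()
handover.on_exit(lane=0, global_id="car-1")
handover.on_exit(lane=0, global_id="car-2")
handover.on_exit(lane=1, global_id="car-3")

assigned = [
    handover.on_enter(lane=1),
    handover.on_enter(lane=0),
    handover.on_enter(lane=0),
    handover.on_enter(lane=0),
]
print(assigned)
```

The matching is deterministic and needs no appearance features, which is the property the abstract emphasizes; the cost is that any overtaking or lane change inside the overlap region would swap identities unless handled by additional logic.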
[AI-39] SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
链接: https://arxiv.org/abs/2605.15777
作者: Kean Shi,Zihang Li,Tianyi Ma,Zengji Tu,Jialong Wu,Xinbo Xu,Qingyao Yang,Ruoyu Wu,Weichu Xie,Ming Wu,Jason Zeng,Michael Heinrich,Elvis Zhang,Liang Chen,Kuan Li,Baobao Chang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 11 figures
Abstract:Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at this https URL for reproduction.
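Weighted verification checkpoints lend themselves to two scores per task: a partial-progress score (the weighted fraction of checkpoints passed) and a strict-completion flag (all checkpoints passed). The Python below sketches one plausible way to compute both. The function name, the `(weight, passed)` tuple layout, and the example weights are assumptions; the abstract does not give SaaS-Bench's exact scoring formula.

```python
def score_task(checkpoints):
    """Return (partial_progress, strict_completion) for one task.

    `checkpoints` is a list of (weight, passed) pairs; the layout is assumed.
    """
    if not checkpoints:
        raise ValueError("a task needs at least one checkpoint")
    total_weight = sum(weight for weight, _ in checkpoints)
    earned = sum(weight for weight, passed in checkpoints if passed)
    partial_progress = earned / total_weight
    strict_completion = all(passed for _, passed in checkpoints)
    return partial_progress, strict_completion


# Hypothetical task: the heaviest checkpoint and one minor checkpoint pass.
partial, strict = score_task([(2.0, True), (1.0, True), (1.0, False)])
print(partial, strict)
```

Reporting both numbers separates agents that make steady progress but miss a final step from agents that fail early, a distinction that a single pass/fail outcome hides.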
[AI-40] Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics
链接: https://arxiv.org/abs/2605.15769
作者: K. Ege de Bruin,Kyrre Glette,Kai Olav Ellefsen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The co-optimization of a robot’s body and brain presents a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, utilizing Lamarckian inheritance to transfer learned controller parameters from parent to offspring. In dynamic environments, existing literature presents conflicting evidence: while traditional evolutionary theory often suggests Lamarckian inheritance lacks benefit, recent studies in evolutionary robotics indicate it can improve performance. We hypothesize that this is because previous works have not included all relevant variables with dynamic environments. In this work, we show that the benefit of Lamarckian inheritance depends on two variables: how conflicting the environmental changes are to robot control, and the predictability of those changes for the robotic agent. Using virtual soft robots and two different learning approaches, Bayesian optimization and reinforcement learning, we show that Lamarckian inheritance only underperforms Darwinian inheritance when the changes are both conflicting and unpredictable. We find that adding a sensor to detect environmental changes restores the benefits for Lamarckian inheritance in conflicting environments, by allowing robotic agents to predict the need for a different behavior, thereby generalizing their control.
[AI-41] ALSO: Adversarial Online Strategy Optimization for Social Agents
链接: https://arxiv.org/abs/2605.15768
作者: Xiang Li,Liping Yi,Mingze Kong,Min Zhang,Zhongxiang Dai,QingHua Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose ALSO (Adversarial onLine Strategy Optimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate that predicts rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.
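To make the adversarial-bandit framing concrete, the sketch below implements EXP3, a standard algorithm for adversarial bandits, in Python. It is a swapped-in textbook technique rather than the ALSO algorithm: it omits the neural reward surrogate, and the three arms merely stand in for hypothetical persona-plus-strategy combinations. The exploration rate `gamma` and the toy reward rule are assumptions.

```python
import math
import random


class Exp3:
    """EXP3 for adversarial bandits with rewards in [0, 1]."""

    def __init__(self, num_arms, gamma=0.1, seed=0):
        self.num_arms = num_arms
        self.gamma = gamma
        self.weights = [1.0] * num_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        # Mix the weight-proportional distribution with uniform exploration.
        return [
            (1 - self.gamma) * w / total + self.gamma / self.num_arms
            for w in self.weights
        ]

    def select(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.num_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted estimate keeps the reward estimate unbiased.
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.num_arms)


bandit = Exp3(num_arms=3)
for _ in range(500):
    arm, prob = bandit.select()
    reward = 1.0 if arm == 2 else 0.0  # toy environment: only arm 2 pays off
    bandit.update(arm, prob, reward)

final_probs = bandit.probabilities()
best_arm = max(range(3), key=lambda a: final_probs[a])
print(best_arm, final_probs)
```

EXP3 makes no stationarity assumption about rewards, which is the property the abstract appeals to; the uniform exploration term keeps every arm's selection probability bounded away from zero so the policy can still react if an opponent changes behaviour.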
[AI-42] Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments
链接: https://arxiv.org/abs/2605.15734
作者: Izabella Krzeminska,Michal Butkiewicz,Ewa Komkowska
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Full survey article with data tables for further possible replicability and comparison
Abstract:The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time. 
[AI-43] Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation
Link: https://arxiv.org/abs/2605.15714
Authors: Sunil Kothari,Sumukha Sharma Thoppanahalli Chandramouli,Naman Khandelwal,Parth Kulshreshtha,Ashi Jain,Kriti Banka,Tanuja Chintada,Venkata Triveni,Gulipalli Praveen Kumar,Manish Mehta,Tao Liu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 8 pages
Abstract:This position paper argues that the machine learning community should prioritize early-stage quality assurance in annotation pipelines over the prevailing practice of late-stage validation. Data quality bottlenecks increasingly limit foundation model improvement, yet quality assurance research focuses almost exclusively on validation methods rather than validation timing. When validation occurs, not merely what methods are employed, fundamentally determines both error rates and annotation costs. This temporal neglect is puzzling given the well-established “shift-left” principle from software engineering, where empirical studies demonstrate 4–100x cost multipliers for defects detected in later stages (Boehm, 1981; Shull et al., 2002). Annotation pipelines exhibit analogous dynamics: errors caught before annotation begins cost a fraction of those discovered after review cycles complete. We propose a taxonomy of three QA trigger points, namely pre-annotation (T0), post-annotation (T1), and post-review (T2), that decompose annotation workflows into discrete validation opportunities. A parametric error-propagation model formalizes when timing affects final error rates versus only economics, making timing a measurable design variable rather than a configuration afterthought. A survey of 47 recent papers reveals that only 4% report when validation occurs, a striking gap given timing’s demonstrated impact in adjacent fields. Without explicit attention to QA timing, the community risks optimizing validation methods while ignoring the structural variable that may matter most. Acting on this position requires three steps: researchers should report QA timing configurations alongside validation methods; annotation platforms should expose timing as a first-class parameter; and the community should run controlled experiments that measure stage-specific detection rates directly.
[AI-44] Learning Dynamic Pick-and-Place for a Legged Manipulator
Link: https://arxiv.org/abs/2605.15713
Authors: Moonkyu Jung,Jiseong Lee,Zhengmao He,Donghoon Youm,Juhyeok Mun,HyeongJun Kim,Hyunsik Oh,Donghyuk Choi,Jungwoo Hur,Jie Song,Jemin Hwangbo
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE Robotics and Automation Letters 2026
Abstract:Legged manipulators extend robotic capabilities beyond static manipulation by integrating agile locomotion with versatile arm control. However, achieving precise manipulation while maintaining coordinated locomotion remains a major challenge. This work presents a hierarchical reinforcement learning framework for dynamic pick-and-place tasks using a quadruped equipped with a 6-DOF robotic arm. The framework incorporates an explicit mass estimation module enabling adaptive whole-body control for objects with varying weights. In simulation, the system achieves an 86.05% success rate with payloads up to 2.3 kg. The approach is further validated through real-world experiments across six representative scenarios with controlled variations in object physical properties (size and mass) and task heights. Specifically, within a wide vertical workspace ranging from ground level to 1.1 m-high tabletops, the system demonstrates an average success rate of 73.3% for payloads up to 1.3 kg, with an average execution time of 4.06 s. Unlike prior works that handle lightweight objects and execute pick-and-place motions with slow, piecewise motions, the proposed framework exploits concurrent locomotion and manipulation for dynamic, continuous execution. These results demonstrate the potential of quadrupedal mobile manipulators for adaptive, whole-body pick-and-place with heavier payloads and extended workspaces.
[AI-45] Feedback World Model Enables Precise Guidance of Diffusion Policy
Link: https://arxiv.org/abs/2605.15705
Authors: Tuo An,Jindou Jia,Gen Li,Jingliang Li,Chuhao Zhou,Pengfei Liu,Bofan Lyu,Jiaqi Bai,Xinying Guo,Geng Li,Jianfei Yang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 21 pages, 9 figures
Abstract:World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.
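The closed-loop idea in this abstract (compare each prediction with the observed next state and carry the mismatch forward) can be illustrated with a toy observer-style correction. This is a minimal sketch under stated assumptions, not the paper's method: the linear dynamics, the constant unmodeled drift, the additive form of the feedback state, and the gain value are all hypothetical.

```python
import numpy as np

def model_predict(s, a):
    """Hypothetical learned world model; it does not know about the drift."""
    return 0.9 * s + a

def true_step(s, a):
    """Hypothetical true dynamics with an unmodeled constant drift."""
    return 0.9 * s + a + np.array([0.5, -0.3])

gain = 0.5               # assumed feedback gain, chosen for illustration
feedback = np.zeros(2)   # lightweight feedback state, updated online
s = np.zeros(2)
open_loop_err, closed_loop_err = [], []

for t in range(20):
    a = np.array([0.1, 0.1])
    raw = model_predict(s, a)      # static, open-loop prediction
    corrected = raw + feedback     # prediction after feedback correction
    s_next = true_step(s, a)       # the true next state is observed after acting
    open_loop_err.append(np.linalg.norm(s_next - raw))
    closed_loop_err.append(np.linalg.norm(s_next - corrected))
    # Observer-style update: move the feedback state toward the residual.
    feedback = feedback + gain * (s_next - corrected)
    s = s_next
```

In this toy setting the open-loop error stays at the size of the drift, while the corrected error shrinks by a factor of `1 - gain` at every step. No model parameters are changed; only the small feedback state is updated.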
[AI-46] Interaction-Aware Influence Functions for Group Attribution
Link: https://arxiv.org/abs/2605.15675
Authors: Jaeseung Heo,Kyeongheung Yun,Youngbin Choi,Sehyun Hwang,Jungseul Ok,Dongwoo Kim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples’ effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.
[AI-47] PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI
Link: https://arxiv.org/abs/2605.15665
Authors: Keshava Chaitanya,Jahnavi Gundakaram
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 figure, 5 tables. arXiv preprint
Abstract:Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt – iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the this http URL V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.
[AI-48] Bridging Silicon and the Hippocampus: Algebro-Deterministic Memory “VaCoAl” as a Substrate for Vector-HaSH and TEM
Link: https://arxiv.org/abs/2605.15652
Authors: Hiroyuki Chuma,Kanji Otsuka,Yoichi Sato
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 41 pages, 5 figures, 1 table, 3 appendices
Abstract:Vector-HaSH and the Tolman-Eichenbaum Machine (TEM) propose that the hippocampal-entorhinal circuit factorizes content from a prestructured grid-cell scaffold and supports compositional memory via ripple-mediated replay. Human iEEG shows that hippocampal sharp-wave ripples (SWRs) gate episodic recall, ripple-locked cortical reactivation recapitulates encoding-time patterns, and multi-hop replay fidelity decays multiplicatively along sequence length. These literatures have advanced in parallel without a shared algebraic object. We show that VaCoAl, an algebro-deterministic hyperdimensional memory architecture built from Galois-field LFSRs, supplies that object. Specifically, deterministic Galois-field diffusion provides a substrate-level alternative to Vector-HaSH’s random scaffold-to-hippocampus projection that satisfies the same quasi-orthogonality requirement, with matched second-moment statistics, stronger avalanche behavior, and bit-exact reproducibility. The path-integral Confidence Ratio CR_2, the product of per-step CR_1 values along an n-hop chain, is the natural functional form for multi-hop replay-fidelity decay under conditional independence of per-step reactivation, providing the first algebraically tractable model of reported multiplicative decay. STDP-like path selection in VaCoAl follows from architectural demands – similarity preservation, compositional reversibility, and bounded-frontier search – that also constrain hippocampal computation. We further argue that VaCoAl operating regimes share architectural commitments with the EC–CA3 and EC–DG–CA3 pathways, motivating an energy-capacity-plasticity reading of why both are conserved across 520 Myr of evolution and primate dentate-gyrus elaboration. We prove formal correspondences, derive testable iEEG predictions, and bridge computational neuroscience and hyperdimensional engineering.
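The multiplicative decay mentioned in this abstract, where the path-level Confidence Ratio CR_2 is the product of per-step CR_1 values, is easy to see numerically. A minimal sketch, assuming hypothetical per-step values; the function name and the numbers are illustrative and do not come from the paper.

```python
import math

def path_confidence(per_step_cr1):
    """Path-level confidence as the product of per-step CR_1 values."""
    return math.prod(per_step_cr1)

# Hypothetical per-step values for chains of increasing length (illustrative only).
decay = [path_confidence([0.9] * hops) for hops in range(1, 6)]
```

With a constant per-step value below 1, the path-level value falls geometrically with the number of hops, which is the shape the abstract attributes to multi-hop replay fidelity.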
[AI-49] Sharp Spectral Thresholds for Logit Fixed Points
Link: https://arxiv.org/abs/2605.15651
Authors: Tongxi Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:Softmax feedback systems are a common mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a self-reinforcing softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer. By treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that the classical approach misses an entire stable regime and does not identify the point at which the qualitative change truly occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean threshold is $\beta\|\Pi W\Pi\|_{\mathcal T\to\mathcal T}<2$, rather than the previously used condition, which certifies stability only while the softmax system remains safely over-regularized. Our theorem fills the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to reward-responsive yet globally predictable systems. It enlarges the certified stability boundary for these systems and identifies where the model genuinely undergoes a phase transition.
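A small numerical check makes the stability question concrete. The sketch below iterates the map x ↦ softmax(β(Wx + b)) from two different starting points, with β scaled so that β‖ΠWΠ‖ = 1.5, a value below 2 and above 1. The affine form of the map, the reading of Π as the orthogonal projector onto the sum-zero subspace, and the random W and b are assumptions made for illustration; the abstract does not spell them out.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def iterate(W, b, beta, x0, steps=300):
    """Repeatedly apply x <- softmax(beta * (W x + b))."""
    x = x0
    for _ in range(steps):
        x = softmax(beta * (W @ x + b))
    return x

rng = np.random.default_rng(0)
n = 5
W = rng.standard_normal((n, n))          # hypothetical interaction matrix
b = rng.standard_normal(n)               # hypothetical offset
P = np.eye(n) - np.ones((n, n)) / n      # projector onto the sum-zero subspace
op_norm = np.linalg.norm(P @ W @ P, 2)   # spectral (Euclidean operator) norm
beta = 1.5 / op_norm                     # beta * op_norm = 1.5

x_a = iterate(W, b, beta, np.full(n, 1.0 / n))   # start at the uniform point
x_b = iterate(W, b, beta, np.eye(n)[0])          # start at a simplex vertex
gap = np.linalg.norm(x_a - x_b)
```

Both runs should end at the same point because the softmax Jacobian has spectral norm at most 1/2, so the map contracts in the Euclidean norm whenever β‖ΠWΠ‖ is below 2. A bound that treats softmax as 1-Lipschitz would certify this only for values below 1.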
[AI-50] ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing
Link: https://arxiv.org/abs/2605.15625
Authors: Lijie Ding,Changwoo Do
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Soft Condensed Matter (cond-mat.soft)
Comments:
Abstract:We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, whether as a standalone agent or inside an existing agent system. By harnessing the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studies of phase behavior, self-assembly, and materials design. Without dedicated simulation tools and workflow instructions, general-purpose Large Language Model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom-built colpack Python package that wraps HOOMD-blue hard-particle Monte Carlo, and the skill encodes a four-stage workflow contract. ColPackAgent can carry out the workflow interactively with human feedback, autonomously from an end-to-end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes with several colloidal packing simulation examples such as cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard-disk freezing transition using autoresearch. We also compare model performance on this workflow across a panel of LLMs with 17 stage-specific prompts. This benchmark provides a stage-level check of how reliably different models follow the setup, planning, and analysis workflow. Together, these results show that pairing a domain Python package with MCP tools and a portable agent skill provides a practical route for turning a simulation toolkit into an agent-assisted research workflow.
[AI-51] A Few GPUs A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM
Link: https://arxiv.org/abs/2605.15617
Authors: Shaoke Xi,ChonLam Lao,Boyi Jia,Jiaqi Gao,Zhipeng Zhang,Jiamin Cao,Brian Sutioso,Erci Xu,Minlan Yu,Kui Ren,Yong Li,Zhengping Qian,Ennan Zhai,Jingren Zhou
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 13 pages body, 21 pages total
Abstract:Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and costly. This is because engineers often need to reproduce production behaviors to diagnose failures or evaluate optimizations, thereby demanding frequent and even exclusive access to production-scale clusters – which becomes increasingly hard given that the majority of GPUs are already committed to production workloads. Simulation relies on complex performance models that are difficult to maintain, and downscaled experiments often fail to capture scale-dependent behaviors. We present PrismLLM to decouple large-scale execution from the need to access large clusters, enabling engineers to run and observe ranks of interest under faithful large-scale behavior using only a few GPUs. PrismLLM constructs a high-fidelity execution graph via a slicing-based approach that captures computation, communication, and dependencies of the target scale. Then, PrismLLM performs hybrid emulation where selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, achieving only 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs required by the original deployment.
[AI-52] TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices
Link: https://arxiv.org/abs/2605.15611
Authors: Junle Wang,Xingchuang Liao,Wenjun Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages
Abstract:Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis–Evidence–Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
[AI-53] Offline Reinforcement Learning with Universal Horizon Models ICML2026
Link: https://arxiv.org/abs/2605.15603
Authors: Hojun Chung,Junseo Lee,Songhwai Oh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026
Abstract:Model-based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self-generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long-horizon reasoning. Project page: this https URL
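The "winsorized horizon distribution" in this abstract can be pictured as sampling a prediction horizon and capping it at a maximum value. A minimal sketch, assuming a geometric horizon distribution with parameter 1 − γ and an upper cap; the function name, γ, and the cap are hypothetical choices, since the abstract gives neither the distribution nor its parameters.

```python
import numpy as np

def sample_winsorized_horizons(gamma, cap, size, seed=0):
    """Sample horizons from a geometric distribution and cap the large ones."""
    rng = np.random.default_rng(seed)
    # Geometric with success probability 1 - gamma; support is {1, 2, ...}.
    horizons = rng.geometric(p=1.0 - gamma, size=size)
    # Winsorize the upper tail: any horizon above `cap` is replaced by `cap`.
    return np.minimum(horizons, cap)

h = sample_winsorized_horizons(gamma=0.99, cap=50, size=10_000)
```

Capping keeps the probability mass of long horizons (it is moved onto the cap rather than discarded) while bounding how far ahead the model is ever asked to predict during training.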
[AI-54] STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Link: https://arxiv.org/abs/2605.15581
Authors: Junle Wang,Xingchuang Liao,Wenjun Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages
Abstract:LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
[AI-55] Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis
Link: https://arxiv.org/abs/2605.15569
Authors: Penghui Li,Hong Yau Chong,Yinzhi Cao,Junfeng Yang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: In Proceedings of the 47th IEEE Symposium on Security and Privacy (SP)
Abstract:Microservices are widely adopted in modern cloud systems due to their scalability and fault tolerance. However, microservice architectures introduce significant complexity in privilege and permission control, creating risks of privilege escalation where attackers can gain unauthorized access to resources or operations. Detecting such vulnerabilities is challenging due to complex cross-service interactions, polyglot codebases, and diverse privileged operations and permission checks. We present Neo, an agentic program analysis framework that combines large language models (LLMs) with classic program analysis to address these challenges. Neo leverages an LLM-based agent that dynamically generates analysis plans, adapts code search strategies, and validates semantics. We develop code search primitives that enable Neo to perform scalable and flexible code exploration across services and languages. We evaluated Neo on 25 open-source microservice applications spanning 7 programming languages and 6.2 million lines of code. Neo uncovered 24 zero-day privilege escalation vulnerabilities and achieved 81.0% precision and 85.0% recall on a ground-truth dataset. Compared to existing program analysis and agentic solutions, Neo demonstrated significant improvements in both detection accuracy and scalability. We further showcased Neo’s extensibility by applying it to other application domains and vulnerability types, uncovering 18 additional zero-day vulnerabilities.
[AI-56] Position: Artificial Intelligence Needs Meta Intelligence – the Case for Metacognitive AI ICML26
Link: https://arxiv.org/abs/2605.15567
Authors: Sergei Chuprov,Richard D. Lange,Leon Reznik,Paulo Shakarian,Raman Zatsarenko,Dmitrii Korobeinikov
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This is a preliminary version accepted for presentation and publication at the 43rd International Conference on Machine Learning (ICML26). The modified final version will be available in the conference proceedings
Abstract:This position paper argues for metacognition as a general design principle for creating more accurate, secure, and efficient AI. The metacognitive solution involves systems monitoring their own states and judiciously allocating resources depending on each problem instance’s difficulty or cost of mistakes. Drawing inspiration both from past work on resource-rational AI and from well-documented metacognitive strategies in psychology and cognitive science, we identify specific challenges in embedding these strategies into AI design and highlight open theoretical and implementation problems. We showcase these principles through a tangible example of improved learning efficiency, effectiveness, and security in a Federated Learning (FL) case study. We show how these principles can be translated into practice with a novel software framework developed specifically to allow the community to design, deploy, and experiment with metacognition-enabled AI applications.
[AI-57] AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
Link: https://arxiv.org/abs/2605.15565
Authors: Haizhong Zheng,Yizhuo Di,Jiahui Wang,Shuowei Jin,Xueshen Liu,Yongji Wu,Z. Morley Mao,Ion Stoica,Jiawei Zhao,Beidi Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.
[AI-58] Operator-Controlled 6G: From Connectivity Infrastructure to Guaranteed Digital Services
Link: https://arxiv.org/abs/2605.15553
Authors: David Soldani
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 81 pages, 18 figures, 66 references
Abstract:Sixth-generation mobile networks (6G) are approaching a structural inflection point. Five generations of vendor-led architectures have left operators procuring and operating networks they do not own, on platforms they cannot modify, with AI layers they cannot audit. This paper argues that 6G must reverse this trajectory by reordering operator priorities: Control First, Customer First, Business First, Operations First, and Technology Last. Technology should serve operator control, customer outcomes, monetizable guarantees, and software-driven operations, not dictate this http URL contributions operationalize this thesis. The 6G Control Compact defines a three-layer ownership taxonomy–own, federate, and consume–that allocates architectural sovereignty according to strategic value. The Guarantee Economy defines a five-tier, outcome-priced commercial model that converts operator control into enforceable service-level objectives. The framework is grounded in operational evidence from Rakuten Mobile, the world’s first national-scale, fully cloud-native, fully Open RAN deployment, which reached full-year EBITDA profitability in FY2025. It is aligned with the ITU-R IMT-2030 framework, 3GPP 6G use cases and service requirements, NGMN recommendations, ETSI standards, O-RAN Alliance and AI-RAN Alliance specifications, IOWN Global Forum sustainability metrics, Linux Foundation initiatives, and leading industry and academic programs. A three-phase roadmap covering 2025-2027, 2027-2029, and 2029-2032 and beyond, together with seven stakeholder-specific calls to action, translates the architecture into industry commitments. The central claim is that Rakuten Mobile’s deployment demonstrates the feasibility of operator-controlled 6G. Decisions made during 2026-2028 will determine whether 6G becomes a platform for guaranteed digital services or another vendor-dependent infrastructure cycle.
[AI-59] CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models
Link: https://arxiv.org/abs/2605.15549
Authors: Stefano Riva,Carolina Introini,Antonio Cammi,Dean Price,Alexey Yermakov,Yue Zhao,Philippe M. Wyder,Judah Goldfeder,Jan Williams,Amy Sara Rude,Matteo Tomasetto,Joe Germany,Joseph Bakarji,Georg Maierhofer,Miles Cranmer,J. Nathan Kutz
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:The demand for clean energy is ever increasing, with new nuclear technologies presenting a complementary solution to renewable energies. However, designing and operating these systems is exceptionally difficult, given the complexity of the physical phenomena that interact to form the system dynamics. While high-fidelity simulations help to understand the non-linear, multi-physics interactions within a reactor, they are computationally expensive and rarely suitable for real-time applications. Furthermore, model-based approaches are inherently sensitive to simplifying assumptions required to derive their governing equations and parameters, leading to inevitable discrepancies with real-world measurements. In contrast, Machine Learning (ML) methods have the potential to generate reliable surrogate models which may be able to quickly predict the system’s behaviour. However, the number of data-driven methods that can potentially be used for this task is large and diverse. In a safety-critical setting such as nuclear engineering, a fair comparison of different ML methods, and a clear understanding of their advantages and limitations, is of paramount importance. To address this, we introduce a Common Task Framework (CTF) for ML in nuclear engineering, building upon previous efforts in dynamical systems and seismology. This CTF considers a curated set of datasets from different nuclear and nuclear-adjacent systems. The CTF evaluates the performance of a method on 12 established metrics, alongside a new paradigm focused on system monitoring from sparse measurements only. We illustrate the framework by benchmarking standard ML baselines against these datasets, revealing current method limitations. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigour and reproducibility in scientific ML for the nuclear industry.
[AI-60] Domain-Independent Game Abstraction using Word Embedding Techniques
Link: https://arxiv.org/abs/2605.15543
Authors: Juho Kim, Tuomas Sandholm
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:
Abstract:Many games of interest in the real world are intractably large, necessitating the use of game abstraction to shrink them, typically by many orders of magnitude. Over the last two decades, there have been significant advances in game abstraction; however, the domain-specific nature (usually poker) of much of the prior work prevents those techniques from being easily generalized to other settings without extensively analyzing the game at hand. In this paper, we propose a domain-independent approach to game abstraction, which applies word embedding techniques from the field of natural language processing. Treating each action as a word and gameplay data as a corpus, word vectors can be trained to represent each action as a real-valued vector, which can then be clustered to facilitate game abstraction. We also explore the use of foundational embedding models and show that action embeddings obtained this way can capture a surprising amount of information about the underlying game. Experimental results demonstrate that our proposed game abstraction technique is effective, although it does not outperform specialized algorithms tailored to specific games.
[AI-61] DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
Link: https://arxiv.org/abs/2605.15542
Authors: Yichao Liu, Huawen Shen, Liu Yu, Shiyu Liu, Zeyu Chen, Yu Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures
Abstract:GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.
[AI-62] RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
Link: https://arxiv.org/abs/2605.15537
Authors: Jing Wang, Shang Liu, Hangan Zhou, Zhiyao Xie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by DAC 2026
Abstract:This paper introduces RTL-BenchMT, an agentic framework for dynamically maintaining RTL generation benchmarks. Large Language Model (LLM)-assisted automated RTL generation is one of the most important directions in EDA research. However, current RTL benchmarks face two critical challenges: (1) flawed cases in the benchmarks and (2) overfitting to the benchmarks. Both challenges are difficult to resolve purely by manual engineering effort. To address these issues and systematically reduce human maintenance costs, we propose an automated agentic framework, RTL-BenchMT. RTL-BenchMT focuses on two key applications: (1) automatically identifying and revising flawed benchmark cases and (2) automatically detecting and updating overfitting cases. With the assistance of RTL-BenchMT, we conduct a thorough, in-depth analysis of flawed and overfitting cases and produce a refined benchmark suite that will be open-sourced to the community.
[AI-63] Neural Point-Forms
Link: https://arxiv.org/abs/2605.15524
Authors: Bruno Trentini, Jacob Hume, Vincenzo Antonio Isoldi, Philipp Misof, Ekaterina S. Ivshina, Kelly Maggs
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Statistics Theory (math.ST)
Comments:
Abstract:Point cloud learning often rests on the premise that observed samples are noisy traces of an underlying geometric object, such as a manifold embedded in a high-dimensional feature space. Yet much of this geometry is not captured directly by coordinates, pairwise distances, or learned graph neighborhoods alone. In the smooth setting, differential forms are devices to encode higher-order tangency information. In this work, we introduce a new family of principled learnable geometric features for point clouds called neural point-forms (NPFs). In the absence of a natural tangency structure, we instead use Laplacian-based techniques from Diffusion Geometry to build a discrete model for comparing differential forms on point clouds via inner products. In the continuum, submanifolds of a shared ambient feature space are represented as comparison matrices, whose entries describe how pairs of feature forms interact with extrinsic tangency information. We make this intuition precise by proving the long-run consistency of comparison matrices under standard sampling, bandwidth, density, and manifold-hypothesis assumptions. This yields a compact, efficient and permutation-invariant neural layer whose output is a learned form-comparison matrix. Across synthetic and biologically relevant experiments, we show that NPFs provide a competitive and interpretable representation, with the strongest benefits appearing when labels depend on sampling density, manifold-like structure, or response-relevant population geometry.
[AI-64] On the Fragility of Data Attribution When Learning Is Distributed
Link: https://arxiv.org/abs/2605.15520
Authors: Xian Gao, Bo Hui, Min-Te Sun, Wei-Shinn Ku
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Data attribution has become an important component of pricing, auditing, and governance in machine learning pipelines, yet most attribution methods implicitly assume that attribution values faithfully reflect participants’ contributions. We show that this assumption can fail: a single participant in a standard distributed training workflow can substantially inflate its measured attribution value while preserving global utility. Our attribution-first attack uses latent optimization to inject small synthetic batches that preserve utility while exploiting non-IID label coverage and evaluator sensitivities. Across datasets, models, and multiple marginal-utility evaluators, the attack consistently increases the adversary’s attribution value and reshapes the relative attribution structure among benign clients without degrading accuracy or triggering geometry-based defenses. These results show that attribution itself forms a new attack surface and motivate the development of attribution-robust and incentive-compatible scoring mechanisms.
[AI-65] CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
Link: https://arxiv.org/abs/2605.15513
Authors: Fangzhou Lin, Shuo Xing, Peiran Li, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 31 pages, 2 figures, 18 tables
Abstract:Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier’s accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.
[AI-66] PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources
Link: https://arxiv.org/abs/2605.15507
Authors: Bumsu Park, Chanho Park, Youngmok Park, Namyoon Lee
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:For a Gaussian source under mean-squared error (MSE), classical transform coding is rate–distortion (RD) optimal: the Karhunen–Loeve transform (KLT) diagonalizes the covariance, reverse waterfilling allocates the bits, and scalar quantization closes the loop. This elegant story breaks down for multimodal sources, where no single covariance can capture heterogeneous local geometries, and the RD function loses its closed form. We revisit this problem through Gaussian-mixture sources and develop a constructive RD theory for them. Our key finding is that the mixture structure incurs only a component label cost. Conditioned on the active mixture component, each branch is Gaussian; the challenge is allocating bits across heterogeneous branches. We prove that the genie-aided conditional RD function is governed by a single global reverse-waterfilling level shared across all components and eigenmodes. Building on this result, we introduce PrismQuant, which transmits the component label losslessly and encodes the residual using the component-matched KLT, followed by scalar quantization, achieving a rate within $H(C)/n$ bits per source dimension of the converse, with a vanishing asymptotic gap. We further develop a practical implementation based on EM-driven Gaussian-mixture learning, component-adaptive KLTs, and entropy-constrained scalar quantization (ECSQ). Experiments on synthetic Gaussian mixtures show that PrismQuant closely approaches the theoretical RD bound, while experiments on real-world channel-state-information (CSI) data demonstrate competitive or superior performance compared with transformer-based learned codecs at more than one order of magnitude smaller model size.
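The "single global reverse-waterfilling level" can be illustrated with a minimal sketch. It assumes the conditional RD problem takes the textbook form: find one level `theta` such that the weight-averaged distortion `sum_k w_k * sum_i min(theta, lambda_{k,i})` equals the target, then spend `0.5 * log2(lambda / theta)` bits on every eigenmode above the level. The function name, the bisection search, and the example numbers are illustrative assumptions, not the paper's implementation, and the label cost $H(C)$ is not included in the returned rate.

```python
import math

def global_reverse_waterfill(weights, eigenvalues, distortion, iters=200):
    """Find one water level shared by all mixture components (sketch).

    weights[k]      -- mixture weight of component k (assumed to sum to 1)
    eigenvalues[k]  -- covariance eigenvalues of component k
    distortion      -- target MSE, between 0 and the weighted total variance
    Returns (theta, conditional_rate_in_bits).
    """
    def avg_distortion(theta):
        return sum(w * sum(min(theta, lam) for lam in lams)
                   for w, lams in zip(weights, eigenvalues))

    lo, hi = 0.0, max(max(lams) for lams in eigenvalues)
    # avg_distortion is nondecreasing in theta, so bisection applies.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if avg_distortion(mid) < distortion:
            lo = mid
        else:
            hi = mid
    theta = (lo + hi) / 2
    rate = sum(w * sum(0.5 * math.log2(lam / theta) for lam in lams if lam > theta)
               for w, lams in zip(weights, eigenvalues))
    return theta, rate

# Hypothetical two-component mixture, one eigenmode each.
theta, rate = global_reverse_waterfill([0.5, 0.5], [[4.0], [1.0]], distortion=0.5)
```

With a single component this reduces to classical reverse waterfilling for one Gaussian source.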
[AI-67] Learning with Conflicts of Interest
Link: https://arxiv.org/abs/2605.15504
Authors: Nischal Aryal, Arash Termehchy, Ali Vakilian, Marianne Winslett
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Financial, social, and political factors often prevent the interests of the owners of ML systems and services and their users from being perfectly aligned. ML systems often produce biased information that can influence users to make decisions that are not in their best interest. Current solution approaches require ML systems to implement protocols to mitigate their biases. However, ML system owners usually do not have any incentive to implement these protocols and often argue that it limits their freedom of expression or business. We believe that a successful solution to this problem must recognize the conflict of interest between the ML systems and their users, and use this information to protect users against information that adversely influences their decisions while allowing users to safely benefit from these systems. To this end, we propose a game-theoretic framework that models the interaction between ML systems and users with conflicts of interest. We present scalable algorithms with theoretical guarantees that maximize the amount of desired information and actions and minimize the amount of biased and manipulative actions in interaction with ML systems.
[AI-68] Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs
Link: https://arxiv.org/abs/2605.15491
Authors: Vincent-Daniel Yun, Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:
Abstract:Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning.
[AI-69] Hybrid LLM-based Intelligent Framework for Robot Task Scheduling
Link: https://arxiv.org/abs/2605.15486
Authors: Swayamjit Saha, Subhabrata Das, Haonan Duan, Xiao-Yang Liu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures
Abstract:This study introduces intelligent frameworks that use Large Language Models (LLMs) to improve task scheduling for construction robots. The LLM is fed with key data about the desired task, such as agent action abilities, and the desired end goal to be achieved. A well-balanced allocation strategy is developed, optimizing both time efficiency and resource utilization. Our system utilizes a Natural Language Processing interface to streamline communication with construction professionals and adapt in real-time to unexpected site conditions. We concurrently use two LLM agents, specifically generator (GPT-4) and supervisor (Gemma 3/Llama 4/Mistral 7b) LLM agents to provide a more precise task schedule. We evaluate the proposed methodology using a straightforward scenario and provide metric scores to prove the efficacy of the frameworks. Our results highlight that the implementation of LLMs is crucial in construction operational tasks including robots.
[AI-70] Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays
Link: https://arxiv.org/abs/2605.15480
Authors: Kaize Deng, Zewen Yang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at 23rd IFAC World Congress 2026
Abstract:Stochastic communication delays in teleoperation introduce signal discontinuities that undermine control stability and degrade control performance. Consequently, conventional reinforcement learning (RL) methods struggle with delayed observations, which leads to high-frequency chattering. To address this, we propose delay-resilient RL, a hybrid control framework that integrates a Long Short-Term Memory (LSTM) state estimator with a residual RL policy and is resilient to stochastic delays. The LSTM reconstructs smooth, continuous state estimates from delayed observations, enabling the RL agent to learn a residual torque compensation policy that balances tracking accuracy with velocity smoothness. Experimental validation on Franka Panda robots demonstrates that our approach significantly outperforms the state-of-the-art baselines, ensuring robust and stable teleoperation even under high-variance stochastic delays.
[AI-71] DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
Link: https://arxiv.org/abs/2605.15461
Authors: Yikun Zhang, Xiwei Cheng, Tianyu Liu, Yuanqi Du, Wengong Jin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Building state-of-the-art (SOTA) predictive models for drug discovery requires expensive search over tools, architectures, and training strategies. Current LLM-based agents can find SOTA solutions through extensive trial and error, but they do not retain the experience accumulated along the way and therefore pay the full search cost on every new task. We propose DrugSAGE (Self-evolving Agent Experience), a framework that accumulates and reuses experience across tasks to build SOTA drug discovery models efficiently. DrugSAGE maintains a cross-task memory of verified skills, statistical evidence about effective strategies, and a record of recurring errors and their fixes. In some cases, DrugSAGE transfers a working solution directly without test-time search. In 33 molecular property prediction tasks, DrugSAGE ranks first among nine SOTA agents in a single-task setting. With memory accumulated from 16 smaller tasks, DrugSAGE achieves an averaged normalized score of 0.935 on 17 held-out tasks in a cross-task evaluation setting and outperforms all baseline agents by 10-30% in a zero-test-time search regime. In summary, our work shows the advantage of cross-task memory for efficient SOTA model development in drug discovery.
[AI-72] From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates ICML2026
Link: https://arxiv.org/abs/2605.15445
Authors: Ruobing Zuo, Hanrui Zhao, Gaolei He, Zhengfeng Yang, Jianlin Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026. Preprint version
Abstract:Automated proving of polynomial inequalities is a fundamental challenge in automated mathematical reasoning, where rich algebraic structure and a rapidly growing certificate search space hinder scalability. Purely symbolic approaches provide strong guarantees but often scale poorly as the number of variables or the degree increases, due to expensive algebraic manipulations and rapidly growing intermediate expressions. In parallel, LLM-guided methods have made notable progress, particularly on competition-style inequalities with a small number of variables. To address the remaining scalability challenges, we propose NSPI, a neuro-symbolic framework that combines the complementary strengths of LLMs and symbolic computation for polynomial-inequality proving. Concretely, an LLM proposes a conjecture in the form of an approximate polynomial Sum-Of-Squares (SOS) decomposition; we refine it via symbolic computation to obtain an exact polynomial SOS representation, which directly proves the target inequality, and we further certify the proof in Lean, yielding an end-to-end pipeline from heuristic discovery to machine-checked proof. Experiments on challenging benchmarks involving polynomials with up to 10 variables demonstrate the effectiveness and scalability of the proposed method.
[AI-73] Runtime-Structured Task Decomposition for Agentic Coding Systems
Link: https://arxiv.org/abs/2605.15425
Authors: Shubhi Asthana, Bing Zhang, Chad DeLuca, Hima Patel, Ruchi Mahindru
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Paper presented at ACM Conference on AI and Agentic Systems 2026 at the Agentic Software Engineering workshop
Abstract:Agentic coding systems increasingly use large language models (LLMs) for software engineering tasks such as debugging, root cause analysis, and code review. However, many existing systems encode task logic, execution flow, and output generation inside monolithic prompts. This design creates brittle behavior, limited debuggability, and high retry costs because failures often require rerunning the full workflow. We present runtime-structured task decomposition, an architectural approach in which task partitioning and execution flow are managed through executable control logic rather than prompt structure alone. LLMs are used only for focused judgment tasks, and outputs are validated against predefined schemas before downstream execution. We evaluate this approach on two software engineering workloads using three configurations: monolithic execution, static decomposition with fixed subtasks and no runtime branching, and runtime-structured decomposition. Each configuration was evaluated across 10 runs. Our results show that decomposition alone does not necessarily reduce retry cost. In the Kubernetes root cause analysis workload, the static decomposition baseline produced a retry cost of 1,632 +/- 145 tokens versus 904 +/- 17 tokens for the monolithic baseline because failures forced reruns of downstream subtasks. A similar pattern appeared in the multi-file debugging workload, where the static baseline consumed 933 tokens compared to 703 tokens for the monolithic system. The runtime-structured approach reran only failed subtasks, reducing retry costs to 436 +/- 132 tokens for root cause analysis and 460 tokens for debugging. Overall, the approach achieved up to 51.7% lower retry cost than monolithic systems and 73.2% lower retry cost than static decomposition baselines, improving efficiency, debuggability, and operational reliability in agentic coding systems. 
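The retry behaviour described in this abstract (control flow in executable code, schema validation before downstream execution, rerunning only the failed subtask) can be sketched as follows. This is a minimal illustration under stated assumptions: the subtask names, the dictionary-based schema check, and the stub functions standing in for LLM calls are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Tuple

# (name, step function, schema mapping required keys to types) -- hypothetical shape
Subtask = Tuple[str, Callable[[Dict[str, dict]], dict], Dict[str, type]]

def validate(output: object, schema: Dict[str, type]) -> bool:
    """Accept an output only if every required key has the expected type."""
    return isinstance(output, dict) and all(
        isinstance(output.get(key), typ) for key, typ in schema.items())

def run_pipeline(subtasks: List[Subtask], max_retries: int = 2):
    """Run subtasks in order; on a validation failure rerun only that subtask."""
    results: Dict[str, dict] = {}   # validated outputs are kept and reused
    calls: Dict[str, int] = {}      # how many times each subtask was invoked
    for name, step, schema in subtasks:
        for _ in range(max_retries + 1):
            calls[name] = calls.get(name, 0) + 1
            output = step(results)
            if validate(output, schema):
                results[name] = output
                break
        else:
            raise RuntimeError(f"subtask {name!r} failed validation")
    return results, calls

def collect(results):
    # Stub for a deterministic data-gathering step.
    return {"events": ["pod restarted", "memory limit hit"]}

def make_flaky_diagnose():
    # Stub for an LLM judgment call that returns a malformed answer once.
    state = {"attempts": 0}
    def diagnose(results):
        state["attempts"] += 1
        if state["attempts"] == 1:
            return {"cause": None}
        return {"cause": "OOMKilled", "evidence": results["collect"]["events"]}
    return diagnose

results, calls = run_pipeline([
    ("collect", collect, {"events": list}),
    ("diagnose", make_flaky_diagnose(), {"cause": str}),
])
```

In this run the `collect` step executes once and only `diagnose` is repeated, which is the property the abstract credits for the lower retry cost.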
[AI-74] f-Trajectory Balance: A Loss Family for Tuning GFlowNets Generative Models and LLMs with Off- and On-Policy Data ICML2026
Link: https://arxiv.org/abs/2605.15417
Authors: Jake Fawkes, Jason Hartford
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published at ICML 2026
Abstract:In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low variance, surrogate loss for training generative models. This loss has the property that when evaluated on-policy its gradients correspond to those of the KL divergence, while off-policy it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are that of the corresponding $f$-divergence, but retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one to one correspondence between translation invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.
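The on-policy statement for the squared log-ratio loss can be checked with a short derivation. The notation is an assumption made for illustration: $q_\theta$ is the model, $p = \tilde p / Z$ the target, $c$ any constant offset (for instance a $\log Z$ estimate), the samples are drawn from a frozen copy $q_{\theta'}$ of the model with $\theta' = \theta$, and the factor $\tfrac12$ is a chosen normalization rather than the paper's.

```latex
\begin{aligned}
\mathcal{L}(\theta)
  &= \tfrac12\, \mathbb{E}_{x \sim q_{\theta'}}
     \big[ (\log q_\theta(x) - \log \tilde p(x) + c)^2 \big], \\
\nabla_\theta \mathcal{L}(\theta)\big|_{\theta' = \theta}
  &= \mathbb{E}_{x \sim q_\theta}
     \big[ (\log q_\theta(x) - \log \tilde p(x) + c)\, \nabla_\theta \log q_\theta(x) \big] \\
  &= \mathbb{E}_{x \sim q_\theta}
     \big[ (\log q_\theta(x) - \log p(x))\, \nabla_\theta \log q_\theta(x) \big]
     \quad \text{since } \mathbb{E}_{q_\theta}[\nabla_\theta \log q_\theta] = 0, \\
  &= \nabla_\theta\, \mathrm{KL}(q_\theta \,\|\, p).
\end{aligned}
```

The third line uses the fact that constants multiplying the score function vanish in expectation, which is also why the loss depends on the log probabilities only up to translation.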
[AI-75] Margin-Adaptive Confidence Ranking for Reliable LLM Judgement ICML2026
Link: https://arxiv.org/abs/2605.15416
Authors: Gaojie Jin, Yong Tao, Lijia Yu, Tianjin Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026
Abstract:Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model’s estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.
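A margin-based ranking objective of the kind mentioned here can be sketched as a pairwise hinge loss: confidence scores for cases where the LLM agrees with humans should exceed scores for disagreement cases by at least a margin. The hinge form, the fixed margin, and the function name are assumptions for illustration; the abstract does not specify the exact loss or how the margin is adapted.

```python
def margin_ranking_loss(agree_scores, disagree_scores, margin=0.5):
    """Mean hinge penalty over all (agreement, disagreement) score pairs.

    A pair contributes zero once the agreement case is scored at least
    `margin` higher than the disagreement case.
    """
    pairs = [(a, d) for a in agree_scores for d in disagree_scores]
    return sum(max(0.0, margin - (a - d)) for a, d in pairs) / len(pairs)

# Hypothetical confidence scores from an estimator.
well_separated = margin_ranking_loss([0.9], [0.2])
poorly_separated = margin_ranking_loss([0.6], [0.5])
```

A larger margin demands a wider gap between the two groups of scores before the penalty reaches zero.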
[AI-76] Amortized Energy-Based Bayesian Inference
Link: https://arxiv.org/abs/2605.15407
Authors: Hojjat Kaveh, Ricardo Baptista, Andrew M. Stuart
Affiliations: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments: 25 pages, 10 figures
Abstract:We consider amortized Bayesian inference for nonlinear inverse problems in settings where only samples from the joint distribution of parameters and observations are available. Classical methods such as Markov chain Monte Carlo require solving a new inference problem for each observation, which can be computationally prohibitive when inference must be repeated many times. We propose a transport-based approach that learns an observation-dependent map pushing forward a reference measure to approximate the posterior distribution. The map is trained by minimizing an averaged energy-distance objective between the true posterior and the learned pushforward. This formulation is likelihood-free, requiring only joint samples, and avoids density evaluation, invertibility constraints, and Jacobian determinant computations. For function-space inverse problems with Gaussian priors, we parameterize the transport map as the identity plus a perturbation in the Cameron-Martin space of the prior, preserving absolute continuity with respect to the prior. In infinite-dimensional settings, the map is represented using neural operators. We illustrate the method on a finite-dimensional nonlinear inverse problem and two PDE-constrained inverse problems arising in porous medium flow and seismic inversion. The results show that the learned transport captures posterior structure, including multimodality and dominant modes, while enabling fast posterior sampling for new observations.
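The energy distance underlying this objective has a simple sample estimate that needs only pairwise distances, which is why no density evaluation or Jacobian determinant appears. The sketch below computes the standard quantity $2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|$ from two finite samples. It is a plain estimator under assumed names, not the paper's training loss, which additionally averages over observations and works from joint samples.

```python
import math
from itertools import product

def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y) with x in a, y in b."""
    return sum(math.dist(x, y) for x, y in product(a, b)) / (len(a) * len(b))

def energy_distance(xs, ys):
    """Sample estimate of 2 E|X-Y| - E|X-X'| - E|Y-Y'| (includes i == j pairs)."""
    return (2 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))

# Hypothetical one-dimensional samples, written as tuples of coordinates.
same = energy_distance([(0.0,), (1.0,)], [(0.0,), (1.0,)])
shifted = energy_distance([(0.0,), (1.0,)], [(2.0,), (3.0,)])
```

The estimate is zero when the two samples coincide and grows as they separate.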
[AI-77] Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming
Link: https://arxiv.org/abs/2605.15400
Authors: Wei Sheng, Rohan Paleja
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:While AI agents are rapidly advancing from isolated tools to interactive collaborators, data-driven human-machine teaming (HMT) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes. Zero-shot coordination (ZSC) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded. To remedy this deficiency, we propose Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes. We assess IBTS on Overcooked-AI in both two-agent and three-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction. Our evaluation includes simulated partners, synthetic partner-style variation, and, to our knowledge, the first 30-subject Overcooked-AI HMT study involving two real human teammates and one machine teammate. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse-reward coordination mechanisms with partner-variation coverage rather than relying on diversity alone.
[AI-78] Breakeven complexity: A new perspective on neural partial differential equation solvers
Link: https://arxiv.org/abs/2605.15399
Authors: Yijing Zhang, Nicholas Roberts, Tanya Marwah, Mikhail Khodak
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments:
Abstract:Neural surrogate solvers of partial differential equations (PDEs) promise dramatic speedups over numerical methods, especially in scenarios requiring many solves. However, current accuracy-based evaluations do not fully consider two central issues: (1) neural solvers incur substantial up-front costs for data generation, training, and tuning; and (2) classical solvers can also generate low-fidelity solutions at a sufficiently low simulation cost. To explicitly account for these realities and fully incorporate end-to-end costs, we propose an evaluation framework centered on breakeven complexity, a metric that counts the forward solves before a learned solver is cost-effective relative to an error-equivalent traditional solver. To evaluate this measure, we apply scaling laws to determine how much training budget to allocate to data generation and discuss how to achieve smooth error-matching in diverse settings. We evaluate the breakeven complexity of multiple neural PDE solvers on three PDEs on 2D periodic domains from APEBench and a novel benchmark of flows past multiple obstacles generated by the GPU-native PyFR code. Among other findings, our results suggest that neural PDE solvers become more effective as problems get harder in terms of cost, dimension, rollout, physics regime (e.g. higher Reynolds number), etc.
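The idea of a breakeven count can be made concrete with a simple cost model. Assuming the learned solver pays a one-time up-front cost (data generation, training, tuning) plus a fixed cost per solve, and the error-equivalent classical solver pays only a fixed cost per solve, the breakeven point is the smallest number of solves at which the learned solver's total cost no longer exceeds the classical one. The linear cost model, the function name, and the numbers are illustrative assumptions, not the paper's definition.

```python
import math

def breakeven_solves(upfront_cost, neural_cost_per_solve, classical_cost_per_solve):
    """Smallest N with upfront + N * neural <= N * classical (linear cost model).

    Returns math.inf when the learned solver is not cheaper per solve,
    because the up-front cost can then never be recovered.
    """
    saving_per_solve = classical_cost_per_solve - neural_cost_per_solve
    if saving_per_solve <= 0:
        return math.inf
    return math.ceil(upfront_cost / saving_per_solve)

# Hypothetical costs in GPU-seconds.
n_star = breakeven_solves(upfront_cost=1000.0,
                          neural_cost_per_solve=1.0,
                          classical_cost_per_solve=11.0)
```

Under this model a cheaper low-fidelity classical solver shrinks the per-solve saving and pushes the breakeven count up.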
[AI-79] Representation Without Reward: A JEPA Audit for LLM Fine-Tuning
Link: https://arxiv.org/abs/2605.15394
Authors: Biswa Sengupta
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the induced hidden-state geometry must reach the language-model head and improve the decoded task metric. We test that requirement under a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation, comparing twenty-two training-time auxiliaries across trajectory-shape regularisation, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective constructed to lie in cross-entropy’s positive cone. The empirical answer is a structured null: several auxiliaries clear single-cell paired \alpha = 0.10 without correction (T3-Local at \Delta = +2.53 pp, p = 0.003 being the strongest), but none survives Bonferroni or Holm–Bonferroni at the relevant family-wise threshold, even though many change curvature, anisotropy, variance, and gradient direction. Decoder-visible JEPA yields the first positive auxiliary–cross-entropy gradient cosine in the study, yet exact match remains inside seed noise; a full-fine-tuning replication of the same auxiliary at n = 5 seeds reproduces the null on both benchmarks (TURK: \Delta = +0.04 pp, p_\text{paired} = 0.96; SYNTH: \Delta = +0.52 pp, p_\text{paired} = 0.28), so the null is robust across LoRA and full fine-tuning for the decoder-visible construction. Hidden-state representation work and decoded-task accuracy are therefore weakly coupled in this regime; we accordingly reframe LLM-domain JEPA evaluation as a coupling problem, in which the operative question is under which metrics useful hidden geometry becomes decoder-visible task signal.
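The Holm–Bonferroni correction mentioned in the abstract above is a standard step-down procedure, sketched below. The function name and the p-values are illustrative assumptions and are not taken from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-down rule: sort p-values ascending and compare the k-th smallest
    (k = 0, 1, ...) against alpha / (m - k); stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Every larger p-value is retained as well.
            break
    return reject


# Illustrative p-values only.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.20], alpha=0.05)
```

Here 0.03 and 0.04 would pass an uncorrected 0.05 threshold but are retained after correction, the same kind of gap the abstract reports between single-cell and family-wise significance.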
[AI-80] Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
Link: https://arxiv.org/abs/2605.15384
Authors: Songwei Dong,Zihan Chen,Chengshuai Shi,Peng Wang,Jundong Li,Cong Shen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 13 figures
Abstract:Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of memory quality. Through extensive experiments across diverse tasks and memory methods, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from substantial forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability that remain invisible under standard evaluation metrics.
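Backward transfer and forgetting, two of the quantities named in the abstract above, have common definitions in the continual-learning literature. The sketch below uses those textbook definitions on an accuracy matrix; SeqMem-Eval's exact formulas may differ, and the matrix values are made up for illustration.

```python
def backward_transfer(acc):
    """Mean change on earlier tasks after all updates.

    acc[t][i] is the accuracy on task i measured after the t-th memory
    update (assumed layout). Negative values mean later updates hurt
    earlier tasks.
    """
    T = len(acc)
    return sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)


def forgetting(acc):
    """Mean gap between the best accuracy ever reached on a task and its
    final accuracy, over all tasks except the last."""
    T = len(acc)
    total = 0.0
    for i in range(T - 1):
        best_earlier = max(acc[t][i] for t in range(i, T - 1))
        total += best_earlier - acc[T - 1][i]
    return total / (T - 1)


# Illustrative 3-task accuracy matrix (rows: after update t, columns: task i).
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.90, 0.00],
    [0.60, 0.85, 0.75],
]
bwt = backward_transfer(acc)
fgt = forgetting(acc)
```

In this toy matrix the final row alone looks acceptable, while `bwt` is negative and `fgt` is positive, which is the kind of failure a single final score would hide.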
[AI-81] Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Link: https://arxiv.org/abs/2605.15377
Authors: Eugene Koran,Yejun Yun,Samantha Tetef,Benjamin Arnav,Pablo Bernabeu-Pérez
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.
[AI-82] PACER: Acyclic Causal Discovery from Large-Scale Interventional Data
Link: https://arxiv.org/abs/2605.15353
Authors: Ramon Viñas Torné,Sílvia Fàbregas Salazar,Soyon Park,Ivo Alexander Ban,Artyom Gadetsky,Nikita Doikov,Maria Brbić
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
Comments: Accepted at the 43rd International Conference on Machine Learning (2026)
Abstract:Inferring the structure of directed acyclic graphs (DAGs) from data is a central challenge in causal discovery, particularly in modern high-dimensional settings where large-scale interventional data are increasingly available. While interventional data can improve identifiability, existing methods remain limited by soft acyclicity constraints, leading to optimization over invalid cyclic graphs, numerical instability, and reduced scalability. We introduce PACER (Perturbation-driven Acyclic Causal Edge Recovery), a scalable framework for causal discovery that guarantees acyclicity by construction. PACER parameterizes a distribution over DAGs through a joint model of variable permutations and edge probabilities, enabling direct optimization over valid causal structures without surrogate penalties. The framework supports a unified likelihood-based treatment of observational and interventional data, flexible conditional density models, and the incorporation of structural prior knowledge. For linear-Gaussian mechanisms, we derive closed-form expressions for the expected interventional log-likelihood and its gradients, yielding substantial computational gains. Empirically, PACER matches or exceeds state-of-the-art methods on protein signaling and large-scale genetic perturbation benchmarks, while scaling efficiently to networks with thousands of variables and achieving up to two orders of magnitude speedups over penalty-based differentiable approaches. These results demonstrate that exact and scalable causal discovery from high-dimensional perturbation data is achievable through principled search space design.
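Why a permutation combined with edge probabilities yields acyclic graphs by construction can be seen in a few lines. This is only an illustration of the general idea, assuming a fixed ordering and independent Bernoulli edges. It is not PACER's learned, differentiable parameterization, and all names are hypothetical.

```python
import random


def sample_dag(order, edge_prob, rng):
    """Sample a directed graph whose edges only go from earlier to later
    nodes in `order`, so no cycle can form.

    order:     a permutation of the node ids
    edge_prob: dict mapping (u, v) to the probability of edge u -> v
    rng:       a random.Random instance
    """
    edges = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            u, v = order[a], order[b]
            if rng.random() < edge_prob.get((u, v), 0.0):
                edges.append((u, v))
    return edges


rng = random.Random(0)
order = [2, 0, 1]
probs = {(2, 0): 1.0, (2, 1): 1.0, (0, 1): 1.0, (1, 2): 1.0}
edges = sample_dag(order, probs, rng)
```

The entry `(1, 2)` is never sampled because node 1 comes after node 2 in the ordering, so every sampled graph respects the permutation and no acyclicity penalty is needed.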
[AI-83] LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
Link: https://arxiv.org/abs/2605.15341
Authors: Marilyn Zhang,Tianfeng Chen,Fabián Barzuna,Ankita Rathod,Mark E. Whiting
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench (Learning Efficiency in Adaptive Processes), a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle’s reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best design approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on 6 tasks where the literature-typical and published-best configurations diverge, and domain-agnostic prompting matches the published-best design more often on all 6. The trajectory metric also doubles as a tractable training target. Offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
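A best-so-far AUC metric of the kind described in the abstract above can be sketched as the average of the running maximum over iterations. This is a minimal version under that assumption; LEAPBench's normalization and horizon handling may differ, and the reward sequences are invented.

```python
def best_so_far_auc(rewards):
    """Average of the running maximum of `rewards` over all iterations.

    A trajectory that finds a good design early scores higher than one
    that reaches the same final value late.
    """
    best = float("-inf")
    total = 0.0
    for r in rewards:
        best = max(best, r)
        total += best
    return total / len(rewards)


# Two illustrative trajectories with the same final outcome (5).
slow = best_so_far_auc([1, 3, 2, 5])
fast = best_so_far_auc([5, 5, 5, 5])
```

Final-outcome scoring would rate the two trajectories equally, whereas the trajectory metric separates them, which is how the best-model decision can change between the two scoring schemes.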
[AI-84] Hidden in Memory: Sleeper Memory Poisoning in LLM Agents
Link: https://arxiv.org/abs/2605.15338
Authors: Sidharth Pulipaka,Stanislau Hlebik,Leonidas Raghav,Sahar Abdelnabi,Vyas Raina,Ivaxi Sheth,Mario Fritz
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 86 pages, 60 tables
Abstract:Large language models are increasingly augmented with persistent memory, allowing assistants to store user-specific information across sessions for personalization and continuity. This statefulness introduces a new security risk: adversarial content can corrupt what an assistant remembers and thereby influence future interactions. We propose and study sleeper memory poisoning, a delayed attack in which an adversary manipulates external context, such as a document, webpage, or repository, to cause the assistant to store a fabricated memory about the user. Unlike conventional prompt injection, the attack can remain dormant and re-emerge across multiple later conversations. We evaluate the full attack pipeline: whether poisoned memories are written, later retrieved, and ultimately used to steer the following conversations. Across stateful LLM assistants, poisoned memories were written at rates of up to 99.8% on GPT-5.5 and 95% on Kimi-K2.6. Crucially, among successful retrievals, poisoned memories cause attacker-intended agentic actions in 60-89% of evaluations across models. These results show that persistent memory can act as a long-term attack surface across multiple future conversations.
[AI-85] HoloMotion-1 Technical Report
Link: https://arxiv.org/abs/2605.15336
Authors: Maiyue Chen,Kaihui Wang,Bo Zhang,Xihan Ma,Zhiyuan Yang,Yi Ren,Qijun Huang,Zihao Zhu,Yucheng Wang,Zhizhong Su
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures, 6 tables. Technical report
Abstract:In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.
[AI-86] Zero-Shot Goal Recognition with Large Language Models NEURIPS2026
Link: https://arxiv.org/abs/2605.15333
Authors: Kin Max Piamolini Gusmão,Nathan Gavenski,Nir Oren,Felipe Meneguzzi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, 1 table; appendix with 8 figures and 2 code listings (29 pages total); submitted to NeurIPS 2026
Abstract:Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.
[AI-87] Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
Link: https://arxiv.org/abs/2605.15301
Authors: Han Li,Jinyu Tian,Rili Feng,Yuqiao Du,Chong Zheng,Chenyu Wang,Chenchen Liu,Shihao Li,Xinping Lei,Yifan Yao,Weihao Xie,Letian Zhu,Jiaheng Liu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.
[AI-88] GESD: Beyond Outcome-Oriented Fairness
Link: https://arxiv.org/abs/2605.15295
Authors: Gideon Popoola,John Sheppard
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 7 pages, Accepted at IEEE CAI
Abstract:Machine learning (ML) algorithms are increasingly deployed in high-stakes decision-making domains such as loan approvals, hiring, and recidivism predictions. While existing fairness metrics (e.g., statistical parity, equal opportunity) effectively quantify outcome-oriented disparities, they offer limited insight into the procedure or explanation behind biased decisions. To address this gap, we propose Group-level Explanation Stability Disparity (GESD), a procedural-oriented fairness metric that measures disparities in the stability, robustness, and sensitivity of model explanations across different subgroups in a protected category. GESD is explainer-agnostic, model-agnostic, and extends the scope of fairness analyses to the level of explainability. We further integrate GESD into a multi-objective optimization framework that jointly optimizes for utility, outcome-based fairness, and explanation-based fairness called FEU (Fairness–Explainability–Utility). Empirical results on multiple benchmark datasets show that GESD effectively captures group-wise discrepancies in explanation quality, and that FEU improves both utility and fairness over state-of-the-art methods. By bridging outcome-based and explanation-based fairness, GESD offers a comprehensive tool for diagnosing and mitigating bias in predictive modeling. Our code and datasets are available on GitHub.
[AI-89] GQA-μP: The maximal parameterization update for grouped query attention
Link: https://arxiv.org/abs/2605.15290
Authors: Kyle R. Chickering,Huijuan Wang,Mengxi Wu,Alexander Moreno,Muhao Chen,Xuezhe Ma,Daria Soboleva,Joel Hestness,Zhengzhong Liu,Eric Xing
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages
Abstract:Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA). We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter as well as experiments regarding transfer over weight decay.
[AI-90] Universal Approximation of Nonlinear Operators and Their Derivatives
Link: https://arxiv.org/abs/2605.15285
Authors: Filippo de Feo
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments:
Abstract:Derivative-Informed Operator Learning (DIOL), i.e. learning a (nonlinear) operator and its derivatives, is an open research frontier at the foundations of the influential field of Operator Learning (OL). In particular, Universal Approximation Theorems (UATs) of nonlinear operators and their derivatives are foundational open questions and delicate problems in nonlinear functional analysis. In this manuscript, we prove the first UATs of non-linear k -times differentiable operators between Banach spaces and their derivatives, uniformly on compact sets and in weighted Sobolev norms for general finite input measures, via OL architectures. Our results are the first complete generalizations of the corresponding influential classical results in [Hornik, 1991] to infinite-dimensional settings and OL. We discuss several open areas where DIOL and our UATs find applications: high-order accuracy in OL, fast constrained optimization in Banach spaces (e.g. optimal control of PDEs, inverse problems) and numerical methods for infinite-dimensional PDEs (e.g. HJB PDEs on Banach spaces from optimal control of PDEs, SPDEs, path-dependent systems, partially observed systems, mean-field control). We parameterize nonlinear operators via Encoder-Decoder Architectures, renowned classes in OL due to their generality, including classical architectures, such as DeepONets, Deep-H-ONets, PCA-Nets. Our results are based on four key features that allow us to prove UATs in full generality: (i) Approximation Properties of Banach spaces. (ii) k -times continuous differentiability in the sense of Bastiani (weaker than k -times continuous Fréchet differentiability). (iii) Natural compact-open topologies for UA; indeed, we show that UA in standard compact-open topologies induced by operator norms is violated even for Fréchet derivatives. (iv) Construction of novel weighted Sobolev spaces for the UA. 
[AI-91] Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance
Link: https://arxiv.org/abs/2605.15281
Authors: Vinil Pasupuleti,Siva Rama Krishna Varma Bayyavarapu,Shrey Tyagi
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, 5 tables, IEEE conference format
Abstract:Modern web test suites rot. A UI refactor breaks locators, a timing change causes race conditions, and within weeks developers abandon the suite entirely. This paper presents an AI-driven autonomous testing framework that addresses these failure modes through five integrated strategies - navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning - implemented over a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated across four production applications and 176 scenarios, the framework improves script generation success from 55% to 93%, achieves an 8x reduction in navigation failures, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% compared to manual Selenium authoring. The framework extends naturally to security validation: testers describe attack scenarios in plain English - “try accessing another user’s invoice” - which the agent converts to OWASP Top 10-aligned browser probes, detecting 85% of authentication bypass vulnerabilities and 95% of input validation flaws with false positive rates below 12%. Natural-language-driven security testing of this kind represents, to our knowledge, a novel contribution to the field.
[AI-92] PDRNN: Modular Data-driven Pedestrian Dead Reckoning on Loosely Coupled Radio- and Inertial-Signalstreams
Link: https://arxiv.org/abs/2605.15252
Authors: Peter Bauer,Andreas Porada,Felix Ott,Christopher Mutschler,Tobias Feigl
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 12 pages
Abstract:Modern pedestrian dead reckoning (PDR) systems rely on fusing noisy and biased estimates of position, velocity, and calibrated orientation derived from loosely coupled sensors to determine the current pose of a localized object. However, discrepancies in the sampling rates of sensor-specific estimation methods and unreliable transmission pose significant challenges. Moreover, traditional methods often fail to effectively fuse multimodal sensor data during dynamic movements characterized by high accelerations, velocities, and rapidly varying orientations. To address these limitations, we propose a simple recurrent neural network (RNN) architecture capable of implicitly forecasting asynchronous sensor data streams from diverse estimation methods along reference trajectories. The proposed approach introduces PDRNN, a modular hybrid AI-assisted PDR system that handles each component as an independent ensemble of machine learning (ML) models to estimate both key parameter means and variances. Separate ML-based models are employed to estimate orientation, (un)directed velocity or distance from acceleration and gyroscope data, with optional absolute positioning from synchronized radio systems such as 5G for stabilization. A final fusion model combines these outputs (position, velocity, and orientation) while using uncertainty estimates to enhance system robustness. The modular design allows individual components to be updated, fine-tuned, or replaced without affecting the entire system. Experiments on dynamic sports movement data show that PDRNN achieves superior accuracy and precision compared to classic and ML-based methods, effectively avoiding the error accumulation common in black-box approaches. PDRNN also offers forecasting capabilities and better component control despite increased system complexity.
[AI-93] GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Link: https://arxiv.org/abs/2605.15250
Authors: Fanxu Meng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA’s, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
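The 28.125% figure in the abstract above can be reproduced with simple arithmetic. LLaMA-3-8B's GQA uses 8 key/value heads of dimension 128. The latent sizes below (a 512-dimensional compressed latent plus a 64-dimensional RoPE key, as in DeepSeek-V2's MLA) are an assumption chosen because they match the reported ratio; the abstract does not state them.

```python
def gqa_cache_per_token(num_kv_heads, head_dim):
    """Cached values per token per layer for GQA: keys plus values."""
    return 2 * num_kv_heads * head_dim


def latent_cache_per_token(latent_dim, rope_dim):
    """Cached values per token per layer for an MLA-style absorbed path:
    one shared compressed latent plus one shared RoPE key."""
    return latent_dim + rope_dim


gqa = gqa_cache_per_token(num_kv_heads=8, head_dim=128)       # LLaMA-3-8B
latent = latent_cache_per_token(latent_dim=512, rope_dim=64)  # assumed sizes
ratio = latent / gqa
```

With these numbers the latent cache holds 576 values per token per layer against 2048 for GQA, a ratio of exactly 0.28125.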
[AI-94] Reading the Cell Designing the Cure: Perturbation-Conditioned Molecular Diffusion for Function-Oriented Drug Design
Link: https://arxiv.org/abs/2605.15243
Authors: Ziyu Xu,Zijian Zhang,Liang Wang,Zhiyuan Liu,Qiang Liu,Shu Wu,Liang Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
Comments:
Abstract:When reliable target structures are unavailable at scale or phenotypes arise from dysregulated pathways, transcriptomic perturbations provide a system-level functional readout for drug action. In this work, we formalize Transcriptome-based Drug Design (TBDD) as a generative inverse problem: designing drug molecules conditioned on desired transcriptomic state transitions. We analyze the inherently ill-posed nature of this task, which is further complicated by the profound domain gap between biology and chemistry and by the sparsity of transcriptomic signals. To address these challenges, we propose our model (A CellUlar Response Engine), a multi-resolution transcriptome-guided diffusion framework. The model features a specialized Transcriptome Perturbation Functional Feature Extractor (TFE) that (1) distills function-oriented perturbation embeddings from pre/post states, (2) aligns these signatures to dual chemical views to bridge the cross-modal gap, and (3) performs heterogeneity-aware aggregation to extract robust state-specific signals from noisy transcriptomic data. Extensive evaluations on both standard benchmarks and rigorous out-of-distribution protocols demonstrate that the model consistently outperforms strong baselines in structural quality and functional consistency. Furthermore, we validate its practical utility via a zero-shot gene-inhibitor design task, highlighting the potential of phenotype-driven generative discovery.
[AI-95] Hydra: Efficient Correct Code Generation via Checkpoint-and-Rollback Support
Link: https://arxiv.org/abs/2605.15238
Authors: Alexander Du,Jianjun Ou,Danyang Zhuo,Matthew Lentz
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:
Abstract:Large language models are increasingly used for code generation, but many generated programs fail to compile, a prerequisite for further correctness checks such as unit tests. Existing solutions for repairing static errors are costly in both latency and token consumption. Post-hoc repair delays error detection until generation completes and commonly regenerates large regions of previously valid code. Constrained semantic decoding checks after each token, incurring per-token overhead while limiting repair to the current token even when the root cause lies earlier. We present Hydra, a system for efficient recovery from static errors during code generation. Hydra allows checking to proceed asynchronously with generation, avoiding checker overhead when the generated code is semantically correct. In addition, it provides checkpoint-and-rollback support for targeted repair, avoiding regeneration and rechecking of valid prefixes. We retrofit the Clang C/C++ compiler to support Hydra with modest modifications. Paired with a token-efficient repair strategy, Hydra reduces latency by up to 71% and token consumption by up to 70% relative to post-hoc repair on C/C++ code generation tasks that encounter static errors.
[AI-96] A3D: Agentic AI flow for autonomous Accelerator Design
Link: https://arxiv.org/abs/2605.15237
Authors: Abinand Nallathambi,Christopher Knight,Shantanu Ganguly,Wilfried Haensch,Anand Raghunathan
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accelerating applications through the design of hardware accelerators can significantly enhance system performance and energy efficiency. Despite advances, such as high-level synthesis (HLS), designing accelerators for complex applications still remains highly labor-intensive, demanding considerable expertise in understanding workloads to be accelerated, hardware design, micro-architecture, and EDA tool usage, posing challenges for application domain experts. Therefore, most accelerator solutions are limited to applications with a regular predictable dataflow. Advances in AI have enabled agents that perform autonomous planning, reasoning, execution and reflection, leading to unprecedented potential for automation through agentic AI. We present A3D, an Agentic AI flow for end-to-end Automation of hardware Accelerator Design. A3D automates workload analysis, performance bottleneck identification, code refactoring for HLS compatibility and micro-architecture generation. A3D also generates diverse accelerator designs by automatically exploring the speed-area tradeoff space. Recent efforts have explored the use of AI for specific tasks such as design space exploration in HLS, leaving several tasks to still be performed manually. A3D addresses the challenges in applying modern LLMs to accelerator design by judiciously partitioning tasks among specialist agents, orchestrating process loops with specialist and verifier agents, utilizing pre-existing and custom tools, and employing agentic RAG for codebase and proprietary EDA tool documentation exploration. Our implementation of A3D, using commercial components like Claude Sonnet 4.5 and the Catapult HLS tool, demonstrates its effectiveness by generating accelerator designs with no human intervention from complex scientific applications like LAMMPS (molecular dynamics simulation) and QMCPACK (quantum chemistry).
[AI-97] Learning Selective Merge Policies for Deadline-Constrained Coded Caching via Deep Reinforcement Learning
Link: https://arxiv.org/abs/2605.15236
Authors: Amirhossein Yousefiramandi
Institutions: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:With coded caching, the server can use the information the users have cached to serve multiple users at a time by sending a single coded multi-casting message, i.e., the merged message, thereby relieving peak network loads. However, for delay-sensitive user applications, such as video streaming services, it becomes essential to choose which messages to merge online, considering the strict deadline of each request. The problem is that while a merge helps form the current coded multi-casting message, it can harm the subsequent ones. We propose a DRL-based solution that formulates deadline-constrained coded delivery as a masked discrete-action queue-state control problem, and we train a graph-attention policy network via proximal policy optimization. The policy network reduces the broadcast-packet expiration ratio ρ by 40.9% (0.208 vs. 0.352) with respect to the best coded multi-casting baseline (SACM++) on the uniform-demand benchmark, while also attaining the best broadcast-efficiency score σ across the Track A battery among the coded multi-casting methods. The interesting fact we observed is that for user applications with tight deadlines, selective merging is better than aggressive merging, i.e., the policy network learns to merge at a rate of only ≈ 31.8%, even though the same observation holds across the variations within the same simulator family.
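The "merged message" idea in the abstract can be illustrated with the textbook two-user case of coded caching. This is a minimal sketch, not the paper's DRL policy: the packet contents and the helper name `xor_bytes` are hypothetical, and deadlines are ignored entirely.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical equal-length packets held by the server.
file_a = b"AAAA-video-chunk"
file_b = b"BBBB-video-chunk"

# User 1 requests A and already caches B; user 2 requests B and caches A.
# The server broadcasts one merged packet instead of two separate ones.
merged = xor_bytes(file_a, file_b)

# Each user cancels the packet it already holds to recover its request.
decoded_by_user1 = xor_bytes(merged, file_b)
decoded_by_user2 = xor_bytes(merged, file_a)
```

One broadcast serves both users, which is the saving a merge buys. The scheduling question studied in the abstract is which such merges to perform when every request also carries a deadline.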
[AI-98] PBT-Bench: Benchmarking AI Agents on Property-Based Testing
Link: https://arxiv.org/abs/2605.15229
Authors: Lucas Jing,Xinqi Wang,Liao Zhang,Simon S. Du
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library’s documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding) for three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation, suggesting the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning. 
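The gap between default random inputs and a strategy that "concentrates mass in the trigger region" can be sketched with the standard library alone. Everything here is hypothetical: `buggy_clamp` is a made-up function with an injected boundary bug, and the two generators stand in for a default and a hand-tuned Hypothesis strategy; none of it is taken from the benchmark.

```python
import random

def buggy_clamp(x: int, lo: int, hi: int) -> int:
    """Clamp x into [lo, hi]. Injected bug: x == hi + 1 slips through."""
    if x < lo:
        return lo
    if x > hi + 1:  # bug: the correct test is x > hi
        return hi
    return x

def default_gen(rng):
    """Wide uniform inputs, similar in spirit to a default integer strategy."""
    lo = rng.randint(-10**9, 10**9)
    hi = rng.randint(lo, 10**9)
    return rng.randint(-10**9, 10**9), lo, hi

def boundary_gen(rng):
    """Inputs concentrated within two steps of the interval endpoints."""
    lo = rng.randint(-10**9, 10**9)
    hi = rng.randint(lo, 10**9)
    return rng.choice([lo, hi]) + rng.randint(-2, 2), lo, hi

def find_violation(gen, trials, seed=0):
    """Return the first input breaking the invariant lo <= result <= hi."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, lo, hi = gen(rng)
        if not (lo <= buggy_clamp(x, lo, hi) <= hi):
            return x, lo, hi
    return None

wide_search = find_violation(default_gen, trials=1000)
focused_search = find_violation(boundary_gen, trials=1000)
```

With wide inputs, the chance of drawing exactly `hi + 1` in a given trial is on the order of one in a billion, so the search is expected to come back empty. The boundary-focused generator hits the trigger about once every ten trials.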
[AI-99] Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems
Link: https://arxiv.org/abs/2605.15228
Authors: Jun He,Deying Yu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 2 figures, 4 tables
Abstract:Modern cloud and enterprise systems rely on identity-centric authorization, assuming that callers possessing valid credentials are safe to execute commands. The emergence of autonomous AI agents invalidates this assumption: agents can generate syntactically valid but semantically unsafe actions, making standing privileges a significant operational risk. This risk becomes especially acute in sovereign AI systems, where autonomous agents may interact with cloud infrastructure, regulated data, financial workflows, and national-scale digital services. Governed mutation substrates reduce this risk by interposing on agent actions: agents submit intents, infrastructure evaluates context and policy, and execution is mediated. However, this shifts the trust boundary: how can the decision to authorize an intent be made verifiable, distributed, and replayable? We introduce a Distributed Trust Framework (DTF), a verification framework for governed mutation systems that computes execution authority from structured, verifiable artifacts. DTF introduces a Justification Proof to encode the admissibility basis of an action, a consensus model for independent evaluation, an ephemeral Execution Identity derived from the approved proof, and an append-only Evidence Chain that preserves the authorization lifecycle. Under stated substrate assumptions, this architecture enforces a compact authorization invariant: no high-stakes execution without a proof object, no derived authority without consensus, and no valid mutation detached from evidence. We define the model, instantiate it over an OpenKedge-based governed mutation substrate, and show how it maps onto cloud-native environments. By shifting authorization from standing identity to proof-derived authority, DTF provides an infrastructure foundation for making agentic execution governable, auditable, and bounded in sovereign AI deployments. 
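The authorization invariant in the abstract (no execution without a proof object, no derived authority without consensus, no mutation detached from evidence) can be sketched as follows. All class and function names, the simple vote-count quorum, and the SHA-256 hash chain are assumptions made for illustration; the abstract does not specify DTF's actual data structures or consensus model.

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass(frozen=True)
class JustificationProof:
    """Hypothetical proof object: an intent plus its admissibility basis."""
    intent: str
    policy_basis: str

    def digest(self) -> str:
        payload = json.dumps(
            {"intent": self.intent, "basis": self.policy_basis}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()

@dataclass
class EvidenceChain:
    """Append-only log where each link hashes the previous link and the event."""
    entries: list = field(default_factory=list)

    def append(self, event: str) -> str:
        prev = self.entries[-1][1] if self.entries else "genesis"
        link = hashlib.sha256((prev + event).encode()).hexdigest()
        self.entries.append((event, link))
        return link

    def verify(self) -> bool:
        prev = "genesis"
        for event, link in self.entries:
            if hashlib.sha256((prev + event).encode()).hexdigest() != link:
                return False
            prev = link
        return True

def authorize(proof, votes, quorum, chain):
    """Return an ephemeral execution identity, or None if a check fails."""
    if proof is None:
        chain.append("denied:no-proof")
        return None
    if sum(votes) < quorum:
        chain.append("denied:no-consensus:" + proof.digest())
        return None
    # The identity is derived from the approved proof, not from a standing credential.
    identity = hashlib.sha256(("exec:" + proof.digest()).encode()).hexdigest()[:16]
    chain.append("approved:" + identity)
    return identity
```

Every outcome, including denials, is written to the chain, so the authorization history can be replayed and any later tampering with an entry makes `verify()` fail.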
[AI-100] NIMO Controller: a self-driving laboratory orchestrator based on the Model Context Protocol
Link: https://arxiv.org/abs/2605.15227
Authors: Naruki Yoshikawa,Ryo Tamura
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Robotics (cs.RO)
Comments: 9 pages, 4 figures
Abstract:Self-driving laboratories (SDLs) have attracted increasing attention as a means of accelerating scientific discovery; however, developing SDL software remains technically demanding. To improve accessibility, orchestration software frameworks have been proposed to coordinate SDL components. Nevertheless, existing frameworks are primarily designed for human interaction and do not provide standardized interfaces suitable for AI agents. In this work, we propose an SDL software architecture based on the Model Context Protocol (MCP), in which all SDL functionalities are exposed through MCP servers. Following this design principle, we introduce an MCP-based SDL orchestrator, named NIMO Controller. It provides a visual programming interface automatically generated through MCP-based tool discovery, allowing human users to design experimental workflows without writing code. The same MCP backend can also be accessed by AI agents, providing a unified interface for both human users and AI agents. We demonstrate the proposed system through a case study on a color-matching SDL. The results validate the usability of the proposed MCP-based SDL architecture.
[AI-101] Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench
Link: https://arxiv.org/abs/2605.15226
Authors: Qingyun Zou,Feng Yu,Hongshi Tan,Bingsheng He,WengFai Wong
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i) Software and hardware are fundamentally different engineering tasks: the same agent loses 37% to 58% from SWE-bench Verified to Phoenix-bench because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii) Failures concentrate on design control-flow / finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii) Localization granularity matters far more than localization itself: a perfect file-level oracle yields only +1.4% because the agent then breaks files that did not need editing, while a single round of test case feedback lifts resolved rate by 42% to 45% because the test case tells where the bug is and what the fix has to look like.
[AI-102] GenAI-Driven Approach to RISC-V Supply Chain Exploration
Link: https://arxiv.org/abs/2605.15223
Authors: Nenad Petrovic,Andre Schamschurko,Yingjie Xu,Alois Knoll
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper presents an LLM-empowered workflow for RISC-V supply chain analysis, integrating Vision-Language Models (VLMs) and Model-Driven Engineering (MDE) to enable comprehensive, multimodal data-driven insights. The proposed approach addresses the challenges of heterogeneous and unstructured supply chain data by leveraging LLMs for textual understanding and VLMs for extracting information from visual artifacts such as diagrams, tables, and scanned documents. These models collaboratively identify key entities and relationships, which are then organized into a knowledge graph representing supply chain components and their interdependencies. For analytical reasoning, the workflow incorporates MDE techniques and constraint-based modeling to enable formal validation of dependencies, detection of bottlenecks, and assessment of risks. The synergy between LLM- and VLM-based semantic understanding and MDE-based formal analysis supports both exploratory and systematic evaluation of supply chain resilience. A human-in-the-loop mechanism further enables interactive querying and expert validation. The approach is evaluated in RISC-V ecosystem scenarios, demonstrating its effectiveness in generating actionable insights, enhancing transparency, and supporting decision-making in complex semiconductor supply chains.
[AI-103] NOVA: Fundamental Limits of Knowledge Discovery Through AI
Link: https://arxiv.org/abs/2605.15219
Authors: Salman Avestimehr,Ken Duffy,Muriel Médard
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:
Abstract:Can AI systems discover genuinely new knowledge through iterative self improvement, and if so, at what cost? We introduce the NOVA framework, which models the common "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good–Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model’s effective discovery distribution to a Zipf law with exponent $\alpha > 1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=\Theta(c_{\mathrm{gen}} D^{\alpha})$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
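The scaling of cumulative cost with the number of discoveries can be checked numerically with a back-of-the-envelope heuristic. This is a sketch under simplifying assumptions that are not the paper's proof: candidates are drawn i.i.d. from a Zipf law over `n_items` ranks, verification is perfect, and items are assumed to be discovered in rank order, so the expected number of draws before the next new item is one over the remaining tail mass. The function name and all parameters are hypothetical.

```python
import math

def cumulative_cost(alpha, max_d, n_items=200_000, c_gen=1.0):
    """Expected generation cost to reach 0..max_d discoveries (rank-order heuristic)."""
    weights = [k ** -alpha for k in range(1, n_items + 1)]
    total = sum(weights)
    tail = total  # unnormalised mass of items not yet discovered
    running, costs = 0.0, [0.0]
    for d in range(max_d):
        running += c_gen * total / tail  # expected draws until a new item appears
        costs.append(running)
        tail -= weights[d]  # the item of rank d + 1 is now known
    return costs

costs = cumulative_cost(alpha=2.0, max_d=1000)

# Log-log slope of cost between D = 100 and D = 1000.
slope = math.log(costs[1000] / costs[100]) / math.log(10)
```

For a Zipf exponent of 2 the estimated slope comes out close to 2, in line with a cost that grows like D^α: each further discovery is more expensive than the last because the undiscovered mass keeps shrinking.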
[AI-104] CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation
Link: https://arxiv.org/abs/2605.15218
Authors: Chenying Lin,Yichen Hai,Yi He,Ran Wang,Haiyan Qiang,Liang Yu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 8 pages, 6 figures, IEEE conference format
Abstract:Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation. This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components – the recovery ladder. CAX-Agent organizes execution into three layers – LLM service, agent harness, and solver backend – with a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen’s kappa = 0.84, 96 percent of score pairs within one point). Model_only achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84), outperforming rule_only (0.7733, 3.17/4, 7.03/10, 0.00) and no_recovery (0.6933, 2.74/4, 5.60/10, 0.00) with large effect sizes (Cliff’s delta = 0.81-0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.
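The escalation pattern of a recovery ladder can be sketched in a few lines. This is an illustration under stated assumptions, not CAX-Agent's implementation: `toy_execute` is a stand-in for the solver backend, `rule_patch` and `model_regen` are hypothetical stages (the latter returns a canned script where a real system would call an LLM), and the context-enrichment rung is omitted.

```python
def toy_execute(script):
    """Stand-in solver: return an error string, or None on success."""
    if not script.startswith("/PREP7"):
        return "missing /PREP7"
    if "ET,1" not in script:
        return "no element type"
    return None

def rule_patch(script, error):
    """Deterministic fix for one known error; None means 'cannot help'."""
    if error == "missing /PREP7":
        return "/PREP7\n" + script
    return None

def model_regen(script, error):
    """Placeholder for model-driven regeneration (canned output here)."""
    return "/PREP7\nET,1,SOLID185\n"

def run_with_recovery(script, execute, ladder):
    """Try each rung in order; return (succeeded, name of the rung that fixed it)."""
    error = execute(script)
    if error is None:
        return True, None
    for name, stage in ladder:
        candidate = stage(script, error)
        if candidate is None:
            continue  # this rung cannot help, so escalate
        error = execute(candidate)
        if error is None:
            return True, name
        script = candidate  # carry the partial repair to the next rung
    return False, None  # ladder exhausted: hand off to a human

ladder = [("rule_patch", rule_patch), ("model_regen", model_regen)]
fixed_by_rule = run_with_recovery("ET,1,SOLID185", toy_execute, ladder)
fixed_by_model = run_with_recovery("/PREP7\nBLOCK,0,1,0,1,0,1", toy_execute, ladder)
needs_human = run_with_recovery("/PREP7\nBLOCK,0,1,0,1,0,1", toy_execute, ladder[:1])
```

Ordering the rungs from cheapest to most expensive means a deterministic patch is tried before any model call, and a human is involved only when every automated rung has failed.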
[AI-105] Fair outputs Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
Link: https://arxiv.org/abs/2605.15217
Authors: Jagdish Tripathy,Marcus Buckmann
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 39 pages, 16 figures, 2 tables
Abstract:Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric - steering interventions affect decisions in one demographic direction, while producing minimal effects in reverse - and susceptible to adversarial prompt engineering and parameter-efficient fine-tuning. These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.
[AI-106] SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces
Link: https://arxiv.org/abs/2605.15215
Authors: Duling Xu,Zheng Chen,Zaifeng Pan,Jiawei Guan,Dong Dong,Jialin Li,Bangzheng Pu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Recently, skills have been widely adopted in large language model (LLM)-based agent systems across various domains. In existing frameworks, skills are typically injected into the agent reasoning loop as contextual guidance once matched to a runtime task, enabling specialized task-solving capabilities. We find that this execution paradigm introduces two major sources of redundancy: irrelevant context injection and repeated skill-specific reasoning and planning. To this end, we propose SkillSmith, a boundary-first compiler-runtime framework that compiles skill packages offline into minimal executable interfaces. By extracting fine-grained operational boundaries from skills, SkillSmith enables agents to dynamically access and execute only the relevant components at runtime, thereby minimizing unnecessary context injection and redundant reasoning overhead. In the evaluation on SkillsBench benchmark, SkillSmith reduces solve-stage token usage by 57.44%, thinking iterations by 42.99%, solve time by 50.57% (2.02x faster), and token-proportional monetary cost by 57.44% compared with using raw-skills. Moreover, compiled artifacts produced by a stronger model can be reused by a smaller or more efficient runtime model, improving task accuracy in cases where raw skill interpretation fails. The source code and data are available at this https URL.
[AI-107] Fault tolerance estimation in digital circuits with visualised generative networks
Link: https://arxiv.org/abs/2605.15212
Authors: Sascha Biel,Carl Alexander Gaede,Amiel Glaser,Jan Wolter,Alexej Schelle
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 7 pages, 7 figures, 1 table
Abstract:We propose a new numerical method to estimate the fault tolerance of failure modes in digital circuit structures with a generative network sampling technique. From a random input of generated bitwise configurations of ideally digitalised analog currents in the digital circuit design with classical logical gates, expected output currents are compared to the realistic signals of a numerical experiment at the discriminator part of the Generative Adversarial Network (GAN) to calculate the deviation from ideal digital electronic signals, including various error modes, such as missing or interchanged logical devices. From the present analysis of a representation of the GAN in terms of complex variables, it is possible to evaluate the robustness in electronic designs by differentiating the impact of failure modes associated with different classical logical elements in the circuit.
[AI-108] Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels
Link: https://arxiv.org/abs/2605.15208
Authors: Plawan Kumar Rath,Rahul Maliakkal
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, 4 tables. Accepted at IEEE Cloud Summit 2026. This is the author’s accepted version; the version of record will appear in IEEE Xplore
Abstract:Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly understood. Existing studies typically compare only two conditions (full-precision vs. a single quantized variant), rely on aggregate bias metrics, and evaluate a single model family, making it impossible to distinguish gradual degradation from threshold-dependent safety failures. We conduct a controlled empirical study of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 through 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Our results reveal that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression, while models’ willingness to select “unknown” answers declines by 17.4%. Crucially, these item-level changes are invisible to standard quality metrics: perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit. These findings demonstrate that aggregate evaluation metrics systematically miss fairness-critical degradation, underscoring the need for quality-aware compression protocols that explicitly test for bias emergence before deployment.
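Why an aggregate metric can hide item-level change is easy to see on a toy example. The flags below are invented for illustration and are not the paper's data; `bias_emergence_rate` is a hypothetical helper that measures the share of items that were unbiased at baseline but biased after quantization.

```python
def bias_emergence_rate(baseline_biased, quantized_biased):
    """Share of items unbiased at baseline that are biased after quantization."""
    pairs = list(zip(baseline_biased, quantized_biased))
    unbiased_before = [q for b, q in pairs if not b]
    if not unbiased_before:
        return 0.0
    return sum(unbiased_before) / len(unbiased_before)

# Hypothetical per-item flags: 1 = stereotypical answer, 0 = not.
baseline = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
quantized = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

aggregate_before = sum(baseline) / len(baseline)
aggregate_after = sum(quantized) / len(quantized)
emergence = bias_emergence_rate(baseline, quantized)
```

In this toy data the aggregate biased rate is identical before and after, yet two of the seven previously unbiased items have flipped. Only a paired, per-item comparison exposes that.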
[AI-109] AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
Link: https://arxiv.org/abs/2605.15206
Authors: Dzung Pham,Kleomenis Katevas,Ali Shahin Shamsabadi,Hamed Haddadi
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: ACM CAIS '26
Abstract:Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi-step tasks such as coding or web-based question answering. While remote, cloud-based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage-based fees. However, agentic workflows are far more resource-intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM-based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single-inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low-cost execution signals, such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (5% utility drop) for challenging web-based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy-preserving LLM agents on user devices. Our project code and data are available at this https URL.
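A supervisor that stops a trajectory based on token-level log probabilities could look roughly like the sketch below. It is an assumption-laden stand-in: the abstract describes a predictor, whereas this sketch uses a fixed threshold on a moving average, and `window`, `threshold`, `min_steps` and both function names are hypothetical.

```python
from statistics import mean

def should_stop(step_logprobs, window=3, threshold=-1.5, min_steps=3):
    """Stop when the mean token log-prob over the last `window` steps is low."""
    if len(step_logprobs) < min_steps:
        return False  # too little evidence to judge the trajectory
    recent = step_logprobs[-window:]
    return mean(mean(tokens) for tokens in recent) < threshold

def run_agent(trajectory, **kwargs):
    """Replay per-step token log-probs; return the number of steps executed."""
    seen = []
    for tokens in trajectory:
        seen.append(tokens)
        if should_stop(seen, **kwargs):
            break  # terminate early instead of spending more compute
    return len(seen)

# Hypothetical trajectories: one confident, one that degrades after step 1.
confident = [[-0.1, -0.2]] * 6
struggling = [
    [-0.1, -0.1], [-1.8, -2.2], [-2.0, -2.5],
    [-2.4, -2.1], [-2.2, -2.6], [-2.0, -2.0],
]
steps_confident = run_agent(confident)
steps_struggling = run_agent(struggling)
```

The confident trajectory runs all six steps, while the struggling one is cut after four, so the last two steps of compute are never spent. A learned predictor would replace the fixed threshold, but the control flow stays the same.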
[AI-110] Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations
Link: https://arxiv.org/abs/2605.15205
Authors: Nanxu Gong,Zixin Chen,Haotian Li,Zishu Zhao,Jianxun Lian,Huamin Qu,Yanjie Fu,Xing Xie
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, the existing benchmarks often measure ToM capability improvement through story-reading, multiple-choice questions from a third-person perspective, while ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions. To directly examine how ToM improvement techniques benefit HAI interactions, we first proposed the new paradigm of interactive ToM evaluation with both perspective and metric shifts. Next, following the paradigm, we conducted a systematic study of four representative ToM enhancement techniques using both four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling). Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing the necessity of interaction-based assessments in developing next-generation, socially aware LLMs for HAI symbiosis.
[AI-111] SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch
Link: https://arxiv.org/abs/2605.15204
Authors: Zhantao Wang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 14 tables
Abstract:Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.
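Treating execution as a constrained state machine means a skill runs only if its precondition matches the current stage. The sketch below illustrates that idea with hypothetical recruitment stages and skills; `ToyStageDispatcher` and `Skill` are invented names and do not reproduce SDOF's StateAwareDispatcher or SkillRegistry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    required_stage: str   # precondition: stage the workflow must be in
    resulting_stage: str  # postcondition: stage after the skill succeeds

class ToyStageDispatcher:
    """Allow a skill only when its precondition matches the current stage."""

    def __init__(self, skills, start):
        self.registry = {s.name: s for s in skills}
        self.stage = start
        self.audit_log = []

    def dispatch(self, skill_name):
        skill = self.registry.get(skill_name)
        if skill is None:
            allowed, reason = False, "unknown skill"
        elif skill.required_stage != self.stage:
            allowed, reason = False, f"requires {skill.required_stage}, at {self.stage}"
        else:
            allowed, reason = True, "ok"
            self.stage = skill.resulting_stage
        # Every decision is logged, so blocked calls stay auditable.
        self.audit_log.append((skill_name, allowed, reason))
        return allowed

# Hypothetical recruitment workflow.
skills = [
    Skill("screen", "applied", "screened"),
    Skill("interview", "screened", "interviewed"),
    Skill("send_offer", "interviewed", "offered"),
]
dispatcher = ToyStageDispatcher(skills, start="applied")
```

Calling `dispatcher.dispatch("send_offer")` right away is rejected because the workflow is still at `applied`. The same call succeeds once `screen` and `interview` have run in order, which is the stage-order guarantee that a plain graph-based router does not give.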
[AI-112] Ensuring Logic in the Fog: Sound POMDP Synthesis with LTL Objectives IJCAI ECAI2026
Link: https://arxiv.org/abs/2605.12581
Authors: Can Zhou,Yulong Gao,Pian Yu
Institutions: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Optimization and Control (math.OC)
Comments: Accepted by IJCAI-ECAI 2026, the 35th International Joint Conference on Artificial Intelligence
Abstract:Synthesising autonomous agents that can navigate uncertain environments while adhering to complex temporal constraints remains a fundamental challenge. While Linear Temporal Logic (LTL) provides a rigorous language for specifying such tasks, the inherent undecidability of qualitatively verifying LTL satisfaction in partially observable Markov decision processes renders quantitative synthesis difficult, especially when designing reliable reward signals for approximate solvers. In this paper, we bridge this gap with a novel, sound reward-shaping mechanism that dynamically generates belief-dependent rewards grounded in certified LTL satisfaction. By integrating this mechanism into an enhanced Monte Carlo Planning framework, we empower agents to navigate the 'fog' of partial observability with a search process focused on maximising verifiable success. Our experiments demonstrate that this approach not only thrives in scenarios where existing solvers fail but also maintains effectiveness and scalability across diverse benchmark domains.
[AI-113] Symplectic Neural Operators for Learning Infinite Dimensional Hamiltonian Systems
Link: https://arxiv.org/abs/2605.15881
Authors: Yeang Makara,Yusuke Tanaka,Takashi Matsubara,Takaharu Yaguchi
Institutions: Unknown
Categories: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:
Abstract:The modeling and simulation of infinite-dimensional Hamiltonian systems are central problems in mathematical physics and engineering; however, they pose significant computational and structural challenges for standard data-driven architectures. In this work, we introduce the Symplectic Neural Operator (SNO), a neural operator architecture designed to preserve the symplectic structure intrinsic to Hamiltonian PDEs. We provide a theoretical characterization of its symplecticity and establish a rigorous long-term stability result based on the combination of symplectic structure preservation and learning accuracy. Numerical experiments on canonical Hamiltonian PDEs corroborate this theoretical result and show that SNOs exhibit improved energy behavior compared with non-structure-preserving neural operators.
[AI-114] α-TCAV: A Unified Framework for Testing with Concept Activation Vectors
Link: https://arxiv.org/abs/2605.15688
Authors: Ekkehard Schnoor,Jawher Said,Malik Tiomoko,Wojciech Samek,Alexander Jung
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)
Comments: 44 pages, 12 figures
Abstract:Concept Activation Vectors (CAVs) are a fundamental tool for concept-based explainability in deep learning, yet their practical utility is limited by statistical instability. We analyze the stochastic nature of CAVs and the Testing with CAVs (TCAV) method, deriving the distributions of major CAV classes including PatternCAV, FastCAV, and ridge regression-based CAVs. We then identify a fundamental flaw in the standard TCAV score: its reliance on a discontinuous indicator function induces non-decaying variance in critical regimes. To address this, we introduce α-TCAV, a generalized framework that replaces the indicator with a parameterized smooth function, yielding a unified probabilistic formulation that subsumes both TCAV and Multi-TCAV. We characterize the induced distributions of sensitivity scores and different TCAV variants, showing that established state-of-the-art choices lack theoretical justification. We provide principled guidance on tuning the parameter in α-TCAV – either to imitate Multi-TCAV at substantially lower computational cost, or to obtain a calibrated Bayes-optimal probabilistic measure of a concept’s influence. Finally, our analysis yields practical recommendations that challenge established routines: most notably, allocating the full sampling budget to a single CAV rather than splitting it across several.
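To illustrate the difference between an indicator-based score and a smoothed one, here is a rough sketch. The abstract does not specify the parameterized smooth function, so the logistic sigmoid and the `alpha` sharpness parameter below are assumptions chosen for illustration, not the paper's definition of α-TCAV.

```python
import math
from typing import Sequence


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tcav_score(sensitivities: Sequence[float]) -> float:
    # Standard TCAV: fraction of inputs with positive directional sensitivity.
    return sum(1.0 for s in sensitivities if s > 0) / len(sensitivities)


def smooth_tcav_score(sensitivities: Sequence[float], alpha: float) -> float:
    # Assumed smooth surrogate: the indicator 1[s > 0] is replaced by sigmoid(alpha * s).
    return sum(_sigmoid(alpha * s) for s in sensitivities) / len(sensitivities)


sens = [0.3, -0.1, 0.02, -0.5]   # made-up sensitivity values
hard = tcav_score(sens)
nearly_hard = smooth_tcav_score(sens, alpha=1000.0)
flat = smooth_tcav_score(sens, alpha=0.0)
```

With a large `alpha` the surrogate approaches the indicator-based score; with `alpha = 0` every input contributes 0.5 regardless of sign, so the parameter trades sharpness against sensitivity to small sign flips.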
[AI-115] FZ-Tree: An Ultra-Lightweight Waveform Classification Framework for Resource-Constrained Devices
Link: https://arxiv.org/abs/2605.15656
Authors: Hao Wang,Kuang Zhang,Yonggang Chi,Tianqi Zhao,Yanbo Fu,Jiaxing Guo
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:
Abstract:Under the trend of multi-waveform coexistence in 6G IoT, intelligent receivers must first identify physical-layer waveform types before performing correct demodulation and resource scheduling. However, existing signal identification research largely focuses on symbol-level modulation classification. Research directly targeting physical-layer waveform types (e.g., OFDM, OTFS, LoRa) is not only extremely scarce but also heavily reliant on deep neural networks and complex time-frequency transforms, making deployment on resource-constrained terminals difficult. Symbol modulation classification methods themselves cannot circumvent the prerequisite of "waveform identification first." To address this dual gap, we propose an ultra-lightweight waveform classification framework based on time-frequency multidimensional features with a cooperative Z-test tree (ZTree). The framework employs low-complexity time-domain feature extraction, and the classification backend adopts a ZTree optimized by Z-statistical testing, which uses hypothesis testing confidence to automatically control decision tree splitting and size, ensuring efficient execution on resource-limited processors. Tested on ten 6G candidate waveforms including OFDM, OTFS, DSSS, LoRa, and NB-IoT, the method achieves 99.5% average accuracy under AWGN and 87.4% under TDL-C multipath channels, with main confusion between OTFS and LoRa. Implemented in C on an x86 platform, single inference latency is under 4 ms. To the best of our knowledge, this is the first work achieving real-time recognition of ten IoT waveform types. Future work will target deployment acceleration on embedded MCUs. Code and dataset are open-sourced at: this https URL.
[AI-116] Diagonal Adaptive Non-local Observables on Quantum Neural Networks
Link: https://arxiv.org/abs/2605.15410
Authors: Huan-Hsin Tseng,Yan Li,Hsin-Yi Lin,Samuel Yen-Chi Chen
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICCCN2026
Abstract:Adaptive Non-local Observables (ANOs) have shown that making quantum observables dynamic can substantially enlarge the function space of Variational Quantum Algorithms, partly shifting hardware demands from circuit synthesis to measurement design. However, this advantage is accompanied by a steep increase in the number of parameters, as well as the classical optimization cost for varying general Hermitian observables. We propose a special form of ANO that significantly reduces this burden by considering only diagonal observables paired with quantum circuits. Mathematically, this is equivalent to the full ANO of a large parameter space since diagonal matrices are canonical representatives of the ANO space modulo unitary similarity. As a result, Diagonal ANO retains the same capability of full ANO while reducing k-local observable complexity from O(4^k) to O(2^k) and lowering the corresponding measurement-side classical computation. In this sense, diagonal ANO preserves much of the benefit of full ANO while encompassing conventional VQCs as a special case.
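The equivalence modulo unitary similarity can be seen from the spectral theorem. The following is a brief sketch; the symbols $H$, $U$, $D$ and $|\psi\rangle$ are notation chosen here for illustration and are not taken from the paper.

```latex
% Any Hermitian observable on k qubits diagonalizes under a unitary U:
H = U D U^{\dagger}, \qquad D = \mathrm{diag}(\lambda_1, \dots, \lambda_{2^k}), \quad \lambda_i \in \mathbb{R}.

% The expectation value moves the unitary from the observable onto the state:
\langle \psi | H | \psi \rangle
  = \langle \psi | U D U^{\dagger} | \psi \rangle
  = \langle \psi' | D | \psi' \rangle,
\qquad |\psi'\rangle = U^{\dagger} |\psi\rangle .

% Parameter counts: a general Hermitian matrix of size 2^k x 2^k has
% (2^k)^2 = 4^k real parameters, whereas D has only 2^k.
```

In other words, if the circuit is expressive enough to absorb $U^{\dagger}$, only the $2^k$ diagonal entries remain to be learned on the measurement side.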
[AI-117] Do Biological Structural Guarantees Earn Their Complexity?
Link: https://arxiv.org/abs/2605.15225
Authors: Bogdan Banu
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments:
Abstract:Biologically-inspired AI agent frameworks claim reliability benefits through structural guarantees adapted from gene regulatory networks, immune systems, and metabolic control. These claims are rarely tested empirically against simpler alternatives. We present three deep benchmarks: metabolic priority gating, autoinducer-based quorum sensing, and Bayesian stagnation detection, each comparing a biologically-grounded implementation against a naive non-biological alternative and an ablated control, across 1,000 trials per seed and 10 seeds (10M+ data points total).
Machine Learning
[LG-0] Dynamics-Level Watermarking of Flow Matching Models with Random Codes ATC
Link: https://arxiv.org/abs/2605.16239
Authors: Shuchan Wang
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 3 figures, code available at: this https URL
Abstract:We introduce a dynamics-level approach to watermarking generative models. Rather than embedding signals into model weights or outputs, we embed the watermark directly into the learned continuous dynamics – the velocity field of a flow matching model. We formulate this as random coding over a continuous channel: a key-dependent perturbation is added during training, and the message is recovered at detection time from black-box queries. The perturbation is designed to leave the generated distribution unchanged. Experiments on MNIST and CIFAR-10 across different architectures confirm reliable message recovery, preserved generation quality, and chance-level decoding accuracy without the secret key.
[LG-1] The Privacy Price of Tail-Risk Learning: Effective Tail Sample Size in Differentially Private CVaR Optimization
Link: https://arxiv.org/abs/2605.16219
Authors: El Mustapha Mansouri
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 34 pages, 3 figures, 2 tables
Abstract:Differential privacy changes the effective sample size governing CVaR learning. For tail mass $\tau$, the privacy-relevant sample size is not $n$, but $n\tau$; equivalently, the effective private tail sample size is $\epsilon n\tau$. Private CVaR excess risk decomposes into ordinary tail-risk statistical error and a privacy price. This decomposition is complete for scalar estimation and finite classes: scalar estimation has rate $\Theta\big(B \min\{1, (n\tau)^{-1/2} + (\epsilon n\tau)^{-1}\}\big)$, and finite classes of size $M$ have rate $\Theta\big(B \min\{1, \sqrt{\log(2M)/(n\tau)} + \log(2M)/(\epsilon n\tau)\}\big)$. These complete rates hold under pure DP, and their lower bounds extend to approximate DP in the stated small-$\delta$ regimes. For convex Lipschitz learning, modular upper and lower reductions show that the CVaR-specific privacy term necessarily scales as $1/(\epsilon n\tau)$, with dimension dependence inherited from private stochastic convex optimization. Together, these results identify ordinary private learning on $\Theta(n\tau)$ informative tail records as the canonical hard subproblem inside private CVaR learning.
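To get a feel for the two rates, the expressions inside the $\Theta(\cdot)$ can be evaluated directly. This is a small sketch that ignores the hidden constants; the function names and the example values of $n$, $\tau$ and $\epsilon$ are illustrative assumptions.

```python
import math


def scalar_rate(n: int, tau: float, eps: float, B: float = 1.0) -> float:
    # B * min{1, (n*tau)^(-1/2) + (eps*n*tau)^(-1)}, constants omitted.
    statistical = (n * tau) ** -0.5
    privacy = 1.0 / (eps * n * tau)
    return B * min(1.0, statistical + privacy)


def finite_class_rate(n: int, tau: float, eps: float, M: int, B: float = 1.0) -> float:
    # B * min{1, sqrt(log(2M)/(n*tau)) + log(2M)/(eps*n*tau)}, constants omitted.
    log_term = math.log(2 * M)
    statistical = math.sqrt(log_term / (n * tau))
    privacy = log_term / (eps * n * tau)
    return B * min(1.0, statistical + privacy)


# n = 10,000 records with tail mass 0.01 leave about 100 informative tail records.
moderate = scalar_rate(n=10_000, tau=0.01, eps=1.0)   # about 0.1 + 0.01
tiny = scalar_rate(n=10, tau=0.01, eps=1.0)           # bound saturates at B
small_class = finite_class_rate(n=10_000, tau=0.01, eps=1.0, M=1)
large_class = finite_class_rate(n=10_000, tau=0.01, eps=1.0, M=100)
```

The first term is the ordinary statistical error on $n\tau$ tail samples; the second is the privacy price, which dominates once $\epsilon$ is small relative to $(n\tau)^{-1/2}$.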
[LG-2] Hypothesis-driven construction of mesoscopic dynamics
Link: https://arxiv.org/abs/2605.16211
Authors: Zhuoyuan Li,Aiqing Zhu,Qianxiao Li
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 38 pages, 10 figures
Abstract:Traditional scientific modeling typically begins with fixed, instance-wise effective equations and then carries out equation-specific analysis and computation, a procedure that becomes exceptionally challenging in complex applications such as multiscale systems. We propose an alternative paradigm by learning mesoscopic dynamics within a mathematically constrained hypothesis class. Building upon a generalized Onsager principle, we introduce a unified framework encompassing both dissipative and conservative mesoscopic dynamics. We establish uniform and a priori theoretical guarantees, including global well-posedness, asymptotic stability, unique factorization identifiability, and discrete energy dissipation, applicable to all spatio-temporal evolution equations within this hypothesis class prior to all learning stages. Data from each problem instance is then used to guide the identification of members within our hypothesis class, giving rise to accurate, robust and interpretable dynamical models. We empirically validate this framework on both data from continuum PDE models as a check, and on data arising from microscopic chain models for which exact meso-scale models are unknown. The proposed approach not only acts as an effective dynamics learner, but also offers vital interpretable diagnostics of the underlying physics.
[LG-3] Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
Link: https://arxiv.org/abs/2605.16184
Authors: Yishun Lu,Junhao Zhang,Zeyu Yang,Wes Armour
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, allowing expensive inverse-root computations to proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization frequency while preserving optimizer effectiveness through topology-aware coordination. We evaluate Asteria on both memory-constrained and distributed training settings. On a DGX Spark platform with a single GB10 GPU and 128GB unified memory, Asteria supports second-order training for a 1B-parameter language model. On multi-node GH200 systems, it lowers visible optimizer overhead, reduces recurring latency spikes, accelerates convergence in wall-clock time, and maintains the optimization advantages of SOAP and KL-Shampoo in a 7B-parameter language model. Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.
[LG-4] Imitation learning for clinical decision support in pediatric ECMO
Link: https://arxiv.org/abs/2605.16175
Authors: Fateme Golivand,Michael Skinner,Saurabh Mathur,Ameet Soni,Phillip Reeder,Kristian Kersting,Lakshmi Raman,Sriraam Natarajan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Pediatric critical care is a dynamic, high-stakes process involving constant monitoring and adjustments in life-saving treatments. Modeling these interventions is crucial for effective decision support. To address the challenges of high complexity and data scarcity in pediatric Extracorporeal Membrane Oxygenation (ECMO), we frame clinical decision-making as learning to act from trajectories, i.e., imitation learning that learns action models from observational data, with a key feature that actions are not directly observed. We consider TabPFN, a recent transformer-based approach for tabular data, and traditional baselines including XGBoost and Multi-Layer Perceptrons (MLPs) on real-world pediatric ECMO data to learn the action models. We find that the TabPFN-based approach consistently outperforms these classical baselines, supporting its use as a strong clinician-behavior baseline for pediatric ECMO decision support.
[LG-5] BAPR: Bayesian amnesic piecewise-robust reinforcement learning for non-stationary continuous control
Link: https://arxiv.org/abs/2605.16170
Authors: Yifan Zhang,Liang Zheng
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Real-world control systems frequently operate under piecewise stationary conditions, where dynamics remain stable for extended periods before undergoing abrupt regime changes. Standard robust RL methods face a fundamental dilemma: a globally conservative policy wastes performance during stable periods, while a locally adaptive policy risks catastrophic failure when the regime changes undetected. We propose BAPR (Bayesian Amnesic Piecewise-Robust SAC), which unifies Bayesian Online Change Detection (BOCD) with robust ensemble RL. The BAPR operator – a convex combination of mode-conditional Bellman operators weighted by a frozen belief distribution – is a $\gamma$-contraction. A complementary counterexample, machine-verified in Lean 4, establishes a sharp boundary: when beliefs depend on the Q-function, the contraction factor becomes $\gamma + \lambda\Delta$ (where $\Delta$ is the mode reward gap), and contraction fails exactly when $\gamma + \lambda\Delta \geq 1$. We derive a component-wise formal error budget for the abstract operator – every component machine-verified – bounding post-switch recovery; the budget applies to the abstract mode-mixture operator and inherits to the implemented shared-critic algorithm only through the frozen-parameter design intuition. All results are formally verified with no `sorry` (1,145 lines across 3 Lean 4 files, 22 machine-verified theorems). BOCD drives an adaptive conservatism mechanism: the policy becomes maximally conservative after detected change-points and smoothly relaxes as confidence grows, with detection delay $O(\log(1/\delta))$. A context-conditioning module trained via RMDM loss provides mode-aware representations from simulator-provided mode IDs at training time and requires no mode labels at deployment.
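The contraction claim for a frozen belief follows from a standard convexity argument. The sketch below uses notation introduced here for illustration ($b$ for the belief over modes $m$, $T_m$ for the mode-conditional Bellman operators, each assumed to be a $\gamma$-contraction in the sup norm); it is not the paper's Lean proof.

```latex
% Mixture operator with a belief b that does not depend on Q:
(T_b Q) = \sum_{m} b(m)\, T_m Q, \qquad b(m) \ge 0, \quad \sum_m b(m) = 1.

% Triangle inequality plus the per-mode contraction property:
\| T_b Q_1 - T_b Q_2 \|_\infty
  \le \sum_m b(m)\, \| T_m Q_1 - T_m Q_2 \|_\infty
  \le \sum_m b(m)\, \gamma\, \| Q_1 - Q_2 \|_\infty
  = \gamma\, \| Q_1 - Q_2 \|_\infty .

% Q-dependent beliefs (threshold quoted from the abstract):
% the factor becomes \gamma + \lambda\Delta, so contraction needs
% \lambda\Delta < 1 - \gamma; e.g. \gamma = 0.99 requires \lambda\Delta < 0.01.
```

The argument relies on $b$ being the same on both sides of the difference, which is exactly what breaks when the belief is allowed to depend on the Q-function.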
[LG-6] Entropic Auto-Encoding via Implicit Free-Energy Minimization
Link: https://arxiv.org/abs/2605.16164
Authors: Hazhir Aliahmadi,Irina Babayan,Greg van Anders
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 5 figures
Abstract:Despite their ubiquity, variational autoencoders (VAEs) inherently suffer from posterior collapse, a failure mode in which latent variables are effectively ignored. This failure arises because explicit prior imposition drives optimization toward loss landscape regions corresponding to uninformative latent representations. Here, we introduce Entropic Autoencoders (EAEs), a framework in which reconstruction loss is the only explicit objective, and entropy generates the latent variables’ prior implicitly through a free energy-minimizing ensemble of encoders. This ensemble biases learning toward high-volume regions of near-optimal solutions, while decoder updates direct the search trajectories toward informative latent representations. We demonstrate that EAEs mitigate posterior collapse by learning non-Gaussian, multimodal latent distributions that yield diverse, data-consistent generations and preserve different forms of underlying structure in the data. As a proof-of-concept, we show that an EAE captures a superposition of the known low-dimensional dynamics of a reaction-diffusion process. Then, we show that an EAE identifies implicit categorical distinctions in MNIST latent representations, and displays a hierarchical understanding of facial structure on the CelebA dataset, from an “all-human” face to individual-dependent features.
[LG-7] Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
Link: https://arxiv.org/abs/2605.16154
Authors: Vaidehi Bagaria,Nikshep Grampurohit,Pulkit Verma
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO’s learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38× wall-clock speedup, 4.8× faster gradient updates, and 60% lower peak activation memory, while backpropagating through fewer than 20% of trajectory chunks.
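The budgeted sampling step can be sketched as weighted sampling without replacement over chunks. Everything below is an assumption made for illustration: the phase names, the `keep_prob` values, and the use of Efraimidis-Spirakis keys are not taken from the paper, and the online update of the keep probabilities is omitted.

```python
import random
from typing import Dict, List, Sequence


def sample_chunk_mask(
    chunk_phases: Sequence[str],
    keep_prob: Dict[str, float],
    budget: int,
    rng: random.Random,
) -> List[bool]:
    """Pick exactly `budget` chunks, favouring phases with a high keep probability."""
    keys = []
    for idx, phase in enumerate(chunk_phases):
        weight = keep_prob[phase]
        # Efraimidis-Spirakis key: u ** (1 / w); zero-weight chunks get the lowest key.
        key = rng.random() ** (1.0 / weight) if weight > 0 else 0.0
        keys.append((key, idx))
    kept = {idx for _, idx in sorted(keys, reverse=True)[:budget]}
    # True means "backpropagate through this chunk"; False means "skip it".
    return [i in kept for i in range(len(chunk_phases))]


# Hypothetical 10-chunk trajectory with three semantic phases.
phases = ["reach"] * 4 + ["grasp"] * 2 + ["lift"] * 4
probs = {"reach": 0.0, "grasp": 1.0, "lift": 0.0}
mask = sample_chunk_mask(phases, probs, budget=2, rng=random.Random(0))
```

Because the budget is fixed, the number of chunks entering the backward pass stays constant per trajectory, while the probabilities decide which phases those chunks come from.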
[LG-8] Multi-Fidelity Flow Matching: Cascaded Refinement of PDE Solutions
Link: https://arxiv.org/abs/2605.16118
Authors: Sipeng Chen,Junliang Liu,Hewei Tang,Shibo Li
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 2 figures, 7 tables. Preprint
Abstract:The source distribution in conditional flow matching is a design parameter that can be calibrated to data, not a default isotropic prior. We exploit this in Multi-Fidelity Flow Matching (MFFM), a cascade refinement framework for parametric PDE solutions: the source is calibrated to the empirical low-to-high-fidelity residual scale with local Gaussian-blur correlation, and the velocity network is conditioned on the low-fidelity solution. Conditioning makes the residual refinement problem substantially easier than unconditional field generation, while residual-calibrated source noise improves the flow-matching training geometry. A multi-resolution cascade applies the same construction independently between adjacent fidelities. After level-wise flow-matching pretraining, we fine-tune the composed cascade end-to-end with a deterministic one-step rollout, which makes one velocity evaluation per cascade level the optimized operating point at inference. The result is a learned analog of multigrid refinement that reaches the finest grid in L deterministic network evaluations per query. We validate MFFM on eight benchmarks: two super-resolution problems and six spatiotemporal forecasting tasks from PDEBench, The Well, and the FNO Navier–Stokes dataset.
[LG-9] Scalable neuromorphic computing from autonomous spiking dynamics in a clockless reconfigurable chip
Link: https://arxiv.org/abs/2605.16114
Authors: Eric Oliveira Gomes,Damien Rontani
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract:We propose a scalable neuromorphic architecture based on spiking dynamics emerging from the autonomous time-continuous evolution of clockless (asynchronous) digital circuits. Implemented on commercially available field-programmable gate arrays (FPGAs), our system implements networks of interacting Boolean spiking neurons with configurable excitatory and inhibitory synaptic weights. A complete processing pipeline enables efficient handling of spike-encoded data for solving machine-learning tasks. We demonstrate competitive performance for an audio classification task with spike-based encoding and high-speed processing. Power consumption is significantly lower than traditional digital implementations; this makes our approach an efficient alternative that bridges the gap to dedicated analog neuromorphic systems without the need for specialized hardware design. More generally, our approach establishes clockless digital hardware as a viable platform for neuromorphic computing. It paves the way for reconfigurable chips to be turned into energy-efficient quasi-analog neuromorphic processors.
[LG-10] ITGPT: Generative Pretraining on Irregular Timeseries
Link: https://arxiv.org/abs/2605.16069
Authors: Antoine Honoré,Ming Xiao
Categories: Machine Learning (cs.LG)
Comments: 9 pages
Abstract:Timeseries regression models often struggle to leverage large volumes of labeled multimodal data, particularly when the data are irregularly sampled or contain missing values. This is common in domains like healthcare and predictive maintenance, where data are collected from unreliable sources, and labeling requires expert knowledge or costly equipment. Transformer-based large language models have proven effective on structured data such as text through self-supervised learning (SSL) and generative pretraining (GPT) frameworks. However, such models lack the flexibility to efficiently process irregularly sampled multimodal timeseries data. In this paper, we introduce ITGPT, an attention-based architecture designed for handling multimodal, irregularly sampled timeseries by allowing training with both SSL losses and GPT-like objectives. We evaluate its performance on a healthcare task with the TIHM dataset, and a predictive maintenance task with the CompX dataset. Our results demonstrate that ITGPT achieves state-of-the-art performance without requiring resampling, feature fusion or explicit data imputation. Furthermore, when labels are scarce, ITGPT effectively leverages unlabeled data through SSL and GPT training, outperforming the purely supervised approach. This represents an important step towards efficiently using large and unstructured timeseries datasets for practical inference tasks.
[LG-11] SAFE Quantum Machine Learning with Variational Quantum Classifiers
Link: https://arxiv.org/abs/2605.16067
Authors: Ying Chen,Paolo Giudici,Vasily Kolesnikov,Paolo Recchia
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 31 pages, 8 figures
Abstract:We propose a variational quantum classifier operating on high-dimensional deep representations via amplitude encoding, stabilized by a learnable classical pre-encoding. By combining normalized amplitude embeddings with bounded quantum observables, the resulting model induces a structured and smooth hypothesis class with controlled sensitivity to input variations. Model reliability is assessed using SAFE-AI metrics derived from the Cramér-von Mises divergence, enabling consistent evaluation across accuracy, robustness, and explainability dimensions. Empirical results show that the proposed quantum model provides competitive predictive performance compared with strong classical baselines while exhibiting a more balanced SAFE reliability profile, with improved robustness to noise and stability under structured feature removal. These findings suggest that variational quantum circuits offer a principled mechanism for stability-oriented SAFE learning in safety-critical settings.
[LG-12] Mind Dreamer: Untethering Imagination via Active Latent Intervention on Latent Manifolds
Link: https://arxiv.org/abs/2605.16030
Authors: Shaojun Xu,Xiaoling Zhou,Yihan Lin,Yapeng Meng,Xinglong Ji,Luping Shi,Rong Zhao
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 34 pages, 7 figures
Abstract:Model-Based Reinforcement Learning (MBRL) leverages latent imagination for sample efficiency, yet remains constrained by Historical Tethering: imagination is typically initialized from observed states. This creates a learning asymmetry, where the world model’s manifold discovery outpaces the policy’s sparse-reward optimization. We propose Mind Dreamer (MD), a framework that operationalizes Active Latent Intervention (ALI) to transcend Markovian continuity. MD reformulates discovery as the minimization of a global Relay Manifold Expected Free Energy (R-EFE); by sampling initial states from a learned generator $s_0 \sim p_{\mathrm{gen}}(\cdot)$ rather than the historical buffer, MD utilizes an adversarial generator to synthesize non-continuous latent jumps to epistemic blind spots that are physically plausible yet cognitively challenging. To resolve the credit assignment paradox across these spatial ruptures, we derive the Relay Value Function (RVF) and Relay Uncertainty Function (RUF). These potentials treat synthesized anchors as counterfactual intermediary states, propagating pragmatic and epistemic value through a principled Bellman-style formulation. Notably, we prove that uncertainty propagation across discontinuities necessitates a quadratic discount $\gamma^2$, establishing a formal epistemic horizon. Theoretically, MD approximates a variance-minimizing importance sampler that expands the manifold’s spectral gap, reducing the hitting time to critical bottleneck states. Empirically, MD achieves a 1.67× average speedup over DreamerV3 on DeepMind Control Suite, reaching 8.8× in sparse-reward tasks.
[LG-13] Variational Autoregressive Networks with probability priors
Link: https://arxiv.org/abs/2605.16020
Authors: Piotr Białas,Piotr Korcyl,Tomasz Stebel,Dawid Zapolski
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Lattice (hep-lat)
Comments: 28 pages, 11 figures
Abstract:Monte Carlo methods are essential across diverse scientific fields, yet their efficiency is frequently hampered by critical slowing down, a sharp increase in autocorrelation times near phase transitions. Although deep learning approaches, such as neural-network-based samplers, have been proposed to alleviate this issue, they face another serious problem: the difficulty of training the models. This difficulty partially stems from the overly general nature of original machine-learning architectures, which often ignore underlying physical symmetries and force networks to relearn them from scratch. In this paper, we demonstrate that incorporating physical priors into the model significantly enhances performance. Building upon existing strategies that integrate spin-spin interactions, we propose a framework that utilizes a prior probability distribution as a starting point for training. Our results for the Ising model, as well as for the Edwards-Anderson spin glass model, suggest that moving away from 'blank slate' models in favor of physics-informed priors reduces the training burden and facilitates the simulation of larger system sizes in discrete spin models.
[LG-14] Accelerated Gradient Descent for Faster Convergence with Minimal Overhead
Link: https://arxiv.org/abs/2605.16017
Authors: Manuel Graca,L. Miguel Silveira,Arlindo Oliveira,Frank Liu
Categories: Machine Learning (cs.LG)
Comments: 17 pages
Abstract:In this paper, we present CT-AGD (Curvature-Tuned Accelerated Gradient Descent), an optimization method for non-convex optimization problems in deep learning training tasks. CT-AGD is a general boosting procedure that accelerates first-order methods by explicitly capturing the local curvature using finite-difference quotients and by applying heuristics aimed at mitigating the noise and bias introduced by stochastic mini-batch training. CT-AGD has storage and computational overhead comparable to that of adaptive gradient methods such as Adam. Our extensive experiments demonstrate that CT-AGD achieves the same level of accuracy as the baseline first-order methods, yet reduces the required training epochs by 33% on average.
[LG-15] Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning
Link: https://arxiv.org/abs/2605.16015
Authors: Vishnu Saj,Sushi Vemuri,Dileep Kalathil,Moble Benedict
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.
[LG-16] Entropy-Based Characterisation of the Polarised Regime in Latent Variable Models
链接: https://arxiv.org/abs/2605.15965
作者: Peter Clapham,Lisa Bonheme,Marek Grzes
类目: Machine Learning (cs.LG)
备注: 13 pages, 17 figures, under review at Neural Networks
Abstract:Variational Autoencoders (VAEs) often exhibit a polarised regime in which latent variables separate into active, passive, and mixed subsets. Existing criteria for identifying active dimensions depend on a Gaussian prior, limiting their applicability to variational models and specific priors. We propose a simple information-theoretic classification of the polarised regime based on the entropy of the mean representation. We show theoretically how this entropy couples to KL minimisation through entropy–variance bounds, and we relate the resulting criterion to Bonheme’s active/passive conditions. We also clarify a key limitation: entropy of the mean alone cannot reliably distinguish active from mixed dimensions without additional signals from the variance representation. Empirically, we evaluate the entropy criterion on $\beta$-VAEs, identifiable VAEs, Least-Volume Autoencoders, and L2-regularised autoencoders, and find that it consistently recovers a polarised regime when such a regime is present across the model classes studied. Finally, we show that passive dimensions can yield small but consistent improvements on downstream tasks when latent codes are appropriately normalised, suggesting that collapse is often a matter of scale rather than absolute information removal.
[LG-17] FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy
链接: https://arxiv.org/abs/2605.15944
作者: Qian He,Zhenshuo Yang,Wenqi Liang,Chunhui Hao,Nicu Sebe,Jiandong Tian
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注:
Abstract:Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that FocalPolicy outperforms existing approaches and confirm the generalizability of our modules to other baselines. Project website: this https URL
[LG-18] A Retrieval-Enhanced Transformer for Multi-Step Port-of-Call Sequence Prediction in Global Liner Shipping
链接: https://arxiv.org/abs/2605.15937
作者: Yanzhao Su,Fang He,Yineng Wang
类目: Machine Learning (cs.LG)
备注:
Abstract:Accurate multi-step port-of-call sequence prediction is vital for tactical resource orchestration and logistical efficiency. However, existing methods struggle with unreliable voyage schedules and the inability of AIS data to provide visibility beyond the immediate next port. To address this, this study proposes a Connectivity-Constrained and Retrieval-Enhanced (CCRE) deep learning framework. Inspired by Retrieval-Augmented Generation, CCRE introduces a retrieval-enhanced historical encoder that queries a global maritime database for contextually similar navigational precedents. Transforming these scenarios into candidate-level semantic representations compensates for data sparsity in long-tail routes and resolves routing ambiguities. Integrating this with a Transformer-based trajectory encoder, the architecture executes adaptive “middle fusion” via cross-attention. This dynamically shifts predictive reliance from real-time kinematics for short-term accuracy to historical context for long-term strategic stability. To ensure sequence-level coherence, forecasting is formulated as a joint sequence generation problem using an autoregressive Transformer decoder enriched with Scheduled Sampling and Gumbel-Softmax relaxation. This mitigates error accumulation, while topology masks strictly enforce maritime network reachability to eliminate operationally infeasible routes. Evaluated on a global dataset, CCRE achieves a 72.3% first-destination accuracy and a 61.4% average three-step accuracy, outperforming baselines like CatBoost and LSTM by average margins of 12.6% and 11.3%, respectively. Case studies further corroborate the model’s scalability and ability to capture complex routing patterns across diverse international trade lanes.
[LG-19] Context-aware Entity-Relation Extraction for Threat Intelligence Knowledge Graphs
链接: https://arxiv.org/abs/2605.15904
作者: Inoussa Mouiche,Sherif Saad
类目: Machine Learning (cs.LG)
备注: 16 pages
Abstract:Cybersecurity Knowledge Graphs (CKGs) unify diverse Cyber Threat Intelligence (CTI) sources into structured, queryable formats, offering scalable solutions for automating proactive and real-time security responses. Their increasing adoption has significantly enhanced the workflow and decision-making efficiency of security professionals. However, constructing CKGs requires extracting entity-relation triples from unstructured CTI reports, a task hindered by complex report structure, domain-specific language, and semantic ambiguity. As a result, existing pipeline-based approaches often suffer from error propagation, reducing extraction accuracy and limiting generalizability. This paper introduces the Context-aware Threat Intelligence Knowledge Graph (CTiKG) framework, a pipeline architecture designed to accurately extract and classify threat entities and their relationships from CTI reports. CTiKG incorporates hybrid NLP models that leverage SecureBERT+ contextual embeddings and expert knowledge from a domain ontology to reduce misclassifications and mitigate cascading errors. Experiments on the DNRTI-AUG-STIX2 dataset, which comprises 21 entity types aligned with STIX 2.1, demonstrate significant improvements over state-of-the-art baselines, yielding 3-4% gains in NER and up to 8% in RE performance, based on precision, recall, and F1-score. Additional validation on DNRTI and STUCCO benchmarks confirms the framework’s robustness and practical applicability. All datasets, including the curated DNRTI-AUG-STIX2, are released on GitHub to foster reproducibility and further research.
[LG-20] From Layers to Networks: Comparing Neural Representations via Diffusion Geometry
链接: https://arxiv.org/abs/2605.15901
作者: Atharva Khandait,Jan E. Gerken
类目: Machine Learning (cs.LG)
备注: 11 pages + appendices
Abstract:Diffusion geometry is a manifold learning framework that uses random walks defined by Markov transition matrices to characterize the geometry of a dataset at multiple scales. We use diffusion geometry for neural representations, incorporating tools from multi-view learning into this field for the first time. Our key technical observation is that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices, opening the door to manipulations from diffusion geometry. As a first application, we develop multi-scale variants of Centered Kernel Alignment and Distance Correlation, which utilise the $t$-th power of the underlying transition matrix to probe the data geometry at adjustable diffusion scales. Going further, we introduce variants of these measures which fuse the Markov matrices of several layers via alternating diffusion into a single operator that captures the network’s joint sample geometry, allowing similarity to be computed across multiple layers and shifting the comparison from layer-to-layer to network-to-network. We perform extensive numerical experiments, evaluating our measures on the Representational Similarity (ReSi) benchmark comprising 14 architectures trained on 7 datasets across three different domains. Our methods achieve SoTA results in accuracy and output correlation for both language and vision tasks across different models. We furthermore show SoTA performance on an additional benchmark evaluating on out-of-distribution data.
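As a rough illustration of the multi-scale idea in the abstract above, the sketch below builds a row-stochastic Markov matrix from one layer's activations, raises it to the t-th power, and compares two layers with a CKA-style normalized inner product. The Gaussian kernel, the `bandwidth` parameter, and all function names are assumptions made for this example; the paper's exact construction may differ.

```python
import numpy as np

def markov_matrix(X, bandwidth=1.0):
    """Row-stochastic transition matrix from a Gaussian kernel on the rows of X (assumed kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)  # each row sums to 1

def cka(A, B):
    """CKA-style score: normalized Frobenius inner product of double-centered matrices."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Ac, Bc = H @ A @ H, H @ B @ H
    return np.sum(Ac * Bc) / (np.linalg.norm(Ac) * np.linalg.norm(Bc))

def diffusion_cka(X, Y, t=1, bandwidth=1.0):
    """Compare two representations at diffusion scale t (hypothetical helper)."""
    Px = np.linalg.matrix_power(markov_matrix(X, bandwidth), t)
    Py = np.linalg.matrix_power(markov_matrix(Y, bandwidth), t)
    return cka(Px, Py)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                   # 20 samples, 5 features
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # random orthogonal rotation
print(diffusion_cka(X, X @ Q, t=3))            # rotation leaves pairwise distances unchanged
```

Larger `t` lets the random walk mix over longer ranges, so the comparison reflects coarser structure of the sample geometry.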
[LG-21] A Multi-Layer Cloud-IDS Pipeline with LLM and Adaptive Q-Learning Calibration
链接: https://arxiv.org/abs/2605.15889
作者: Syed Waqas Ali,Ibrar Ali Shah,Farzana Zahid,Daniyal Munir,Hans D. Schotten
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Security in cloud computing has become a major concern due to several factors such as layered cloud architectures, dynamic environments, and exposure to unseen or zero-day attacks. Moreover, intrusion detection systems (IDS) typically operate at specific layers and rely heavily on machine learning models, which often perform well in experimental settings but fail to sustain performance in real cloud deployments. In this work, we implement a confidence-aware multilevel intrusion detection system using reinforcement learning tailored for cloud environments. The system secures three distinct layers: network, host, and hypervisor. Machine learning models at each layer detect known attack patterns, while prediction confidence distinguishes reliable decisions from uncertain outcomes. Within the multi-gate flow, low-confidence events pass through a learned-threshold confidence gate (Gate-1), followed by a Chroma memory-matching gate (Gate-2), with unresolved events escalated to a large language model (LLM) for semantic analysis and explanation. Final attack promotion at Gate-3 uses calibrated LLM confidence or weighted-fusion fallback, while uncertain events are retained in a review bucket to avoid forced classification. Generated explanations and confirmed knowledge are stored in ChromaDB to support future analysis and retraining. The approach is first evaluated using static thresholds, establishing a baseline for comparison. Results show that the proposed system learns adaptive thresholds and reduces LLM escalation by 58.78%, lowering cost while maintaining strong performance (88.68% accuracy, 85.29% precision, 84.72% recall, 85.00% F1). The network and hypervisor layers achieve 98.02% and 97.08% accuracy, demonstrating a balanced and efficient detection system.
[LG-22] Practical Validity Conditions for Byzantine-Tolerant Federated Learning
链接: https://arxiv.org/abs/2605.15887
作者: Mélanie Cambus,Darya Melnyk,Tijana Milentijević,Stefan Schmid
类目: Machine Learning (cs.LG)
备注:
Abstract:Robust aggregation is the core operation in Byzantine-tolerant federated learning. To ensure the quality of aggregation independently of data distribution or attacks, validity conditions are needed. They provide geometric guarantees of where the output of the aggregation must lie. The widespread convex validity requires the output to lie in the convex hull of the honest vectors. Although this guarantee is strong in theory, it is poorly suited to modern federated learning systems, as it has dimension-dependent resilience and excludes many practical aggregation rules. We introduce the minimum enclosing ball (MEB) validity condition for robust aggregation, as well as its multiplicative relaxation, $c$-MEB validity, where $c$ is a constant. We show that exact MEB validity still suffers from limited resilience, while relaxed $c$-MEB validity is achievable if a majority of clients is honest, i.e. $n > 2t$. We give an optimal MinMax-MEB rule for the relaxed condition with the bound c\sqrt2 and prove explicit relaxed-MEB guarantees for standard aggregators including minimum-diameter averaging, medoid and geometric median. Finally, we relate MEB validity to convex, relaxed-convex and box validity studied in prior literature, thus providing a systematic map of geometric validity conditions for Byzantine-robust aggregation. Our results show that relaxed MEB validity connects validity conditions in distributed computing and Byzantine-tolerant aggregation rules, and offers a practical alternative to convex validity.
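Two of the standard aggregators named in the abstract above, the medoid and the geometric median, can be sketched as follows. This is not the paper's MinMax-MEB rule; it only shows how such aggregators stay near the honest vectors when an honest majority holds (here n = 7 and t = 2, so n > 2t). The Weiszfeld iteration, the toy data, and the function names are choices made for this example.

```python
import numpy as np

def medoid(vectors):
    """Input vector that minimizes the sum of distances to all other inputs."""
    V = np.asarray(vectors, dtype=float)
    dists = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    return V[np.argmin(dists.sum(axis=1))]

def geometric_median(vectors, iters=200, eps=1e-12, tol=1e-10):
    """Weiszfeld iteration for the point minimizing the sum of Euclidean distances."""
    V = np.asarray(vectors, dtype=float)
    z = V.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(V - z, axis=1), eps)  # avoid division by zero
        w = 1.0 / dist
        z_new = (w[:, None] * V).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

# Toy data: five honest updates near the origin, two Byzantine updates far away.
honest = [[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]]
byzantine = [[1000, 1000], [1000, 1000]]
updates = np.array(honest + byzantine, dtype=float)

print("mean:            ", updates.mean(axis=0))
print("medoid:          ", medoid(updates))
print("geometric median:", geometric_median(updates))
```

The plain mean is dragged far from the honest cluster, while both robust aggregators remain close to it.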
[LG-23] Ti-iLSTM: A TinyDL Approach for Logic-Level Anomaly Detection in Industrial Water Treatment Systems
链接: https://arxiv.org/abs/2605.15874
作者: Mandar Joshi,Farzana Zahid,Judy Bowen,Matthew M.Y. Kuo,Valeriy Vyatkin,Emil Karlsson
类目: Machine Learning (cs.LG)
备注: NA
Abstract:Industrial Water Treatment Systems (IWTS) are safety-critical cyber-physical infrastructures. Due to increased connectivity, these systems are exposed to cyber threats that can manipulate process behaviour without creating obvious device outliers. In particular, logic-layer deception anomalies can preserve numerically plausible measurements while breaking expected cause-and-effect relationships in the control process. These attacks are difficult to detect using threshold-based monitoring or require heavy server-oriented anomaly detection models. This paper explores the potential of Tiny Deep Learning (TinyDL) to provide lightweight on-device logic-level anomaly detection for resource-constrained Programmable Logic Controllers (PLCs). We propose a novel framework, TinyDL-based incremental LSTM (Ti-iLSTM), which optimises the memory and space footprint of Long Short-Term Memory (LSTM) networks to detect logic-layer inconsistencies in PLC-based IWTS. Experiments on the publicly available SWaT dataset show that the optimised model achieves high detection performance (F1-score=0.983 and ROC-AUC=0.998). A deployment-style validation on the WADI dataset confirms that the proposed lightweight framework remains applicable beyond a single dataset. The research demonstrates that combining logic-aware supervision with TinyDL sequence learning yields efficient and accurate anomaly detection suitable for resource-constrained PLCs in industrial environments.
[LG-24] From Observed Viability to Internal Predictive Approximation: A Single-Subject Latent-Space Analysis of Gait Dynamics Under Occlusal Constraint CCL
链接: https://arxiv.org/abs/2605.15862
作者: Jacques Raynal,Pierre Slangen,Elsa Raynal,Jacques Margerit
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 31 pages, 1 figure, 9 tables. Exploratory single-subject study combining gait analysis, occlusal observational probes, PCA-based latent-space modeling, and supervised predictive approximation
Abstract:Adaptive biomechanical systems may show similar observable gait performance while differing in latent organization and longitudinal behavior. This study examines whether an observed longitudinal transformation of gait organization can be approximated within a predictive latent-space framework, without claiming clinical prediction or causal occlusal effects. Using an exploratory single-subject design in a Parkinsonian participant, gait was recorded with instrumented insoles during two sessions separated by eleven weeks. Six occlusal observational probes were tested: natural occlusion, open-mouth disengagement, strong clenching, two vertical-dimension increases in centric relation, and one vertical-dimension increase with mandibular protrusion. Principal Component Analysis was used to construct a PC1–PC2 latent representation. A simplified supervised machine-learning model, implemented as a feed-forward neural network, was trained to approximate the observed M1–M2 transformation. The primary analysis focused on the three centric-relation conditions and tested whether the displacement hierarchy could be reproduced. The model preserved the ordering OC3 < ONL < OC2.5. The extended six-probe analysis also preserved the global structure of the exploratory displacement pattern, with OC3 and OC3P closely grouped and the highest displacements associated with OC2.5 and open-mouth disengagement. Held-out M2 and leave-condition-out analyses showed condition-dependent approximation variability. These findings do not establish generalizable prediction, therapeutic superiority, causal occlusal effects, or clinical viability forecasting. They support only the restricted conclusion that observed longitudinal latent transformations can be internally approximated within this single-subject dataset, providing a methodological bridge toward future multi-subject predictive viability models.
[LG-25] Complexity of Non-Log-Concave Sampling in Fisher Information
链接: https://arxiv.org/abs/2605.15859
作者: Sinho Chewi,Andre Wibisono
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:
Abstract:We study the query complexity of obtaining a relative Fisher information guarantee for sampling from a log-smooth non-log-concave distribution; this is a sampling analog of finding an approximate stationary point in optimization. Our algorithm is based on the proximal sampler, which is an implicit discretization of the Langevin diffusion, and requires an implementation of the backward step known as the restricted Gaussian oracle (RGO). We show that by leveraging the recent results for log-concave sampling with high-accuracy guarantees in Rényi divergence, we can obtain an approximate RGO implementation that – when used with the proximal sampler – yields a complexity guarantee in relative Fisher information that inherits the same dimension dependence as log-concave sampling, and improves upon prior work for non-log-concave sampling. We also show a converse reduction that any improvement in the dimension dependence in relative Fisher information for non-log-concave sampling will yield an improved dimension dependence for high-accuracy log-concave sampling.
[LG-26] Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage
链接: https://arxiv.org/abs/2605.15832
作者: Júlia Orteu Aubach,Fabio Banchelli,Marc Clascà Ramírez,Marta Garcia-Gasulla
类目: Performance (cs.PF); Machine Learning (cs.LG)
备注:
Abstract:This work extends a framework for predicting the performance of High-Performance Computing (HPC) workloads using Machine Learning (ML). A common limitation in performance modeling is the restricted number of hardware counters that can be collected simultaneously. To address this, we propose a heuristic-based methodology to merge execution traces from multiple runs, each instrumented with a different set of hardware counters. Our approach matches computation bursts across executions by analyzing MPI structure, timing, and communication patterns. This process enables the construction of a unified dataset that includes a wider set of hardware features without relying on multiplexing. The output is a new synthetic trace with all merged counters, which can be used both for HPC performance prediction and for conventional performance analysis. The methodology has been validated on the MareNostrum5 machine with a range of kernels and real applications. Results show that the merged counters maintain acceptable accuracy, with the level depending on the application, and can be directly used to train ML models on a richer feature space without prior counter selection.
[LG-27] Intrinsic Wasserstein Rates for Score-Based Generative Models on Smooth Manifolds
链接: https://arxiv.org/abs/2605.15822
作者: Guoji Fu,Taiji Suzuki,Wee Sun Lee,Atsushi Nitanda
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Score-based generative models are trained in high-dimensional ambient spaces, yet many data distributions are supported on low-dimensional nonlinear structures. We prove that, for compact $d$-dimensional smooth manifolds $\mathcal{M} \subset [0,1]^D$ with $d > 2$ and $\beta$-Hölder densities strictly positive on $\mathcal{M}$, a variance-preserving SGM estimator attains the intrinsic Wasserstein–1 sample exponent $\tilde{\mathcal{O}}\big(D^{\mathcal{O}_\beta(d)}\, n^{-(\beta+1)/(d+2\beta)}\big)$, up to logarithmic factors and explicit geometry and density factors. The full nonasymptotic bound explicitly isolates the finite-order geometry envelope, Hölder radius, density lower bound, ambient dependence, and finite-order correction terms. The analysis separates score approximation into a large-noise tangent-cell regime and a small-noise projection-centered, de-Gaussianized Laplace regime. The key technical ingredient is a ReLU implementation of nearest-projection coordinates via finite intrinsic anchors and Gauss–Newton iterations, rather than approximating the manifold projection as a black-box high-dimensional smooth map. Consequently, for families with polynomially controlled geometry and density lower bounds, the constructed score-network parameters have polynomial ambient dependence.
[LG-28] Martingale Neural Operators: Learning Stochastic Marginals via Doob-Meyer Factorization
链接: https://arxiv.org/abs/2605.15806
作者: Kai Hidajat
类目: Machine Learning (cs.LG)
备注:
Abstract:Neural operators excel as deterministic surrogates, but inevitably collapse to the conditional mean when applied to stochastic PDEs, discarding the variance and tail structure upon which uncertainty quantification depends. Recovering this structure typically requires Monte Carlo rollouts or grafted generative models, both of which surrender the one-shot efficiency and resolution invariance that define the operator paradigm. To resolve this, we draw on the Doob-Meyer theorem, which establishes that any semimartingale fundamentally decomposes into a predictable drift and an unpredictable, zero-mean martingale. Translating this theorem into an architectural prior, we introduce the Martingale Neural Operator (MNO). MNO maps an initial condition directly to the conditional mean and covariance of the terminal law, parameterized by a drift-like mean and a low-rank factor $B_\phi$ with $B_\phi^\top B_\phi$ positive semi-definite by construction. For our experiments, we use a Gaussian residual instantiation. Across 1D SPDEs, rough volatility, and 2D operator tasks, MNO reduces Wasserstein distance by up to $120\times$ on $\phi^4$ field theory and $68\times$ on stochastic Burgers, evaluating $\sim 3\times$ faster than a conditional diffusion baseline at matched wall-clock training budgets. On 2D tasks, MNO is comparable to FNO on zero-shot resolution transfer and turbulent flow, while quasi-deterministic systems such as Gray-Scott remain a failure mode.
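The "positive semi-definite by construction" claim in the abstract above follows from a one-line argument, written out here for reference. The symbol $\Sigma_\phi$ for the covariance and the shape $r \times n$ for the factor are notation assumed for this note, not taken from the paper.

```latex
\Sigma_\phi := B_\phi^\top B_\phi, \qquad B_\phi \in \mathbb{R}^{r \times n}
\;\Longrightarrow\;
v^\top \Sigma_\phi v = (B_\phi v)^\top (B_\phi v) = \lVert B_\phi v \rVert_2^2 \ge 0
\quad \text{for all } v \in \mathbb{R}^{n},
\qquad \operatorname{rank}(\Sigma_\phi) \le r .
```

No constraint on the entries of the factor is needed for validity, and the rank bound is what makes the parameterization low-rank when $r$ is much smaller than $n$.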
[LG-29] AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training
链接: https://arxiv.org/abs/2605.15793
作者: Qitan Lv,Hong Wang,Zhongkai Hao,Wen Wu,Xuenan Xu,Bowen Zhou,Feng Wu,Chao Zhang
类目: Machine Learning (cs.LG)
备注:
Abstract:Pre-training neural operators on diverse partial differential equation (PDE) datasets has emerged as a promising direction for building general-purpose surrogate models in scientific machine learning. However, the inherent complexity and structural diversity of PDE solution operators make multi-PDE pre-training fundamentally challenging. Existing methods mainly address this by increasing model capacity, while leaving the target solution operators unchanged. Inspired by classical numerical analysis, we instead propose to transform complex and diverse solution operators into simpler, better-aligned forms that are easier to model jointly. Since the optimal transformation varies across PDE types, it must be adaptive and input-dependent, allowing a single neural operator to approximate an entire family of operators. We instantiate this idea as AOT-POT (adaptive operator-transformation for pre-training operator transformer), which expands hidden representations into multiple parallel streams, adaptively aggregates and redistributes them before and after each sub-layer, and mixes streams through Sinkhorn-projected doubly stochastic matrices for stable training. These mechanisms together reshape diverse solution operators into a unified form that can be effectively modeled by a single architecture. Empirically, AOT-POT achieves state-of-the-art performance on 12 PDE benchmarks with only 3% additional parameters, reducing relative L2 error by up to 77.6% (40.9% on average). Fine-tuning AOT-POT further reduces L2 error by up to 92% on in-domain PDEs and 89% on out-of-domain PDEs (unseen types during pre-training), demonstrating that adaptive operator transformation is an effective and complementary direction for advancing PDE foundation models beyond simply scaling model capacity.
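The Sinkhorn projection mentioned in the abstract above can be illustrated with a short sketch: alternating row and column normalization turns a positive matrix into an approximately doubly stochastic one. How AOT-POT parameterizes and applies its mixing matrices is not specified here, so the function name, the exponential map, and the toy stream shapes are assumptions.

```python
import numpy as np

def sinkhorn_project(logits, iters=200):
    """Map unconstrained logits to an approximately doubly stochastic matrix."""
    M = np.exp(logits - logits.max())             # strictly positive entries
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)      # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)      # columns sum to 1
    return M

rng = np.random.default_rng(0)
mix = sinkhorn_project(rng.normal(size=(4, 4)))   # 4 parallel streams (toy size)
streams = rng.normal(size=(4, 8))                 # one 8-dimensional vector per stream
mixed = mix @ streams                             # each output is a convex combination of inputs

print(mix.sum(axis=0), mix.sum(axis=1))
```

Because the columns sum to one, the sum over streams is preserved by the mixing step, and because every row is a convex combination, no stream can be amplified without bound. That is one plausible reading of why such matrices help training stability.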
[LG-30] Learning Context-conditioned Gaussian Overbounds for Convolution-Based Uncertainty Propagation
链接: https://arxiv.org/abs/2605.15789
作者: Ruirui Liu,Xuejie Hou,Yiping Jiang,Hui Ren
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
备注:
Abstract:Uncertainty quantification is essential in safety-critical settings–from autonomous driving to aviation, finance, and health–where decisions must rely on conservative bounds rather than point estimates. Predictor-level intervals (e.g., from quantile regression, conformal prediction, variance networks, or Bayesian models) generally do not compose: adding two per-variable intervals need not yield a valid interval for their sum or preserve coverage. In aviation, Gaussian overbounding replaces complex error distributions with a conservative Gaussian whose tails dominate the truth, so conservatism propagates through linear operations. Yet classical overbounds are global, often overly conservative, and hard to adapt to feature-conditioned errors. We propose a unified learning framework that trains neural networks to produce context-aware Gaussian overbounds–mean and scale–with provable conservatism on a finite quantile grid and, under three explicit regularity assumptions, continuous-tail conservatism on a certified interval. Our overbounding loss enforces conservativeness at selected quantiles while penalizing distributional distance with a Wasserstein-style term. The learned bounds support conservative linear-combination and convolution analysis on the enforced grid, and on the certified interval when assumptions hold, while being less redundant than traditional methods. We provide a scoped analysis of discrete-to-continuous conservatism and compact-domain objective regularity, and validate on synthetic data and real-world datasets, including multipath, ionospheric, and tropospheric residual errors. Across these settings, the method yields tighter bounds while maintaining conservatism on the enforced grid and in experiments. The framework is modality-agnostic and applicable to learning systems that require conservative, feature-conditioned uncertainty estimates in dynamic environments.
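The sketch below shows what conservatism "on a finite quantile grid" can mean, using the classical, non-learned setting rather than the paper's neural framework. It picks the smallest zero-mean Gaussian scale whose two-sided quantiles dominate the empirical error quantiles at selected levels, then combines scales for a linear combination. The grid levels, the zero-mean and independence assumptions, and the function names are all choices made for this example.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist

def empirical_quantile(sorted_vals, q):
    """Smallest sample value with at least a fraction q of the samples at or below it."""
    idx = min(len(sorted_vals) - 1, max(0, ceil(q * len(sorted_vals)) - 1))
    return sorted_vals[idx]

def overbound_sigma(errors, levels=(0.68, 0.95, 0.99)):
    """Smallest sigma with sigma * z >= empirical |error| quantile at every grid level."""
    magnitudes = sorted(abs(e) for e in errors)
    std_normal = NormalDist()
    sigma = 0.0
    for q in levels:
        z = std_normal.inv_cdf((1.0 + q) / 2.0)   # two-sided Gaussian quantile
        sigma = max(sigma, empirical_quantile(magnitudes, q) / z)
    return sigma

def combined_sigma(sigmas, weights):
    """Scale of a weighted sum of independent zero-mean Gaussian bounds."""
    return sqrt(sum((w * s) ** 2 for w, s in zip(weights, sigmas)))

rng = random.Random(0)
errors = [rng.uniform(-1.0, 1.0) for _ in range(2000)]   # toy non-Gaussian errors
sigma = overbound_sigma(errors)
print(sigma, combined_sigma([sigma, sigma], [1.0, -1.0]))
```

The bound is only certified at the chosen levels. Extending it to a continuous interval is where the abstract's additional regularity assumptions come in.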
[LG-31] ADAPT: A Self-Calibrating Proactive Autoscaler for Container Orchestration
链接: https://arxiv.org/abs/2605.15788
作者: Himanshu Singh Baghel
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 9 pages, 5 figures, 3 tables. Includes reproducible simulation framework for proactive Kubernetes autoscaling with adaptive cold-start estimation and MPC-based scaling. Source code and experiment configurations available at: this https URL
Abstract:Proactive autoscaling for containerized workloads depends on knowing the provisioning delay, i.e., the time between a scaling decision and the moment new capacity is ready to serve traffic. In practice, this cold-start duration can vary substantially across environments and even across consecutive scale-out events. We present ADAPT (Adaptive Duration Approximation for Predictive Timing), an online EWMA estimator that tracks cold-start duration at runtime. ADAPT feeds a dynamic planning horizon, FH-OPT, into a Model Predictive Controller (MPC) that optimizes replica counts over a rolling window. Together, these components form a closed-loop proactive autoscaling design that adapts its lookahead based on measured provisioning delay. Evaluated across three policies (MPC+LSTM, MPC+Prophet, HPA) and six workload archetypes with five random seeds, MPC+LSTM achieves below 5% SLA violation on all workloads, compared with 7-19% for reactive HPA and up to 28.7% for MPC+Prophet on bimodal traffic.
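A minimal sketch of the EWMA idea described above: each observed cold-start duration nudges a running estimate, and the estimate is converted into a lookahead measured in control steps. The smoothing factor, the class and function names, and the ceiling rule for the horizon are assumptions made for illustration; the abstract does not specify how FH-OPT is derived.

```python
import math

class ColdStartEstimator:
    """Exponentially weighted moving average of observed cold-start durations (seconds)."""

    def __init__(self, alpha=0.3):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.estimate = None  # no observation yet

    def update(self, observed_seconds):
        if self.estimate is None:
            self.estimate = float(observed_seconds)   # first sample seeds the estimate
        else:
            self.estimate = (self.alpha * observed_seconds
                             + (1.0 - self.alpha) * self.estimate)
        return self.estimate

def planning_horizon(estimate_seconds, step_seconds, min_steps=1):
    """Number of control steps needed to cover the estimated provisioning delay (assumed rule)."""
    return max(min_steps, math.ceil(estimate_seconds / step_seconds))

estimator = ColdStartEstimator(alpha=0.5)
for observed in (10.0, 20.0, 12.0):
    current = estimator.update(observed)
print(current, planning_horizon(current, step_seconds=10.0))
```

A larger `alpha` reacts faster to a sudden change in provisioning delay but is noisier between consecutive scale-out events.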
[LG-32] Continual Learning of Domain-Invariant Representations ICML2026
链接: https://arxiv.org/abs/2605.15775
作者: Pascal Janetzky,Tobias Schlagenhauf,Stefan Feuerriegel
类目: Machine Learning (cs.LG)
备注: ICML 2026
Abstract:Continual learning (CL) aims to train models sequentially over multiple domains without forgetting previously learned knowledge. However, existing CL methods optimize for in-domain performance and are therefore prone to learning spurious, domain-specific cues ("shortcut learning"), which limits generalization to unseen domains after deployment. In this paper, we address this limitation through continual learning of domain-invariant representations. We introduce a broad class of CL methods that sequentially learn representations capturing invariant structures across domains. Our methods are motivated by the observation that such invariant structures often preserve the underlying causal mechanisms, which can reduce the risk of overfitting to domain-specific cues and thus offer better out-of-domain generalization. Our proposed CL methods combine replay-based training with a tailored sequential invariance alignment to learn – and preserve – invariant structures over time. We evaluate our methods under a deployment-oriented protocol that measures performance on unseen target domains. Across six benchmark and real-world datasets spanning vision, medicine, manufacturing, and ecology, our methods consistently outperform existing CL baselines in terms of generalization to unseen target domains. As an ablation, we further show that naïve extensions of sequential training with existing domain-invariant representation learning (DIRL) methods provide only limited benefits. To the best of our knowledge, this is the first work to develop domain-invariant representation methods for CL.
[LG-33] A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation
链接: https://arxiv.org/abs/2605.15761
作者: Hosna Oyarhoseini,Jimmy Lin,Amir-Hossein Karimi
类目: Machine Learning (cs.LG)
备注:
Abstract:Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences into model rankings, yet the robustness of these rankings remains poorly understood. We present a unified perturbation framework for analyzing Bradley-Terry leaderboards under structured data modifications using influence-based approximations. Our framework studies three match-level perturbations – Drop, Add, and Flip – together with player removal, and evaluates their effects on top-k membership, global ranking consistency via Kendall’s tau, and confidence-interval-based uncertainty. Across Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust across all three objectives: sub-1% targeted perturbations can change the top-ranked model, degrade Kendall’s tau, and alter confidence intervals. Beyond robustness auditing, we show that the same influence scores enable efficient targeted perturbations, promoting or demoting specific models and reducing target-model uncertainty with fewer actions than previous manipulation and active-sampling baselines. By summarizing these effects with normalized dataset-level robustness scores, our framework provides a practical and helpful tool for auditing leaderboard stability and motivating more robust evaluation protocols.
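To make the "Flip" perturbation above concrete, here is a small sketch that fits Bradley-Terry strengths with the standard minorization-maximization update and shows a single flipped match changing the top-ranked model on a toy three-model leaderboard. The win counts are invented for illustration, and the influence-based approximations from the paper are not implemented; the model is simply refitted.

```python
import numpy as np

def fit_bradley_terry(wins, iters=500, tol=1e-10):
    """MM fit of Bradley-Terry strengths; wins[i, j] counts how often i beat j."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T                  # matches played between each pair
    total_wins = wins.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()               # strengths are identified only up to scale
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

# Toy leaderboard (made-up counts): every pair of models has played 10 matches.
wins = np.array([[0, 5, 6],
                 [5, 0, 5],
                 [4, 5, 0]])
before = fit_bradley_terry(wins)

flipped = wins.copy()
flipped[0, 1] -= 1                         # one recorded win of model 0 over model 1 ...
flipped[1, 0] += 1                         # ... is flipped into a win for model 1
after = fit_bradley_terry(flipped)

print("top before:", int(np.argmax(before)), "top after:", int(np.argmax(after)))
```

Refitting after every candidate perturbation is exactly what becomes expensive on real leaderboards, which is the motivation for influence-style approximations.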
[LG-34] Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models
链接: https://arxiv.org/abs/2605.15706
作者: Xingjian Wu,Junkai Lu,Siyu Yan,Xiangfei Qiu,Jilin Hu,Chenjuan Guo,Bin Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in Large Language Models (LLMs) have catalyzed the development of multi-agent systems (MAS) for complex reasoning tasks. However, existing MAS typically rely on pre-defined or pre-compiled communication topologies, which limits their flexibility and adaptability to dynamic task requirements. In this work, we propose Differentiable Mixture-of-Agents (DMoA), a self-evolving multi-agent framework that enables elastic and adaptive agent collaboration during inference. Instead of statically constructing workflows, DMoA dynamically routes and activates agents at each reasoning step, allowing the system to implicitly simulate diverse communication topologies and adapt to evolving demands. To achieve this, we design a differentiable, context-aware routing mechanism that leverages recurrent structures to incorporate historical and contextual information, producing sparse agent activations in a step-wise manner. Furthermore, we introduce predictive entropy as self-supervised signals to optimize the routing process, enabling efficient test-time adaptation without external annotations. Extensive experiments across 9 benchmarks demonstrate that DMoA achieves state-of-the-art performance while exhibiting strong efficiency, robustness, and ensembling capabilities.
[LG-35] AGOP-IxG: A Gradient Covariance Filter for Local Feature Attribution on Tabular Data with a Controlled Benchmark
链接: https://arxiv.org/abs/2605.15700
作者: Raj Kiran Gupta Katakam
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 3 tables. Submitted to AutoML Conference 2026 (ABCD Track)
Abstract:Automated machine learning pipelines increasingly produce models whose predictions must be explained to end users, auditors, and downstream decision systems. The most widely used feature attribution methods (SHAP, Integrated Gradients, LIME) are typically chosen by convention rather than measured fidelity, because rigorous evaluation is impeded by the absence of ground-truth attribution on real data. We propose AGOP-IxG, a fast per-sample attribution method for tabular classifiers that pre-multiplies the per-sample gradient by a top-$K$ rank-truncated Average Gradient Outer Product matrix, and evaluate it against four widely-used baselines on a controlled tabular benchmark designed for AutoML practitioners. In Part 1, we construct three synthetic multi-class tabular tasks (linear, sparse nonlinear, interaction-based) where ground-truth attribution per sample is analytically or numerically derivable, and compare five methods: AGOP-IxG, SHAP (DeepExplainer), Integrated Gradients, InputXGradient, and LIME. AGOP-IxG leads on Spearman rank correlation and noise feature mass on all three synthetic datasets, and on top-$k$ precision on the interaction dataset. Across all settings, AGOP-IxG is approximately $350\times$ to $1{,}650\times$ faster than SHAP. In Part 2, we evaluate global faithfulness on Adult Income and Credit Card Default using the ROAR protocol; the methods cluster within $\sim$1.7% relative AUC, consistent with AGOP-IxG being optimized for per-sample local attribution rather than global feature ranking.
[LG-36] Going Beyond the Edge: Distributed Inference of Transformer Models on Ultra-Low-Power Wireless Devices
链接: https://arxiv.org/abs/2605.15694
作者: Alexander Gräfe,Ding Huo,Johannes Berger,Marco Zimmerling,Sebastian Trimpe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformer models are rapidly becoming a cornerstone of modern Internet of Things (IoT) applications, yet their computational and memory demands far exceed the capabilities of a single typical ultra-low-power IoT device. We present CATS, a framework for distributed transformer inference on ultra-low-power wireless devices, enabling multiple devices to collaboratively execute models far larger than what a single device can sustain. At its core, CATS is a communication-aware distributed transformer inference scheme co-designed across transformer partitioning, wireless communication and training. It employs SomeGather, a new pruned communication primitive that selectively broadcasts activation columns to reduce communication bandwidth and RAM usage without sacrificing model accuracy. Building on SomeGather, we design a partitioning method that exploits this primitive for efficient model parallelism. To cope with unreliable wireless communication, CATS employs message-dropout during training, which mimics packet losses and yields models that are robust to message loss during inference. In real-world experiments, we show that CATS brings distributed transformer inference to ultra-low-power wireless devices for the first time, with deployments on up to 16 devices that collaboratively execute transformer models up to 14 times larger than what a single device can run.
[LG-37] Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning
链接: https://arxiv.org/abs/2605.15692
作者: Zijun Chen,Zihan Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^{K} [V^{*,M^k} - V^{\pi^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/\epsilon^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of $\widetilde{O}\left( \inf_{p\in [0,1)} \left( \frac{1}{\Delta_{\min}^p} + pK\Delta_{\min}^p \right)\log K \cdot \mathrm{poly}(S,A,H) \right)$, where $\Delta_{\min}^p$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
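The step from the stochastic regret bound to the sample complexity bound can be sketched with the standard online-to-batch argument. This is an illustration under stated assumptions, not the paper's proof: it assumes contexts are drawn i.i.d. from the fixed distribution, that the output policy $\bar\pi$ is drawn uniformly from the $K$ policies played, and it suppresses constants and logarithmic factors.

```latex
\begin{align*}
\mathbb{E}\big[V^{*,M} - V^{\bar\pi,M}\big]
  &= \frac{1}{K}\,\mathbb{E}\Big[\sum_{k=1}^{K}\big(V^{*,M^k} - V^{\pi^k,M^k}\big)\Big]
   \le \widetilde{O}\!\left(\sqrt{\frac{SAH^3}{K}}\right), \\
\sqrt{\frac{SAH^3}{K}} \le \epsilon
  &\iff K \ge \frac{SAH^3}{\epsilon^2}.
\end{align*}
```

Under these assumptions, on the order of $SAH^3/\epsilon^2$ episodes suffice for an $\epsilon$-optimal policy in expectation, which matches the rate quoted in the abstract.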
[LG-38] SEED: Targeted Data Selection by Weighted Independent Set
链接: https://arxiv.org/abs/2605.15691
作者: Yuan Zhang,Lifeng Guo,Junwen Pan,Chang Liu,Wenzhao Zheng,Kuan Cheng,Kurt Keutzer,Shanghang Zhang
类目: Machine Learning (cs.LG)
*备注: 20 pages
Abstract:Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) local scale normalization that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct Honeybee-Remake-SEED-200K, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.
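The Weighted Independent Set formulation can be illustrated with a generic greedy heuristic: take the heaviest remaining sample and discard everything marked as redundant with it. This is a sketch of the problem setup only; the abstract does not say which solver SEED uses, and the node value calibration and local scale normalization steps are not modeled. The weights, edges, and function name are hypothetical.

```python
def greedy_weighted_independent_set(weights, edges):
    """Greedy heuristic for Weighted Independent Set.

    weights: dict node -> importance score (e.g. an influence estimate).
    edges: iterable of (u, v) pairs linking redundant samples.
    Repeatedly selects the heaviest unblocked node and blocks its neighbors,
    so no two selected nodes share an edge.
    """
    neighbors = {node: set() for node in weights}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    selected, blocked = [], set()
    # Sort by descending weight; ties are broken by node name for determinism.
    for node in sorted(weights, key=lambda n: (-weights[n], n)):
        if node in blocked:
            continue
        selected.append(node)
        blocked.add(node)
        blocked |= neighbors[node]
    return selected


# Hypothetical toy graph: "a"/"b" and "b"/"c" are near-duplicates, as are "d"/"e".
weights = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.4, "e": 0.3}
edges = [("a", "b"), ("b", "c"), ("d", "e")]
subset = greedy_weighted_independent_set(weights, edges)
print(subset)
```

The greedy rule gives no optimality guarantee in general, but it shows how edges enforce diversity while weights favor high-value samples.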
[LG-39] FRWKV: Adaptive Periodic-Position Branch Interaction for Frequency-Space Linear Time Series Forecasting
链接: https://arxiv.org/abs/2605.15690
作者: Qingyuan Yang,Dongyue Chen,Da Teng,Junhua Xiao,Jiaji Pan,Shizhuo Deng
类目: Machine Learning (cs.LG)
*备注: this https URL
Abstract:Long-term time series forecasting is essential for decision making in energy, finance, transportation, and healthcare systems. Recent lightweight forecasting models improve efficiency by operating in transformed or linearized spaces, but two challenges remain in frequency-space forecasting. The real and imaginary streams of complex spectra contain complementary information that is often weakly exchanged, and periodic-position cues can help recurring patterns only when they are reliable for the current dataset and prediction horizon. To address these challenges, we propose FRWKV+, an enhanced FRWKV forecasting model for selective periodic-position branch interaction. FRWKV+ first introduces cross-branch gates that exchange compact contexts between the real and imaginary frequency streams, allowing each stream to modulate the other. It then uses the Adaptive PhaseGate mechanism to extract periodic-position context and generate signed corrections to these gates. An adaptive trust mechanism controls the correction strength at the sample, variable, and channel levels, so periodic-position information is admitted as a reliable correction signal while preserving the efficiency of the FRWKV backbone. External benchmark tables report a separately labeled FRWKV-family selected system for manuscript-level comparison, while mechanism-level claims are based on strict matched-seed FRWKV-family ablations and representative component-level ablations. Under this matched protocol, FRWKV+ achieves the largest MSE winner coverage among the family variants and provides clear gains in selected periodic regimes. Component analysis further supports the usefulness of periodic-position context, signed correction, and adaptive trust in these regimes, while revealing boundary cases where simpler correction rules remain preferable.
[LG-40] Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation ICML2026
链接: https://arxiv.org/abs/2605.15669
作者: Jinuk Kim,Junsoo Byun,Donghwi Hwang,Seong-Jin Park,Hyun Oh Song
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Manufacturable chip layouts must satisfy thousands of geometry-based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural language rules into correct DRC scripts is labor-intensive and requires specialized expertise, motivating LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness, and prior machine learning-based methods either ignore execution feedback or require labeled test layouts as agent’s input. To this end, we introduce Rule2DRC, a large-scale benchmark for DRC script coding agents with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best-of-N selection performance in this domain. We release the code at this https URL.
[LG-41] On the Power of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits COLT2026
链接: https://arxiv.org/abs/2605.15663
作者: Arnab Maiti,Yunbei Xu,Kevin Jamieson
类目: Machine Learning (cs.LG)
*备注: Accepted at COLT 2026
Abstract:We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $\langle \widehat{x},\theta\rangle \ge \max_{x\in\mathcal{X}} \langle x,\theta\rangle - \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. First, we present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $\Omega\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. We then turn to adaptive sampling. We raise an important structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat{r}$ satisfying $\left|\widehat{r}-\|\theta\|_2\right|\le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector.
Submission history: [v1] Fri, 15 May 2026 06:35:31 UTC (53 KB)
[LG-42] Towards Code-Oriented LM Embeddings for Surrogate-Assisted Neural Architecture Search GECCO2026
链接: https://arxiv.org/abs/2605.15649
作者: Pranav Somu,Advay Balakrishnan,Stepan Kravtsov,Aaron McDaniel,Jason Zutty
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: This is an extended version of work accepted to GECCO 2026. Our code is available at this https URL
Abstract:Developing effective surrogates (performance predictors) for Neural Architecture Search (NAS) typically requires expensive fine-tuning or the engineering of complex representations. We propose a low-cost embedding strategy that leverages the inductive bias of Language Models (LMs) to eliminate these overheads. By representing architectures as PyTorch class definition text, we demonstrate that off-the-shelf LMs act as competitive feature extractors without NAS-specialized fine-tuning. The final predictor is constructed by passing the extracted Code-Oriented LM Embeddings (COLE) through a lightweight regression head. We also investigate strategies to improve embedding quality and utilization. Our experiments on the NAS-Bench-201 and einspace search spaces reveal that raw code inputs yield higher predictive performance than other text-based encodings (e.g., ONNX-to-text encodings) when using frozen LMs. We also observe COLE drives superior surrogate-assisted search using the BANANAS algorithm in NAS-Bench-201. When optimizing for CIFAR-100 performance, replacing structural path encodings with COLE for architecture representation allows for a 34% decrease in the evaluation budget required to reach within 1% of the fittest architecture in the search space (by test accuracy). As any neural architecture can be represented as code, these findings establish COLE as a versatile and efficient foundation for advancing NAS.
[LG-43] Perforated Neural Networks for Keyword Spotting
链接: https://arxiv.org/abs/2605.15647
作者: Vishy Gopal,Aris Ilias Goutis,Ralph Crewe,Erin Yanacek,Rorry Brenner
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 9 pages, 1 figure, 800-trial hyperparameter sweep; Best Model award, Edge Impulse 2025 Hackathon
Abstract:Edge machine learning presents a unique set of constraints not encountered in cloud-scale model deployment: strict memory budgets, limited compute, and non-negotiable accuracy thresholds must all be satisfied simultaneously. Existing compression and optimization techniques can trade one resource for another, but rarely improve both accuracy and model size at the same time. This paper presents the application of Perforated Backpropagation to keyword spotting on the Edge Impulse platform, an experiment that won the Best Model award at the Edge Impulse 2025 Hackathon in December 2025. By adding artificial Dendrite Nodes to a standard convolutional neural network trained on the Edge Impulse keyword spotting tutorial pipeline, we demonstrate that dendritic models outperform traditional architectures at every level of parameter count and at every accuracy threshold tested across 800 hyperparameter trials. The best dendritic model achieved a test accuracy of 0.933 with only 1,500 parameters, versus the baseline accuracy of 0.921 requiring approximately 4,000 parameters. These results suggest that Perforated Backpropagation is a powerful addition to the edge AI engineer’s toolkit, offering simultaneous gains in both model quality and deployment efficiency.
[LG-44] IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
链接: https://arxiv.org/abs/2605.15626
作者: Ali Abbasi,Chayne Thrash,Haoran Qin,Hamed Pirsiavash,Soheil Kolouri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families, together with inference-time analysis, show that IO-SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at this https URL
[LG-45] Position: Zeroth-Order Optimization in Deep Learning Is Underexplored Not Underpowered
链接: https://arxiv.org/abs/2605.15622
作者: Sijia Liu,Yicheng Lang,Soumyadeep Pal,Changsheng Wang,Yancheng Huang,Chongyu Fan,James Diffenderfer,Bhavya Kailkhura,Yihua Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.
[LG-46] Transformer-like Inference from Optimal Control
链接: https://arxiv.org/abs/2605.15608
作者: Aditya Kudre,Heng-Sheng Chang,Prashant G. Mehta
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Preprint
Abstract:Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem - and in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer. Numerical experiments provide a comparison of the optimal control to attention weights from a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.
[LG-47] Gaussian Relational Graph Transformer
链接: https://arxiv.org/abs/2605.15575
作者: Zezhong Ding,Jin Li,Xugang Wang,Xike Xie
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
Abstract:Relational graph learning models relational databases as graphs and has demonstrated superior performance on a wide range of relational predictive tasks. However, existing methods struggle to capture long-range dependencies due to information decay in their message-passing mechanisms, and recent relational graph transformers remain limited in jointly modeling structural, semantic, and temporal information. In this paper, we propose GelGT, a Gaussian relational graph transformer that explicitly addresses these challenges. GelGT introduces a structure-semantic collaborative sampling strategy to preserve structural connectivity while filtering irrelevant semantic information, and incorporates a Gaussian graph attention mechanism with a learnable Gaussian bias on the sampled subgraphs to dynamically encode temporal dependencies. Extensive experiments on various real-world datasets demonstrate that GelGT achieves state-of-the-art downstream task performance, with up to a 13.8% improvement in predictive performance.
[LG-48] CrystalBoltz: End-to-End Protein Structure Determination via Experiment-Guided Diffusion for X-Ray Crystallography
链接: https://arxiv.org/abs/2605.15564
作者: Minseo Kim,Huanghao Mai,Jay Shenoy,Alec Follmer,Gordon Wetzstein,Frederic Poitevin
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Image and Video Processing (eess.IV)
*备注: Project page: this https URL
Abstract:Generative models trained on public databases of protein structures, most of which have been determined by X-ray crystallography, now provide powerful priors for structure prediction. However, they are not readily conditioned on the measurements from a new crystallographic experiment, limiting their use for X-ray structure determination. In crystallography, the measured structure-factor amplitudes do not by themselves determine an electron density map or atomic structure because the associated phases are unobserved and must be inferred. Structure determination therefore remains an inverse problem in which candidate models must be both structurally plausible and consistent with measured diffraction data, often requiring substantial manual refinement by human experts. Emerging methods aim to incorporate experimental information more directly into predictive and refinement workflows. We present CrystalBoltz, a generative framework that casts crystallographic refinement as Bayesian inference over atomic structures and operates directly on structure-factor amplitudes. CrystalBoltz moves from unguided generation with a pre-trained prior over protein structures to experiment-guided posterior sampling, followed by atomic coordinate and B-factor refinement. Across multiple protein crystallography datasets, CrystalBoltz attains lower coordinate RMSD and lower R-factors than the strongest baselines considered, while reducing runtime by a factor of 33 relative to existing experimentally guided refinement.
[LG-49] Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
链接: https://arxiv.org/abs/2605.15551
作者: Pedram Bakhtiarifard,Sophia N. Wilson,Mahmoud Afifi,Jonathan Wenshøj,Raghavendra Selvan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training large-scale deep neural networks (DNNs) is resource-intensive, making model compression a practical necessity. The widely accepted "learning as compression" hypothesis posits that training induces structure in network weights, which enables compression. Measuring this structure through Kolmogorov-Chaitin-Solomonoff (KCS) complexity is appealing, but existing estimators based on the Coding Theorem Method (CTM) and the Block Decomposition Method (BDM) are limited to small binary objects and do not scale to modern DNNs. We introduce the Quantized Block Decomposition method (QuBD), which extends algorithmic complexity estimation to any $k$-ary object. QuBD first quantizes the network weights to a finite alphabet, then estimates the KCS complexity by aggregating per bit-plane CTM estimates. We show theoretically that QuBD yields a strictly tighter estimation gap with respect to true KCS complexity than binarization-based methods. Using QuBD, we study how the algorithmic complexity of neural network weights evolves during training, showing that it decreases as models learn, scales with data budget, increases during overfitting, follows the delayed generalization observed during grokking, and correlates with generalization performance. We further show that algorithmic information resides predominantly in the most significant bit-planes, which can serve as a practical diagnostic for determining appropriate post-training quantization levels. This work offers novel insights into learning mechanisms in DNNs by providing the first scalable, tractable estimates of KCS complexity for large, non-binary objects such as DNN weights.
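The first two steps the abstract describes, quantizing weights to a finite alphabet and splitting the codes into bit-planes, can be sketched as follows. The uniform min-max quantizer is an assumption made for illustration (the abstract does not specify the quantization scheme), and the per-plane CTM complexity estimate itself is omitted because it relies on precomputed lookup tables. Function names are hypothetical.

```python
def quantize(weights, bits):
    """Uniformly quantize floats to integer codes in [0, 2**bits - 1].

    Assumption: min-max scaling over the given weights.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    if hi == lo:
        return [0] * len(weights)
    return [round((w - lo) / (hi - lo) * levels) for w in weights]


def bit_planes(codes, bits):
    """Split integer codes into binary planes, most significant plane first."""
    return [[(c >> b) & 1 for c in codes] for b in range(bits - 1, -1, -1)]


def from_bit_planes(planes):
    """Recombine planes (most significant first) into integer codes."""
    bits = len(planes)
    return [
        sum(plane[i] << (bits - 1 - p) for p, plane in enumerate(planes))
        for i in range(len(planes[0]))
    ]


# Hypothetical toy weights quantized to a 2-bit (4-symbol) alphabet.
toy_weights = [-1.0, -0.4, 0.2, 1.0]
codes = quantize(toy_weights, bits=2)
planes = bit_planes(codes, bits=2)
print(codes, planes)
```

Each plane is a binary object, so a binary complexity estimator could be applied per plane and the results aggregated; the decomposition is lossless with respect to the quantized codes.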
[LG-50] Rethinking Neural Network Learning Rates: A Stackelberg Perspective
链接: https://arxiv.org/abs/2605.15530
作者: Sihan Zeng,Sujay Bhatt,Sumitra Ganesh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond convergence, we identify two mechanisms by which non-uniform learning rates can outperform uniform learning rates: (i) we show that certain problem instances induce a Stackelberg objective with stronger optimization structure than the original objective, yielding faster convergence to globally optimal solutions, (ii) our numerical analysis reveals that the Stackelberg objective can exhibit substantially sharper local curvature, especially in early training, which leads to more informative gradients and learning acceleration. Experiments in supervised learning and reinforcement learning support our findings.
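As a toy illustration of the two-time-scale idea, the sketch below alternates a small-step update on a "body" parameter with a large-step update on a "head" parameter for a scalar model with prediction `v * w`. The objective, step sizes, and initial values are assumptions chosen for illustration; this is not the paper's algorithm or its Stackelberg reformulation, only the alternating update pattern with non-uniform learning rates.

```python
def two_time_scale_gd(target=2.0, lr_body=0.01, lr_head=0.5, steps=200):
    """Alternating gradient descent on loss(w, v) = 0.5 * (v * w - target)**2.

    w plays the role of the slow "body" parameter and v the fast "head"
    parameter. Returns the final (w, v) and the loss after every iteration.
    """
    w, v = 1.0, 0.0
    losses = []
    for _ in range(steps):
        residual = v * w - target
        w -= lr_body * residual * v   # slow step: d(loss)/dw = residual * v
        residual = v * w - target
        v -= lr_head * residual * w   # fast step: d(loss)/dv = residual * w
        losses.append(0.5 * (v * w - target) ** 2)
    return w, v, losses


w, v, losses = two_time_scale_gd()
print(w, v, losses[-1])
```

In this toy problem the head tracks the best response to the slowly moving body, which is the leader-follower intuition behind the Stackelberg view described in the abstract.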
[LG-51] OgBench: A Framework for Evaluating Graph Neural Networks on Omics Data
链接: https://arxiv.org/abs/2605.15511
作者: Louisa Cornelis,Johan Mathe,Louis Van Langendonck,Guillermo Bernárdez,Nina Miolane
类目: Machine Learning (cs.LG)
*备注: 42 pages
Abstract:Graph Neural Networks (GNNs) have become the dominant framework for inductive graph-level learning. Yet most benchmarks focus on the regime $n \gg p$, where the number of graphs $n$ greatly exceeds the number of nodes per graph $p$. This overlooks biological domains such as omics, which operate in the opposite $n \ll p$ regime, characterized by large graphs of genes, transcripts, or proteins across few patient samples. This raises the question: how do GNNs perform in this low-sample, high-node omics setting? We introduce OgBench (Omics-Graph Bench), the first benchmarking platform for graph-level prediction in the $n \ll p$ regime characteristic of omics data. We provide a standardized, end-to-end modular infrastructure from raw omics data to families of featured graphs with varied structural properties. We benchmark classical GNNs, as well as GNNs designed for large graphs and omics applications, alongside MLPs and machine learning baselines to establish reference performances. Our results show that widely used GNNs often do not outperform simple MLPs and classical baselines. These findings challenge the prevailing assumption that graph structure inherently adds value in this domain, fostering a critical reassessment of current learning paradigms. Ultimately, by exposing these limitations, OgBench provides the open-source ecosystem necessary for the community to develop and validate novel architectures explicitly tailored for biological graphs. The code is available at this https URL.
[LG-52] parallelcbf: A composable safety-filter and auditability framework for tensor-parallel reinforcement learning
链接: https://arxiv.org/abs/2605.15509
作者: Yijun Lu,Zilei Yang,Yuyin Ma
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:While Isaac Lab provides massive parallel UAV simulation, OmniSafe and safe-control-gym provide constrained-RL benchmarks, and CBFKit provides control-barrier-function synthesis tooling, no existing framework unifies these capabilities for end-to-end safety-constrained training. ParallelCBF is the first framework to unify (i) tensor-parallel UAV environments, (ii) hard-gate CBF safety filters, (iii) sharded BC-to-RL pipelines, and (iv) first-class operational auditability – pre-registration, watchdog registries, failure forensics, and dataset audits as composable APIs rather than user-implemented scripts. We release ParallelCBF v0.1.0 under Apache 2.0 with a four-layer composable API, a CPU PyTorch reference implementation of a dual-barrier (squared / linear-predictive) CBF, property-based safety invariance tests across vectorized batch sizes that complete in 1.67 s for the full 39-test suite, and a 31,415-episode behavior-cloning collection campaign whose curriculum mix, per-bucket yields, and dataset SHA-256 are auditable through the framework's own ops primitives. We report a representative end-to-end pipeline execution in which the framework's auditability layer halted a downstream training stage that did not meet pre-registered convergence criteria, preventing silent propagation of a degraded checkpoint – an architectural property we argue is necessary, not merely useful, for reproducible empirical robotics research. The framework is installable via pip install parallelcbf; source and release artifacts are available at this https URL.
[LG-53] SurvivalPFN: Amortizing Survival Prediction via In-Context Bayesian Inference
Link: https://arxiv.org/abs/2605.15488
Authors: Shi-ang Qi, Vahid Balazadeh, Michael Cooper, Russell Greiner, Rahul G. Krishnan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Survival analysis provides a powerful statistical framework for modeling time-to-event outcomes in the presence of censoring. However, selecting an appropriate estimator from the many specialized survival approaches often requires substantial methodological and domain expertise. We introduce SurvivalPFN, a prior-data fitted network that amortizes Bayesian inference for censored observations through in-context learning. SurvivalPFN is pretrained on a diverse family of synthetic, identifiable, and right-censored data-generating processes, enabling it to amortize survival analysis in a single forward pass during inference. As a result, the model adapts to the effective complexity of each dataset without task-specific training or hyperparameter tuning, avoids restrictive parametric assumptions, and produces calibrated survival distributions. In a large-scale benchmark spanning 61 datasets, 21 methods, and 5 evaluation metrics, SurvivalPFN achieves strong predictive performance and often improves upon established survival models. These results suggest that SurvivalPFN offers a principled and practical foundation model for survival analysis, with potential applications in high-impact domains such as healthcare, finance, and engineering (this https URL).
[LG-54] Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting
Link: https://arxiv.org/abs/2605.15470
Authors: Daniel Holmberg, Joel Oskarsson, Erik Larsson, Fredrik Lindsten, Teemu Roos
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: Preprint
Abstract:Ocean dynamics are inherently chaotic, yet existing machine learning ocean models produce only deterministic forecasts. We introduce Njord, a probabilistic data-driven model for ocean forecasting, applicable to both global and regional domains. Njord combines a deep latent variable framework with a graph neural network architecture, enabling sampling each forecast step in a single forward pass. We apply Njord globally at 0.25° resolution and regionally to the Baltic Sea at 2 km resolution. To scale to these large ocean grids we introduce K-means cluster meshes that adapt to irregular sea surface geometry. Experiments demonstrate strong performance on both domains compared to deterministic machine learning baselines, while also providing uncertainty estimates from the sampled ensemble forecasts. On the global OceanBench benchmark, Njord achieves the lowest errors on average across upper-ocean variables when evaluated against real-world observations, with the largest improvements in surface temperature prediction.
[LG-55] Toward World Modeling of Physiological Signals with Chaos-Theoretic Balancing and Latent Dynamics
Link: https://arxiv.org/abs/2605.15465
Authors: Yunfei Luo, Xi Chen, Yuliang Chen, Lanshuang Zhang, Md Mofijul Islam, Siwei Zhao, Peter Kotanko, Subhasis Dasgupta, Andrew Campbell, Rakesh Malhotra, Tauhidur Rahman
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: NormWear Collection: this https URL
Abstract:Physiological time series signals reflect complex, multi-scale dynamical processes of the human body. Existing modeling studies focus on static tasks such as classification, event forecasting, or short-horizon next-step prediction, while long-horizon signal-level forecasting and the predictive nature of physiological signals remain underexplored. We introduce NormWear-2, a world model that encodes both multivariate physiological signals and clinical intervention variables into a shared latent space and models their joint temporal evolution as a dynamical system. Our approach combines inference from prior pre-trained knowledge (intuition) with instant non-parametric latent state transition adaptation (insight), enabling coherent forecasting across multiple temporal scales, conditioned on heterogeneous clinical interventions. During the pretraining phase, we find that chaos-theoretic balancing of dynamical regime diversity yields more robust representations, with a smaller balanced corpus outperforming one twice its size and capturing bifurcation regimes. We evaluate the world model's performance across diverse real-world physiological datasets spanning heterogeneous temporal resolutions and intervention regimes, covering daily life, point-of-care, and clinical settings, including fitness planning, hemodialysis, diabetes management, and surgical monitoring. These evaluation datasets comprise records from 8,026 subjects, spanning study durations from 3.2 hours for high-resolution signal data to 2.3 years for longitudinal clinical biomarker tracking. NormWear-2 achieves the best overall forecasting performance across time, frequency, and latent representation domains, with significant improvements over state-of-the-art time series foundation models, while maintaining competitive downstream representation quality, providing a step toward general-purpose world models for physiological signals.
[LG-56] Layer-wise Derivative Controlled Networks
Link: https://arxiv.org/abs/2605.15463
Authors: Rowan Martnishn, Sean Anderson
Subjects: Machine Learning (cs.LG)
Comments: Under Review at Neural Network Elsevier
Abstract:As machine learning models grow in complexity, they increasingly struggle with three conflicting demands: the need for high accuracy, the requirement for hardware efficiency, and the necessity of functional stability. Traditional architectures often achieve performance at the expense of spiky or unpredictable behavior, where small changes in input lead to massive swings in output – a critical flaw for real-world deployment in sensitive environments. This paper introduces ChainzRule (CR), a novel neural architecture designed to harmonize these competing goals. ChainzRule replaces standard piecewise-linear activations with a Polynomial Engine governed by Differential Regularization (DREG). Unlike traditional methods that impose global, coarse-grained constraints on a model’s Lipschitz constant, DREG acts as a targeted regularization on intermediate derivatives. This approach suppresses extreme sensitivity without attenuating the representational power inherent in the Polynomial Engine. In head-to-head “Fair Fight” benchmarks, ChainzRule outperformed standard models while using 15.5x fewer parameters. On the MNIST dataset, it reduced peak gradient volatility by an average of 23.1%, ensuring a smoother and more predictable manifold. On Yelp Full ordinal regression under explicit DREG regularization, ChainzRule achieves 70.17% accuracy, validating that derivative-aware regularization is compatible with competitive performance on realistic tasks. By embedding gradient awareness into the architecture via DREG, ChainzRule demonstrates that stability and accuracy need not be competing objectives.
[LG-57] Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics
Link: https://arxiv.org/abs/2605.15459
Authors: Albert Kjøller Jacobsen, Leo Uhre Jakobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:The minima of modern neural network loss functions are typically not isolated; rather, they form connected components of reparameterization-invariant solutions on the training data. Analytically characterizing these solutions is a hard problem, but sampling approaches are feasible. By construction, existing methods either spread over low-loss regions, and thus do not sample reparameterization-invariant solutions exactly, or are inherently local, which limits exploration of other minima valleys. We propose sampling such reparameterization-invariant models using a dynamical system based on kinetic energy, subject to a gravitational pull and a friction term that dissipates energy from the system. Our proposed sampler, DiMS, is guaranteed to sample exactly from the minimum level sets and depends on physically motivated hyperparameters that allow control over the exploration capabilities of the sampler. We consider uncertainty quantification in Bayesian inference as the motivating problem and observe improved performance compared to previously proposed approaches.
[LG-58] On the Stability of Growth in Structural Plasticity
Link: https://arxiv.org/abs/2605.15435
Authors: Lute Lillo, Nick Cheney
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Standard deep-learning pipelines usually choose the network architecture before training and keep it fixed throughout optimization. In contrast, a model can also be adapted by editing its structure during training, for example by pruning existing hidden-neuron units or growing new ones. Although growth is appealing for adaptive and continual systems, we show that it is not simply the inverse of pruning. Pruning selects among units that have participated in training from the start, whereas growth inserts new units into an already specialized optimization trajectory. We isolate this insertion problem and show that newborn units are often forward-active but backward-starved: they participate in the forward computation, yet receive much weaker gradient signal than incumbent units. This disadvantage is minor in small MLP benchmarks, but becomes clear in harder image-classification settings with a convolutional trunk. In these settings, Grow can achieve high final accuracy during the structural-editing procedure, while Prune is stronger when performance is averaged over the training trajectory or when the final sparse network is retrained from scratch. Interventions targeting optimizer state, insertion, selection, and trainability show that improving the integration of newborn units can improve adaptive performance, but does not automatically produce better final subnetworks. In continual-learning benchmarks stressing plasticity loss, Grow becomes competitive mainly when new units have enough time to integrate. Together, these results suggest that Grow should be evaluated not only as an architecture-search operator, but as a time-sensitive optimization process whose success depends on insertion stability.
[LG-59] Spectral Priors vs. Attention: Investigating the Utility of Attention Mechanisms in EEG-Based Diagnosis
Link: https://arxiv.org/abs/2605.15433
Authors: Tawsik Jawad, Gowtham Atluri, Vikram Ravindra
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Electroencephalography (EEG) time-series signals are characterized by significant noise and coarse spatial resolution, which complicates the classification of neurodegenerative diseases. Even SOTA deep learning architectures struggle to distinguish between healthy controls and diseased subjects, or between different disease types, due to high intergroup similarity. In this paper, we show that a spectrally selective approach to feature construction enhances class separability. By isolating signal strengths within the primary brainwave bands, we transform high-dimensional raw data into high-value spectral features. Our results demonstrate that (a) features derived from the frequency and time-frequency domains allow traditional machine learning models to match or exceed the performance of SOTA deep learning models, (b) attention mechanisms are unable to distill the stable feature signatures that characterize healthy neural activity in both resting and task EEGs, and (c) the limitations of attention-based models in finding relevant spectral features appear to be fundamental, in that providing frequency-selective time-domain input does not appreciably improve their performance. We validate our methodology across three open-source resting EEG datasets and one task EEG dataset, providing robust empirical evidence for our claims.
[LG-60] DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
Link: https://arxiv.org/abs/2605.15422
Authors: Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes – duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N \geq 16$, $P \geq 8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once – a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and backward kernels that iterate over two disjoint KV regions – shared context and per-sequence response – in a single kernel launch, and (2) a data-pipeline redesign in veRL that repacks $N(P+R)$ tokens into $P+NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P+R)/(P+NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with $8 \times$ H100 GPUs ($N=32$, 8K-context), DualKV achieves $1.63$–$2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from 36% to 76%. Similar gains hold for DAPO ($2.47\times$ speedup, 77% MFU). At 30B MoE scale on $16 \times$ H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
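As a rough sanity check on the token-reduction factor ρ = N(P+R)/(P+NR) quoted in this abstract, the sketch below evaluates it for one configuration. The function name and the response length R = 1024 are illustrative assumptions; the abstract gives N = 32 and an 8K context but does not state R.

```python
def token_reduction_factor(n: int, p: int, r: int) -> float:
    """Return rho = N(P+R) / (P + N*R).

    n: number of responses sampled per prompt (N)
    p: shared prompt length in tokens (P)
    r: response length in tokens (R); assumed equal across responses
    """
    if n < 1 or p < 0 or r < 0 or (p + n * r) == 0:
        raise ValueError("need n >= 1, p >= 0, r >= 0, and at least one token")
    replicated_tokens = n * (p + r)   # prompt copied once per response
    shared_tokens = p + n * r         # prompt stored once, responses appended
    return replicated_tokens / shared_tokens


# N = 32 and P = 8192 follow the abstract; R = 1024 is a hypothetical value.
rho = token_reduction_factor(n=32, p=8192, r=1024)
print(f"{rho:.2f}")  # prints 7.20
```

The factor approaches N when the prompt dominates (P much larger than NR) and approaches 1 when responses dominate, which is consistent with the abstract's focus on long prompts and large rollouts.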
[LG-61] Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design
Link: https://arxiv.org/abs/2605.15419
Authors: Shukai Du, Junzhe Zhang, Yiming Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Flow matching trains a neural velocity field by regression against a target velocity associated with a prescribed probability path connecting a simple initial distribution to the data distribution. A central design choice is the path itself. Existing constructions, including rectified and optimal-transport-based paths, transport samples along straight lines between coupled endpoints and thus cover only a narrow class of dynamics. We observe that this corresponds to the simplest case of the least-action principle in classical mechanics, in which the kinetic Lagrangian yields free-particle straight-line trajectories. Building on this observation, we propose Lagrangian flow matching, a physics-based framework in which the probability path and velocity field are determined by minimizing the action of a general Lagrangian subject to the continuity equation and the prescribed endpoints. We show that this dynamic problem admits an equivalent static optimal transport (OT) formulation, yielding a family of simulation-free training objectives that recover OT-based flow matching as the kinetic special case and the trigonometric variance-preserving diffusion path as the harmonic-oscillator case. More general Lagrangians give rise to new probability paths and velocity fields, and numerical experiments show that they induce meaningful changes in the learned dynamics while remaining competitive with existing conditional flow matching models.
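The least-action problem described in this abstract can be written out in one common form, shown below. The notation ($\rho_t$, $v_t$, $p_0$, $p_1$) and the specific harmonic-oscillator Lagrangian with frequency $\omega = \pi/2$ are assumptions chosen to match the two special cases the abstract names; the paper's own normalization may differ.

```latex
% Dynamic formulation: minimize the action subject to the continuity equation.
\begin{aligned}
\min_{(\rho_t,\, v_t)} \quad
  & \int_0^1 \!\! \int L\bigl(x, v_t(x)\bigr)\, \rho_t(x)\, \mathrm{d}x\, \mathrm{d}t \\
\text{s.t.} \quad
  & \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0,
  \qquad \rho_0 = p_0, \quad \rho_1 = p_1 .
\end{aligned}

% Kinetic case: the Euler-Lagrange equation is \ddot{x} = 0 (free particle).
L(x, v) = \tfrac{1}{2}\lVert v \rVert^2
\;\Longrightarrow\;
x_t = (1 - t)\, x_0 + t\, x_1 .

% Harmonic-oscillator case (assumed form): \ddot{x} = -\omega^2 x with \omega = \pi/2.
L(x, v) = \tfrac{1}{2}\lVert v \rVert^2 - \tfrac{\omega^2}{2}\lVert x \rVert^2
\;\Longrightarrow\;
x_t = \cos\!\bigl(\tfrac{\pi t}{2}\bigr)\, x_0 + \sin\!\bigl(\tfrac{\pi t}{2}\bigr)\, x_1 .
```

Both trajectories satisfy the endpoint conditions $x_{t=0} = x_0$ and $x_{t=1} = x_1$; the first is the straight-line path of OT-based flow matching and the second is a trigonometric interpolation.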
[LG-62] Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models
Link: https://arxiv.org/abs/2605.15413
Authors: Mahdi Naser Moghadasi, Faezeh Ghaderi
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, accepted at IEEE BigData 2025
Abstract:Despite the remarkable success of transformer architectures in natural language processing, their scalability limitations remain poorly understood through systematic empirical analysis. This paper presents the first comprehensive large-scale evaluation of 118 transformer models across seven distinct architectural categories, revealing fundamental performance walls that manifest as hard deployment constraints. Our systematic benchmarking methodology uncovers a critical scalability crisis: while 88.1% of models successfully process sequences up to 512 tokens, this drops dramatically to 44.9% at 1024 tokens, with complete failure (0%) at 2048 tokens. Through rigorous analysis of loading times, memory consumption, and computational efficiency across sequence lengths from 128 to 2048 tokens, we demonstrate that compressed models achieve superior parameter efficiency (649.2 tokens/sec/M parameters) compared to large generative models (12.5 tokens/sec/M). Our findings challenge prevailing scaling assumptions and provide the first quantitative evidence that the theoretical $O(n^2)$ attention complexity translates into measurable performance walls. This work establishes new benchmarking methodologies for transformer evaluation and provides critical insights for practical deployment decisions in production environments.
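To make the quadratic attention term concrete, the sketch below counts the bytes needed to materialize the attention score matrix of a single layer at the sequence lengths the abstract mentions. It is not the paper's benchmarking method: the head count of 12, fp32 storage, and batch size of 1 are assumptions, and memory-efficient attention kernels avoid materializing this matrix at all.

```python
def attention_score_bytes(
    seq_len: int,
    num_heads: int = 12,        # assumed head count, not taken from the paper
    batch_size: int = 1,
    bytes_per_element: int = 4,  # fp32
) -> int:
    """Bytes for one layer's (batch, heads, seq_len, seq_len) score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for n in (128, 512, 1024, 2048):
    mib = attention_score_bytes(n) / 2**20
    print(f"seq_len={n:5d}  scores={mib:8.2f} MiB")
```

Under these assumptions, going from 128 to 2048 tokens is a 16-fold increase in length and a 256-fold increase in score-matrix memory.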
[LG-63] ϕ-Balancing for Mixture-of-Experts Training
Link: https://arxiv.org/abs/2605.15403
Authors: Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, Qiang Liu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
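A toy illustration of the kind of objective this abstract describes: a strictly convex, symmetric potential evaluated on the expected routing distribution. Negative entropy is used here purely as an assumed example of such a potential; the paper's actual choice of potential and its mirror-descent algorithm are not reproduced, and both function names are hypothetical.

```python
import math
from typing import List


def expected_routing(token_probs: List[List[float]]) -> List[float]:
    """Average per-token routing probabilities into one distribution over experts."""
    num_tokens = len(token_probs)
    num_experts = len(token_probs[0])
    return [
        sum(row[j] for row in token_probs) / num_tokens
        for j in range(num_experts)
    ]


def negative_entropy(p: List[float]) -> float:
    """Assumed example potential: sum_i p_i log p_i.

    It is strictly convex and symmetric in the experts, and on the
    probability simplex it is minimized by the uniform distribution.
    """
    return sum(x * math.log(x) for x in p if x > 0.0)


# Two hypothetical batches of routing probabilities over two experts.
balanced = [[0.7, 0.3], [0.3, 0.7]]   # experts used equally on average
skewed = [[0.9, 0.1], [0.8, 0.2]]     # expert 0 dominates

print(negative_entropy(expected_routing(balanced)))
print(negative_entropy(expected_routing(skewed)))
```

The balanced batch yields the lower potential, so minimizing this quantity pushes average expert usage toward uniform.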
[LG-64] LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
Link: https://arxiv.org/abs/2605.15393
Authors: Philipp Mondorf, Samuel J. Bell, Jesse Dodge, Dieuwke Hupkes
Subjects: Machine Learning (cs.LG)
Comments: 41 pages, 31 figures
Abstract:As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply because certain entities – such as names, numbers, or other contextual details – have changed while the underlying problem logic remains the same. Prior work suggests that current LLMs still struggle with this form of robustness: they often succeed on some variations of a problem but fail on others. However, existing evaluations often lack a systematic way to identify which logic-preserving variations are most likely to induce failure. Instead, they typically test a random subset of allowable variations, which can overstate robustness. To address this gap, we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations to find those that maximize difficulty and expose failures. We show that as difficulty increases, performance declines and errors in the models’ reasoning chains become more pronounced. We further demonstrate that LPDS efficiently finds difficult problem variations for a model, resulting in performance drops up to 5 times larger compared to random sampling. Finally, we show that fine-tuning on more difficult variations leads to more consistent robustness gains than training on easier ones.
[LG-65] Unified High-Probability Analysis of Stochastic Variance-Reduced Estimation
Link: https://arxiv.org/abs/2605.15388
Authors: Zhankun Luo, Antesh Upadhyay, M. Berk Sahin, Sang Bin Moon, Anuran Makur, Abolfazl Hashemi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Stochastic estimators are fundamental to large-scale optimization, where population quantities must be inferred from noisy oracle observations. Although influential methods such as momentum, SPIDER, STORM, and PAGE have been highly successful, their analyses are largely estimator-specific and expectation-based, obscuring the structural tradeoffs that determine reliability. In this paper, we develop a unified framework for stochastic variance-reduced estimation based on a recursion with three components: memory retention, reset probability, and a correction term for iterate movement. This framework recovers several classical estimators, motivates new second-order variants, and yields a bias-variance decomposition of estimation error. Our main result is a unified high-probability bound proved using a new dimension-free vector-valued Freedman inequality, valid for smooth normed spaces involving random sums of vector martingales. The result applies in both Euclidean and non-Euclidean settings, including the analysis of mirror-descent-based methods in Banach spaces. As applications, we obtain high-probability oracle complexities for unconstrained optimization with mirror descent, establishing the logarithmic dependence on the confidence level. We also derive the first $\tilde{\mathcal{O}}(\varepsilon^{-3})$ oracle-complexity bounds for stochastic optimization with expectation constraints, improving upon the existing $\tilde{\mathcal{O}}(\varepsilon^{-4})$ complexity by leveraging variance-reduced estimation for the first time in this setting.
[LG-66] PRB-RUPFormer: A Recursive Unified Probabilistic Transformer for Residual PRB Forecasting
Link: https://arxiv.org/abs/2605.15363
Authors: Saad Masrur, Yuxuan Jiang, Matti Hiltunen, Ajay Rajkumar, Ismail Guvenc
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted for publication in the Proceedings of the 2026 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN 2026), Washington, DC, USA, May 11-14, 2026
Abstract:Accurate forecasting of residual Physical Resource Blocks (PRBs) is critical for proactive network slice provisioning, energy-efficient operation, and spectrum-aware decision making in cellular systems, where residual PRBs serve as a practical proxy for short- and medium-term spectrum availability. Existing PRB prediction methods typically rely only on historical PRB values and are trained independently per carrier or sector, limiting their ability to capture cross-carrier dependencies and providing no measure of forecast uncertainty. Moreover, point forecasts alone are insufficient for robust spectrum-aware control under highly variable traffic conditions. This paper proposes PRB-RUPFormer, a recursive unified probabilistic Transformer for residual PRB forecasting. The proposed model jointly processes multivariate KPI time series using temporal, seasonal, and carrier-aware embeddings, preserving inter-metric temporal coupling during recursive rollout and stabilizing long-horizon forecasting. A single shared model is trained across all carriers and sectors of an eNB, enabling efficient learning of joint traffic dynamics with low computational overhead. Forecast uncertainty is captured through quantile-based prediction intervals, providing confidence-aware estimates of future PRB availability. Evaluations on six months of commercial LTE network data from multiple U.S. locations demonstrate median MAE below 0.05 and hit probabilities above 0.80 for both one-day and seven-day recursive forecasts. These probabilistic predictions directly support spectrum-aware RAN functions such as dynamic carrier activation, congestion avoidance, and proactive spectrum sharing, making the proposed framework well-suited for dynamic spectrum access scenarios.
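The two headline metrics in this abstract can be illustrated with a short sketch. It assumes that "hit probability" means the fraction of observed values falling inside the predicted quantile interval; the abstract does not define the term, so this reading, the function names, and the sample numbers are all hypothetical.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between observed and point-forecast values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def hit_probability(
    actual: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> float:
    """Fraction of observations inside [lower, upper] (assumed definition)."""
    if not (len(actual) == len(lower) == len(upper)) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for y, lo, hi in zip(actual, lower, upper) if lo <= y <= hi)
    return hits / len(actual)


# Hypothetical normalized residual-PRB values with lower/upper quantile forecasts.
observed = [0.42, 0.55, 0.61, 0.30, 0.75]
q_low = [0.40, 0.50, 0.50, 0.35, 0.60]
q_high = [0.50, 0.60, 0.60, 0.45, 0.80]

print(hit_probability(observed, q_low, q_high))  # prints 0.6
```

Here three of the five observations fall inside their intervals, giving a hit probability of 0.6.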
[LG-67] Federated Learning of Spiking Neural Networks under Heterogeneous Temporal Resolutions
Link: https://arxiv.org/abs/2605.15355
Authors: Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Spiking neural networks (SNNs) are biologically inspired energy-efficient models that use sparse binary spike-based communication between neurons, making them attractive for resource-constrained edge devices. Federated learning enables such devices to train collaboratively without sharing raw data. In time-series applications, edge devices often collect data at different time resolutions due to hardware and energy constraints. This temporal heterogeneity poses a fundamental challenge for federated learning: parameters learned at one temporal resolution do not necessarily transfer directly to another, which can render naive federated averaging ineffective. Targeting SNNs and, more broadly, deep networks with stateful neurons, we propose a federated learning framework that addresses this temporal resolution mismatch. We investigate how neuron parameters learned from data at different temporal resolutions should be integrated with model aggregation. We evaluate the proposed framework across two SNN-native benchmark datasets (SHD and DVS-Gesture) under a range of resolution heterogeneity scenarios. Our results show that the proposed adaptation methods can substantially recover accuracy lost due to temporal mismatch, hence enabling each client to train at its local temporal resolution while remaining compatible with the global model.
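For context, the naive federated averaging that this abstract treats as the baseline can be sketched as a sample-weighted mean of client parameters. This shows only the standard aggregation step, not the paper's resolution-aware adaptation; the function name and the example values are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_params: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Weighted mean of client parameter vectors, weighted by local sample count."""
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients holding a two-parameter model.
global_params = federated_average(
    client_params=[[1.0, 2.0], [3.0, 6.0]],
    client_sizes=[1, 3],
)
print(global_params)  # prints [2.5, 5.0]
```

The averaging treats each coordinate as meaning the same thing on every client. That is the assumption the abstract says breaks down when clients sample time at different resolutions, for example when a neuron's time-constant parameter is learned against different step sizes.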
[LG-68] Controllable Molecular Generative Foundation Models
Link: https://arxiv.org/abs/2605.15354
Authors: Yihan Zhu, Yuhan Liu, Weijiang Li, Tengfei Luo, Meng Jiang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Despite the success of foundation models in language and vision, molecular graph generation still lacks a unified framework for heterogeneous design tasks with reliable controllability. While reinforcement learning (RL) offers a natural post-training mechanism for task-specific optimization, applying it to graph generative models is hindered by the vast atom-wise action spaces and chemically invalid intermediate states. We propose Controllable Molecular Generative Foundation Models (CoMole), built with a unified motif-aware graph diffusion pipeline. By learning a motif-aware graph space, CoMole transfers pretrained structural priors into controllable generation, where RL optimizes conditional reverse policies over chemically meaningful decisions. We theoretically characterize the bottleneck of atom-level RL and justify motif-aware policy optimization. Across three heterogeneous benchmarks spanning materials and drug discovery, CoMole ranks first in controllability on all nine targets, reduces MAE by up to 48.2% relative to the strongest baselines, and maintains validity above 0.94 without rule-based correction or post-hoc filtering. We further show that CoMole transfers controllability to unseen properties by optimizing only task embeddings with the generator frozen, achieving performance competitive with strong task-specific baselines.
[LG-69] Bounded-Rationality Hedging and Generalization
Link: https://arxiv.org/abs/2605.15340
Authors: Pedro A. Ortega
Subjects: Machine Learning (cs.LG)
Comments: 32 pages, 9 figures
Abstract:A learner does not only fit data; it also determines how strongly the training sample may shape its output and how much distortion it can hedge. We study this relation as a bounded-rational decision problem whose primitive object is the induced channel from samples to outputs. The learner’s response law determines which changes in this channel are cheap or costly, and therefore induces both a lower tradeoff curve between training loss and sample dependence and a matched upper certificate curve. When the response law is represented by an $f$-divergence regularizer, these curves live in the regularizer’s native information geometry, with KL as the special case corresponding to Shannon mutual information. We show how the hedge and the two curves can be recovered from black-box behavior by observing responses to scaled losses and local loss perturbations. In learning, population loss is empirical loss plus the distortion induced by the particular training sample. The recovered hedge gives a practical certificate when it covers that distortion. Thus generalization is treated as a testable hedging property of the learner’s own response law.
[LG-70] From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks
Link: https://arxiv.org/abs/2605.15328
Authors: Thodoris Lymperopoulos, Denia Kanellopoulou
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 5 figures
Abstract:Fully Connected Neural Networks (FCNNs) are often regarded as simple and intuitive architectures, yet they serve as the foundation for more complex models. Nonetheless, the lack of consensus on their interpretability continues to pose challenges, underscoring the enduring relevance of simpler, attribution-based approaches for understanding even the most advanced neural architectures. In this regard, we explore a novel idea for estimating feature attribution, by applying perturbation to the features’ attached weights instead of their values. This method offers a fresh perspective aimed at mitigating common limitations in Occlusion techniques, such as Added Bias and Out-of-Distribution data. The application of this rule leads to the formation of a pair of novel attribution methods we call XWP and XWP_c. Founded on simple rules, our methods achieve competitive performance in identifying image signals for simple DNNs, competing with the most established attribution methods on standard baseline metrics. Our work thus contributes to the field of Explainability by introducing a robust framework that paves the way for addressing these long-standing vulnerabilities, and leads to more reliable and interpretable model explanations.
[LG-71] Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise
Link: https://arxiv.org/abs/2605.15314
Authors: Antesh Upadhyay, Arda Fazla, Abolfazl Hashemi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We study nonconvex stochastic optimization under the Blum-Gladyshev ($\mathsf{BG}$-0) noise model, where the stochastic gradient variance grows quadratically with the distance from the initialization. We consider this problem under both standard smoothness and the symmetric generalized-smoothness framework, which captures objectives whose local curvature can scale with the gradient norm. We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under $\mathsf{BG}$-0 noise with oracle complexity $O(\varepsilon^{-6})$. This rate holds both for standard smoothness and for $\alpha$-symmetric generalized smoothness, showing that generalized smoothness is rate-neutral for normalized momentum in this setting. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal $O(\varepsilon^{-4})$ complexity, matching the lower bound. Under expected $\alpha$-symmetric generalized smoothness, the STORM recursion couples gradient-dependent smoothness with distance-dependent noise, leading to complexity $O(\varepsilon^{-(4+\alpha)})$ for $\alpha \in (0,1)$ and $O(\varepsilon^{-5})$ for $\alpha = 1$. When the distance-growth parameter in the noise model vanishes, our guarantees recover the standard bounded-variance rates: $O(\varepsilon^{-4})$ for momentum, $O(\varepsilon^{-3})$ for variance reduction, and $O(\varepsilon^{-2})$ in the deterministic case. To our knowledge, these are the first convergence guarantees for normalized methods in non-convex stochastic optimization under $\mathsf{BG}$-0 noise without bounded domains, increasing batch sizes, or explicit anchoring, covering both standard and generalized smoothness regimes.
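A minimal sketch of normalized SGD with momentum, the first method this abstract analyzes, run on a toy quadratic. The constant step size, the momentum form m ← βm + (1−β)g, and the noise model (standard deviation growing with distance from the initial point) are assumptions made for illustration; the paper's step-size schedules and exact noise condition are not reproduced, and both function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def noisy_quadratic_grad(
    x: Sequence[float],
    x0: Sequence[float],
    rng: random.Random,
    sigma0: float = 0.1,   # assumed base noise level
    sigma1: float = 0.05,  # assumed distance-growth parameter
) -> List[float]:
    """Gradient of f(x) = 0.5 * ||x||^2 plus noise whose variance grows
    quadratically with the distance from the initial point x0."""
    dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
    std = math.sqrt(sigma0 ** 2 + (sigma1 * dist) ** 2)
    return [xi + rng.gauss(0.0, std) for xi in x]


def normalized_sgd_momentum(
    grad_fn: Callable[[Sequence[float], Sequence[float], random.Random], List[float]],
    x0: Sequence[float],
    lr: float = 0.05,
    beta: float = 0.9,
    steps: int = 400,
    seed: int = 0,
) -> List[float]:
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x, x0, rng)  # one stochastic gradient per iteration
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        if norm > 0.0:
            # Step length is always lr; only the momentum direction is used.
            x = [xi - lr * mi / norm for xi, mi in zip(x, m)]
    return x


x_init = [3.0, -4.0]
x_final = normalized_sgd_momentum(noisy_quadratic_grad, x_init)
print(math.hypot(*x_init), math.hypot(*x_final))
```

Because the update divides by the momentum norm, the step length stays fixed no matter how large the noisy gradient is, which is one intuition for why normalization helps when the noise is not uniformly bounded.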
[LG-72] Time-Varying Deep State Space Models for Sequences with Switching Dynamics
Link: https://arxiv.org/abs/2605.15311
Authors: Sanja Karilanova,Subhrakanti Dey,Ayça Özçelikkale
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:The identification and modeling of time-varying systems is a fundamental challenge in signal processing and system identification. To address this challenge, we propose a class of time-varying state-space model (SSM) based neural networks in which the neurons’ states are governed by time-varying dynamics. The proposed model provides the learnable time-varying dynamics through a dictionary of basis functions, where each basis function evolves differently over time. We evaluate the proposed approach on both synthetic data from switching systems and a speech denoising task where real audio is corrupted with switching dynamics noise. The results show that the proposed time-varying model consistently outperforms its time-invariant counterparts while maintaining comparable computational complexity. Our investigations also reveal which aspects of the time-varying dynamics of the data most need to be captured by the proposed time-varying models, how the additional freedom provided by time-varying basis functions should be allocated across model components, and to what extent larger models can compensate for time-invariant limitations.
[LG-73] How Data Augmentation Shapes Neural Representations
Link: https://arxiv.org/abs/2605.15306
Authors: Tianxiao He,Alex H. Williams,Sarah E. Harvey
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Data augmentation is widely recognized for improving generalization in deep networks, yet its impact on the geometry of learned representations remains poorly understood. In this work, we characterize how different data augmentation strategies reshape internal representations in neural networks. Using tools from shape analysis, we embed network hidden representations into a metric space where distance is invariant to scaling, translation, rotation and reflection. We show that increasing augmentation strength leads to well-behaved trajectories in this space, and that different augmentation types steer representations in distinct directions. Moreover, we investigate how neural representation shapes are distorted along data augmentation trajectories, and show that insights from neural geometry can predict which representations provide the most improvement when ensembling models. Our results reveal shared geometric patterns across architectures and seeds, and suggest that analyzing shape-space trajectories offers a principled tool for understanding and comparing data augmentation methods.
[LG-74] Unified Simulation of Lagrangian Particle Dynamics via Transformer
Link: https://arxiv.org/abs/2605.15305
Authors: Caoliwen Wang,Minghao Guo,Siyuan Chen,Heng Zhang,Mengdi Wang,Xingyu Ni,Hanson Sun,Kunyi Wang,Zherong Pan,Kui Wu,Lingjie Liu,Yin Yang,Chenfanfu Jiang,Taku Komura,Wojciech Matusik,Peter Yichen Chen
Categories: Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:A unified simulator that can model diverse physical phenomena without solver-specific redesign is a long-standing goal across simulation science. We present a learning-based particle simulator built on a single transformer architecture to model cloth, elastic solids, Newtonian and non-Newtonian fluids, granular materials, and molecular dynamics. Our model follows a prediction-correction design on a shared Lagrangian particle representation. An explicit predictor first advances particles under the known external forces, producing an intermediate state that captures externally driven motion but not inter-particle interactions. A learned corrector then predicts the residual position and velocity updates through three stages: a particle tokenizer that encodes local particle-particle, particle-boundary, and topology-guided interactions; a super-token encoder that hierarchically merges particle tokens into a compact set of super tokens via alternating self-attention and token merging; and a super-token decoder that lifts these super tokens back to particle resolution through cross-attention to predict per-particle position and velocity corrections. Progressive token merging reduces the attention cost at successive encoder layers by halving the token count at each level, and the decoder communicates through the compact super-token set rather than full particle-to-particle attention. Across the six dynamics categories, the same architecture generalizes to unseen materials, boundary configurations, initial conditions, and external forces. We further demonstrate downstream interactive control, inverse design, and learning from real-world manipulation data, reducing the need for per-phenomenon solver engineering.
[LG-75] Tadpole: Autoencoders as Foundation Models for 3D PDEs with Online Learning
Link: https://arxiv.org/abs/2605.15284
Authors: Qiang Liu,Felix Koehler,Benjamin Holzschuh,Nils Thuerey
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We introduce Tadpole, a novel foundation model for three-dimensional partial differential equations (PDEs) that addresses key challenges in transferability, scalability to high dimensionality, and multi-functionality. Tadpole is pre-trained as an autoencoder on synthetic 3D PDE data generated by an efficient online data-generation framework. This enables large-scale, diverse training without storage or I/O overhead, demonstrated by scaling to an equivalent of hundreds of terabytes of training data. By autoencoding single-channel spatial crops, Tadpole learns rich and transferable representations across heterogeneous physical systems with varying numbers of state variables and spatial resolutions. Although pre-trained solely as an autoencoder, Tadpole can be efficiently applied for multiple downstream tasks beyond reconstruction, including dynamics learning and generative modeling. For dynamics learning, we propose a novel parameter-efficient fine-tuning strategy that integrates low-rank adaptation, latent-space transformations, and reintroduced skip connections, achieving accurate temporal modeling with a minimal number of trainable parameters. Tadpole demonstrates strong fine-tuning performance across various downstream tasks, highlighting its versatility and effectiveness as a foundation model for 3D PDE learning. Source code and pre-trained weights of Tadpole are available at this https URL
[LG-76] Training on Documents About Monitoring Leads to CoT Obfuscation
Link: https://arxiv.org/abs/2605.15257
Authors: Reilly Haskins,Bilal Chughtai,Joshua Engels
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Chain-of-thought (CoT) monitoring is one of the most promising tools we have for detecting model misbehavior, but its effectiveness depends on models faithfully externalizing their reasoning. Motivated by this vulnerability, we study whether monitor-aware models are capable of obfuscating their reasoning to evade detection. We use synthetic document finetuning to expose eight models to realistic pre-training-style documents describing a CoT monitor and find that monitor-aware models consistently achieve higher rates of undetected misbehavior compared to unaware controls. This effect is weaker but still present on a harder agentic task. We also show that CoT controllability, a model’s ability to reshape its own reasoning trace under an imposed constraint, is closely correlated with obfuscation success across the eight models studied ($r=0.800$, $p=0.017$). Monitor-aware models placed under equal reinforcement learning optimization pressure also learn to reward-hack without triggering a CoT monitor substantially faster than unaware controls. Together, these results suggest that knowledge of monitoring combined with high CoT controllability poses a risk to CoT-based monitoring.
[LG-77] Curriculum Learning of Physics-Informed Neural Networks based on Spatial Correlation
Link: https://arxiv.org/abs/2605.15254
Authors: Xujia Chen,Xinyue Hu,Letian Chen,Daming Shi,Wenhui Fan
Categories: Machine Learning (cs.LG)
Comments: 37 pages, 14 figures, 9 tables
Abstract:Physics-Informed Neural Networks (PINNs) combine deep learning with physical constraints for solving partial differential equations (PDEs), and are widely applied in fluid mechanics, heat transfer, and solid mechanics. However, PINN training still suffers from high-dimensional non-convex loss landscapes, imbalanced multiobjective constraints, and ineffective information propagation. Existing curriculum learning and causality-guided strategies improve training stability, but mainly focus on temporal or parametric progression, lacking explicit treatment of spatial information propagation and inter-region consistency. Moreover, they are not directly applicable to boundary value problems (BVPs) with strong spatial coupling. To address this issue, we propose a spatially correlated curriculum learning framework for PINNs. To the best of our knowledge, this is the first work to address PINN training difficulties from the perspective of spatial coupling among subregions. First, spatial causal weights guide information from near-boundary regions inward, reducing optimization failures and spurious convergence. Second, a low-frequency information bridge enforces pseudo-label-based consistency across spatially separated regions, suppressing global low-frequency drift. Third, a region-adaptive reweighting strategy adjusts subregion losses to reduce local residuals and recover high-frequency details. Experiments on PDE benchmarks show that, under comparable computational cost, the proposed method alleviates training failures and improves solution accuracy. The code is available at this https URL.
[LG-78] Position: Ideas Should be the Center of Machine Learning Research
Link: https://arxiv.org/abs/2605.15253
Authors: Jairo Diaz-Rodriguez
Categories: Machine Learning (cs.LG)
Comments: Accepted into ICML 2026 this https URL
Abstract:Machine learning research increasingly bifurcates into two disconnected modes: benchmark-driven engineering that prioritizes metrics over understanding, and idealized theory that often fails to transfer to modern systems. In this position paper, we argue that the field focuses too heavily on these endpoints, neglecting the central scientific object: the idea. We propose an Ideas First framework in which ideas are valued for the behavioral signatures they predict in modern models, and these signatures are tested through tailored experiments designed to detect the relevant patterns rather than to win leaderboards. This shift not only bridges the gap between theory and practice but also promotes equity by removing the “complexity premium,” enabling rigorous scientific contributions from researchers with modest computational, financial, and human resources. Ultimately, we advocate for a research culture centered on ideas, treating benchmarks and theorems as instruments for testing mechanistic hypotheses rather than as ends in themselves.
[LG-79] Enabling Adversarial Robustness in AI Models through Kubeflow MLOps
Link: https://arxiv.org/abs/2605.15249
Authors: Stavros Bouras,Ioannis Korontanis,Antonios Makris,Konstantinos Tserpes
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at the 1st Workshop on Secure and Intelligent Data Spaces (SIDS 2026), co-located with the 27th IEEE International Conference on Mobile Data Management (MDM 2026)
Abstract:AI models are increasingly deployed in cloud-native environments to support scalable and automated services. However, while platforms such as Kubernetes provide strong infrastructure orchestration, security mechanisms specifically designed to protect deployed AI models remain limited. This paper presents security measures for AI models deployed in Kubernetes clusters. The proposed architecture integrates Kubeflow-based MLOps to automatically detect adversarial attacks during the inference phase and trigger defense mechanisms that preserve the model’s accuracy and reliability. Specifically, a Fast Gradient Sign Method (FGSM) attack is applied at inference time, and a Projected Gradient Descent (PGD)-based adversarial training defense is automatically deployed when a degradation in accuracy is detected. The experimental results indicate that the deployed defense robustifies the model, significantly recovering accuracy relative to the degradation caused by the attack.
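The FGSM attack named in this abstract perturbs an input by a fixed amount in the direction of the sign of the loss gradient. The sketch below shows that step for a tiny logistic-regression model in pure Python; the model, the labels in {-1, +1}, and the function names are illustrative assumptions and have nothing to do with the paper's Kubeflow pipeline or its actual models.

```python
import math

def logistic_loss(w, b, x, y):
    """Loss log(1 + exp(-y * z)) for label y in {-1, +1} and logit z = w.x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.log1p(math.exp(-y * z))

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    coeff = -y / (1.0 + math.exp(y * z))  # d(loss)/dz
    return [coeff * wi for wi in w]

def fgsm(w, b, x, y, eps):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(grad_x loss)."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, grad)]
```

For example, `fgsm([2.0, -1.0], 0.5, [1.0, 1.0], 1, 0.1)` moves each coordinate by at most 0.1 and raises the loss on the true label; a PGD-style defense would iterate such steps with projection during training.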
[LG-80] Privacy Evaluation of Generative Models for Trajectory Generation
Link: https://arxiv.org/abs/2605.15246
Authors: Stavros Bouras,Ioannis Kontopoulos,Chiara Pugliese,Francesco Lettich,Emanuele Carlini,Hanna Kavalionak,Chiara Renso,Konstantinos Tserpes
Categories: Machine Learning (cs.LG)
Comments: Accepted at the 1st Workshop on Multi-Sensor Trajectory Knowledge Discovery and Extraction (MuseKDE 2026), co-located with the 27th IEEE International Conference on Mobile Data Management (IEEE MDM 2026)
Abstract:Trajectory data is fundamental to modern urban intelligence, yet its sensitivity raises significant privacy concerns. Generative models such as Generative Adversarial Networks, Variational Autoencoders, and Diffusion Models have been developed to generate realistic synthetic trajectory data by capturing underlying spatiotemporal distributions and mobility patterns. Although these models are often assumed to preserve privacy due to their generative nature, this assumption does not necessarily hold. In this work, we investigate the intersection of generative trajectory modeling and privacy evaluation. By identifying applicable empirical methods for assessing privacy preservation in trajectory generation tasks, we demonstrate a significant gap in the evaluation of privacy for generative trajectory models. Motivated by this gap, we implement Membership Inference Attacks against representative models, demonstrating the feasibility of using such empirical privacy evaluation methods and showing that their generative nature does not eliminate privacy risks.
[LG-81] Logical Grammar Induction via Graph Kolmogorov Complexity: A Neuro-Symbolic Framework for Self-Healing Clinical Data Integrity
Link: https://arxiv.org/abs/2605.15242
Authors: Abolfazl Zarghani,Amir Malekesfandiari
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The reliability of Healthcare Information Systems (HIS) is frequently compromised by human-induced data entry errors, which existing statistical anomaly detection methods fail to distinguish from legitimate clinical extremes. This paper proposes Logic-GNN, a novel neuro-symbolic framework that treats clinical records as a structured "private language" governed by latent logical games. By integrating Temporal Graph Neural Networks (TGNN) with Graph Kolmogorov Complexity, we induce a symbolic grammar that represents the underlying logic of medical interactions. We define anomalies as "grammatical violations" that cause a significant expansion in the Minimum Description Length (MDL) of the clinical graph. Evaluated on the Sina System dataset (2M+ records), Logic-GNN achieves an F1-score of 0.94, outperforming state-of-the-art baselines by 12% in distinguishing between life-threatening medical outliers and data corruption. Our approach introduces a self-healing mechanism that suggests logical corrections to maintain data integrity in real-time HIS environments.
[LG-82] Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
Link: https://arxiv.org/abs/2605.15239
Authors: Yu Fu,Longxuan Yu,Haz Sameen Shahgir,Zhipeng Wei,Hui Liu,N.Benjamin Erichson,Yue Dong
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 5 figures
Abstract:Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce the teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety–reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B). The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.
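Two quantities from this abstract, the per-token KL supervision and the teacher flip rate, can be sketched in a few lines of pure Python. The KL direction (teacher relative to student), the boolean safety labels, and the function names are assumptions for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each token position (direction is an assumption)."""
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row)
        log_q = log_softmax(s_row)
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls

def teacher_flip_rate(student_safe, teacher_safe):
    """Among prompts where the student response is unsafe, the fraction for which
    the context-conditioned teacher response is safe (hypothetical labeling)."""
    flips = [t for s, t in zip(student_safe, teacher_safe) if not s]
    return sum(flips) / len(flips) if flips else 0.0
```

With `student_safe = [False, False, True, False]` and `teacher_safe = [True, False, True, True]`, three responses start unsafe and two of them flip, giving a rate of 2/3.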
[LG-83] MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion
Link: https://arxiv.org/abs/2605.15235
Authors: Wugeng Zheng,Ziwen Kan,Tianlong Chen,Chen Chen,Song Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multimodal physiological data powers clinical AI systems from intensive care units to wearable devices, but sensors routinely fail in practice. Two failure modes are common: modality missing, where an entire channel is absent, and within-modality missing, where a contiguous time segment is lost. No existing benchmark evaluates multiple fusion architectures under both failure modes at controlled severity levels across diverse clinical datasets. We present MuteBench, a benchmark covering 9 datasets from 7 clinical domains, 6 fusion architectures, and 2 missing-data modes over 125,000 samples. Through this benchmark, we find that architecture family is the strongest predictor of robustness, outweighing parameter count. Channel-independent models tolerate modality missing well but can be sensitive to within-modality missing, especially on short sequences. Curriculum modality dropout protects reliably only up to the maximum dropout rate used in training. We also find that channel count, sequence length, and modality alignment jointly determine which failure mode poses the greater threat. Finally, a PTB-XL case study suggests that diffusion-based imputation can improve downstream classification under within-modality missing, with the largest gains for models whose expert routing is most sensitive to corrupted inputs, though broader validation across datasets remains an open direction. MuteBench provides practitioners with concrete guidance for both selecting existing architectures and informing the design of future robust multimodal fusion methods.
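The two failure modes this abstract distinguishes can be simulated on a toy sample stored as a list of channels, each a list of time steps. Zero-filling the missing values and the function names are assumptions for illustration; the benchmark itself may represent missingness differently (for example with masks or NaNs).

```python
def drop_modality(sample, channel):
    """Modality missing: blank out one entire channel of a (channels x time) sample."""
    return [[0.0] * len(row) if c == channel else list(row)
            for c, row in enumerate(sample)]

def drop_segment(sample, channel, start, length):
    """Within-modality missing: blank a contiguous time segment in one channel."""
    out = [list(row) for row in sample]
    stop = min(start + length, len(out[channel]))
    for t in range(start, stop):
        out[channel][t] = 0.0
    return out
```

Sweeping `channel` and `length` over a grid is one simple way to obtain controlled severity levels for both modes.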
[LG-84] Hardware-Software Co-Design of Scalable Energy-Efficient Analog Recurrent Computations
Link: https://arxiv.org/abs/2605.15216
Authors: Arthur Fyon,Julien Brandoit,Loris Mendolia,Damien Ernst,Jean-Michel Redouté,Guillaume Drion
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:
Abstract:Always-on AI applications, from environmental sensors to biomedical implants, require ultra-low power consumption. Analog circuits offer a path to sub-microwatt inference, yet existing analog implementations are limited to feedforward architectures: extending them to recurrent dynamics has been considered impractical due to noise accumulation through temporal feedback. We demonstrate that this barrier can be overcome through hardware-software co-design. Specifically, we identify that Bistable Memory Recurrent Units (BMRUs), a class of Recurrent Neural Networks (RNNs) with discrete-valued outputs and hysteretic dynamics, admit an ultra-low power current-mode analog implementation which we design from first principles. The resulting circuit establishes a one-to-one correspondence between each learned parameter and a circuit element. The discrete outputs suppress analog noise by at least 20-fold at each cell boundary, breaking the noise accumulation that prevents analog recurrence. We reformulate BMRUs for first-quadrant operation with fixed thresholds, enabling the direct correspondence while preserving expressivity and trainability. Transistor-level simulations in 180 nm Complementary Metal-Oxide-Semiconductor (CMOS) show near-perfect agreement between software predictions and circuit-level behavior, with the software model thereby serving as a high-fidelity simulator of the physical hardware at low computational cost. We leverage this fidelity to conduct large-scale noise immunity and power scaling analyses: the power cost of adding recurrence scales linearly with state dimension, while the feedforward layers dominating total power scale quadratically, meaning recurrence is added at linear marginal cost relative to the feedforward backbone. End-to-end keyword spotting achieves sub-microwatt inference at the RNN core.
[LG-85] Universal Magnetic Structure Prediction from Atomic Coordinates with Near-Experimental Accuracy
Link: https://arxiv.org/abs/2605.16230
Authors: Abhijatmedhi Chotrattanapituk,Ryotaro Okabe,Eunbi Rha,Mariya Al-Hinai,Eugene Jiang,Daniel Pajerowski,Yongqiang Cheng,Joshua J. Turner,Mingda Li
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures
Abstract:Magnetic order is a fundamental property of materials, governing collective behavior and enabling a broad range of functionalities. Yet magnetic structure remains difficult to determine: experiments are costly and specialized, while first-principles methods often struggle with the noncollinear and incommensurate orders found in real materials. Here we introduce magnetic structure network (MSN), an E(3) equivariant graph neural network that predicts both collinear and non-collinear magnetic structures directly from atomic crystal structures, trained directly on experimentally determined structures from MAGNDATA. By proposing the primitive modulated structure representation (PMSR), we are able to encode commensurate and incommensurate structures in a unified way without symmetry assumptions. The model achieves strong performance across all modulation components and reconstructs experimental magnetic structures with high fidelity. Our approach provides a scalable framework for rapid magnetic structure prediction and opens a route to data-driven discovery of magnetic materials.
[LG-86] A Scalable Nonparametric Continuous-Time Survival Model through Numerical Quadrature
Link: https://arxiv.org/abs/2605.16208
Authors: Chaeyeon Lee,Sehwan Kim,Hyungrok Do
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Flexible continuous-time survival modeling is critical for capturing complex time-varying hazard dynamics in high-dimensional data; however, training such models remains challenging due to the intractable integral required for likelihood estimation. We introduce QSurv, a scalable deep learning framework that enables nonparametric continuous-time modeling without relying on time discretization or restrictive distributional assumptions. We propose a training objective based on Gauss-Legendre numerical quadrature, which approximates the cumulative hazard with high-order accuracy while facilitating efficient end-to-end training via standard backpropagation. Furthermore, to effectively capture non-stationary hazard dynamics in complex architectures, we introduce time-conditioned low-rank adaptation, a mechanism that conditions general neural backbones on time by dynamically modulating weights via low-rank updates. We provide theoretical analysis establishing approximation error bounds for cumulative-hazard evaluation. Comprehensive experiments across synthetic benchmarks, large-scale real-world tabular datasets, and high-dimensional medical imaging tasks demonstrate that QSurv achieves competitive predictive performance with advantages in instantaneous hazard function estimation, enabling more interpretable characterization of time-varying risk patterns.
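The quadrature idea in this abstract, approximating the cumulative hazard H(t) = ∫₀ᵗ h(s) ds inside the survival likelihood, can be sketched with a composite three-point Gauss-Legendre rule in pure Python. The panel count, the function names, and the use of an arbitrary callable `hazard` in place of a neural network are assumptions for illustration, not the QSurv implementation.

```python
import math

# Three-point Gauss-Legendre nodes and weights on [-1, 1].
_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)

def cumulative_hazard(hazard, t, panels=4):
    """Approximate the integral of hazard(s) over [0, t] with a composite rule."""
    width = t / panels
    half = 0.5 * width
    total = 0.0
    for k in range(panels):
        mid = k * width + half
        # Map the reference nodes from [-1, 1] onto the current panel.
        total += half * sum(w * hazard(mid + half * x)
                            for x, w in zip(_NODES, _WEIGHTS))
    return total

def log_likelihood(hazard, t, event):
    """Right-censored log-likelihood: event * log h(t) - H(t)."""
    return event * math.log(hazard(t)) - cumulative_hazard(hazard, t)
```

For the smooth test hazard h(s) = 1 + 0.5 sin(s), the exact value is H(t) = t + 0.5 (1 − cos t), which the rule reproduces closely with only 12 hazard evaluations. In a differentiable framework the same weighted sum would let gradients flow through every hazard evaluation.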
[LG-87] SwAIther-Precip: Lead-Time-Aware Bias Correction Enables Kilometer-Scale Downscaling of Global AI Precipitation Forecasts over Switzerland
Link: https://arxiv.org/abs/2605.16163
Authors: Dan Assouline,Erwan Koch,Federico Amato,Filippo Quarenghi,Daniele Nerini,Thibaut Loiseau,Kyle van de Langemheen,Tom Beucler
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:
Abstract:Skillful medium-range precipitation forecasting at kilometer scale remains challenging over complex terrain because precipitation arises from multiscale nonlinear processes that global models cannot explicitly resolve at affordable cost. Global AI weather models can produce skillful medium-range forecasts, but their native 0.25 degrees resolution limits direct use for local hazard applications. Statistical downscaling can help bridge this gap, yet existing approaches often struggle with state-dependent, and especially lead-time-dependent, biases in global forecasts. We introduce SwAIther-Precip, a lead-time-aware downscaling framework that converts coarse-resolution AIFS forecasts into probabilistic km-scale precipitation fields over Switzerland. First, a U-Net conditioned on lead time via feature-wise linear modulation deterministically corrects systematic biases at coarse resolution. This targeted correction enables a cheaper super-resolution stage conditioned only on corrected precipitation, allowing direct training on observations rather than on the full atmospheric state. A diffusion-based model then generates fine-scale spatial variability independently of lead time. Using AIFS forecasts and CombiPrecip radar-gauge observations, SwAIther-Precip reduces CRPS by 48% relative to raw AIFS. The generated fields reproduce observed spatial variability with spectral fidelity above 0.85 at large scales and 0.88 at small scales, corresponding to an effective resolution of approximately 4 km on a 1 km grid for lead times up to 5 days. Training across lead times further improves long-range performance, yielding a 13% CRPS reduction at 6 days relative to lead-time-specific models. These results show that explicitly correcting lead-time-dependent biases before generative super-resolution is key to efficient km-scale probabilistic downscaling of global AI precipitation forecasts.
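Feature-wise linear modulation (FiLM), which this abstract uses to condition the bias-correction U-Net on lead time, applies a per-channel scale and shift to a feature map. The sketch below shows that operation in pure Python. The linear map from lead time to `(gamma, beta)` is a hypothetical stand-in for whatever embedding network the authors use.

```python
def film(features, gamma, beta):
    """Per-channel scale and shift: out[c][i] = gamma[c] * features[c][i] + beta[c]."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]

def lead_time_film_params(lead_time_days, w_gamma, w_beta):
    """Hypothetical linear conditioning: gamma starts at 1 and beta at 0 for lead
    time zero, so the modulation is the identity when there is no lead time."""
    gamma = [1.0 + w * lead_time_days for w in w_gamma]
    beta = [w * lead_time_days for w in w_beta]
    return gamma, beta
```

The point of this design is that one set of convolutional weights is shared across all lead times, while only the cheap per-channel parameters vary with the forecast horizon.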
[LG-88] Skew-adaptive conformal prediction
Link: https://arxiv.org/abs/2605.16145
Authors: Paulo C. Marques F.,Helton Graziadei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 17 pages, 2 figures
Abstract:We develop a skew-adaptive extension of split conformal prediction for regression. The method starts from an asymmetric interval family centered at a point prediction and uses the gauge approach to deduce the conformity score induced by this family. The inverse hyperbolic sine transform of signed scaled residuals provides the training target for an additional predictive model, whose role is to learn how predictive uncertainty should tilt across the feature space. The resulting procedure preserves the finite-sample marginal validity of split conformal prediction under exchangeability, while producing intervals that adapt to both local scale and local skewness. We also develop a calibration-sample-based estimator for comparing the expected relative future width of the skew-adaptive and classical scaled-score intervals. Experiments on a variety of datasets indicate gains in prediction interval efficiency over the scaled-score construction and conformalized quantile regression, and show that the proposed estimator closely matches the corresponding average width ratio observed on the test sample.
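For context, the classical split conformal baseline that this abstract extends can be sketched in pure Python, together with the inverse hyperbolic sine target the abstract mentions. This is the standard symmetric absolute-residual construction, not the paper's skew-adaptive interval family; the function names and the `scale` argument are illustrative assumptions.

```python
import math

def conformal_quantile(scores, alpha):
    """k-th smallest calibration score with k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def split_conformal_interval(prediction, calib_preds, calib_targets, alpha=0.1):
    """Symmetric interval from absolute residuals on a held-out calibration set."""
    scores = [abs(y - mu) for mu, y in zip(calib_preds, calib_targets)]
    q = conformal_quantile(scores, alpha)
    return prediction - q, prediction + q

def asinh_target(y, mu, scale):
    """Inverse hyperbolic sine of the signed scaled residual (training target
    for the additional model, as described in the abstract)."""
    return math.asinh((y - mu) / scale)
```

The symmetric interval has the same half-width on both sides of the point prediction, which is the limitation a skew-aware construction addresses.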
[LG-89] A numerical study into neural network surrogate model performance for uncertainty propagation
Link: https://arxiv.org/abs/2605.16078
Authors: Noah Wade,Kirubel Teferra
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Neural network surrogate models have emerged as a promising approach to model solution fields for a wide variety of boundary value problems encountered in physical modeling. Stochastic problems represent an area of particularly high interest because of the potential to significantly reduce the repeated evaluation of expensive forward models via traditional numerical solvers when conducting parametric analysis. However, many studies found in the literature primarily focus on the ability of neural network surrogate models to represent deterministic samples or mean field solutions and largely overlook surrogate model performance at the tails of the distribution. The present study examines in detail the ability of neural network surrogate models to capture the full distribution of solution fields over the entire probability space, while emphasis is placed at the tails of the distribution. Serving as a canonical problem is the heat conduction equation with a highly stochastic source term, inducing extremely large variation in the thermal solution field. Comparisons are made between a classic feed-forward fully connected network and a Deep Operator Network architecture, using both data-driven and physics-informed loss functions. Results show that the worst-case prediction errors are an order of magnitude larger than the mean field error, highlighting the importance of the outlier samples. The large errors associated with extreme samples result from the networks having to extrapolate beyond the bounds of the training data. A method for identifying these samples is presented along with a discussion of potential approaches to account for their errors. Among the models considered, the fully connected neural network trained using a weak form residual loss performs best in handling these extrapolated inputs, achieving the highest prediction accuracy for the numerically produced datasets.
[LG-90] Explainable AI Isn't Enough! Rethinking Algorithmic Contestability
Link: https://arxiv.org/abs/2605.16041
Authors: Timo Freiesleben,Kristof Meding,Gunnar König
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Machine learning systems increasingly make life-changing decisions about individuals, such as loan approvals, hiring, and cheating detection, raising a pressing question: how can individuals respond to negative decisions made by these opaque systems? While explainable artificial intelligence (XAI) has largely focused on algorithmic recourse – helping individuals change their features to obtain a desired outcome – the parallel problem of algorithmic contestability – helping individuals review and correct erroneous algorithmic decisions – has received far less attention, despite its central ethical and legal importance. We trace this neglect to the absence of clear formal definitions and a systematic operationalization of contestability as an algorithmic problem. To address it, we propose an operational definition of contestability as a natural complement to recourse: contestability starts from the presumption that a decision may be incorrect and focuses on identifying evidence to challenge and potentially overturn it, whereas recourse assumes the decision is valid and instead provides pathways for changing it. We show that standard XAI explanations, such as counterfactuals, LIME, or Anchors, even when combined with human intuitions about decision continuity or monotonicity, reveal only errors in the neighborhood of the individual, but provide insufficient grounds for overturning the decision at hand. Going thus beyond traditional XAI, we identify three types of evidence warranting reversal according to the decision maker’s own ethical standards: predictive multiplicity, incorrect feature values, and neglected overruling evidence. We argue that these render decisions normatively indefensible and thus successfully contestable. Finally, we analyze how existing EU legislation connects to our framework and argue that individuals already hold some legal rights to these forms of evidence.
[LG-91] Testing properties of trees in graphical models with covariance queries
Link: https://arxiv.org/abs/2605.15996
Authors: Sofiya Burova, Francisco Calvillo, Gábor Lugosi, Piotr Zwiernik
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:We consider the problem of testing properties of graphs underlying high-dimensional graphical models. We adopt the model of covariance queries introduced by Lugosi, Truszkowski, Velona, and Zwiernik (2021). We study the case when the underlying graph is a tree. The main results of the paper show that, while reconstructing the entire tree may be costly, certain global structural properties can be tested efficiently. In particular, we design randomized tests for global structural properties that use a sub-quadratic number of queries. We develop testing procedures for several fundamental properties, including the number of leaves, the maximum degree, the typical distance, and the diameter of the tree. For each property, we obtain explicit query complexity bounds that depend on the target threshold and tolerance parameters.
[LG-92] Clock-state olfactory search in turbulent flows using Q-learning: The geometry of plume recovery
Link: https://arxiv.org/abs/2605.15938
Authors: Marco Rando, Robin A. Heinonen, Yujia Qi, Agnese Seminara
Categories: Biological Physics (physics.bio-ph); Machine Learning (cs.LG)
Comments: 15 pages, 13 figures, 1 table
Abstract:Finding an odor source in a turbulent flow requires effectively leveraging the history of olfactory observations into a robust navigation strategy. In this work, we use tabular Q-learning to train an olfactory search agent with a minimal memory of past observations: only a running clock since the last whiff. This agent learns an interpretable strategy to recover the plume which combines well-known behaviors observed in insects: surging, casting, and a return downwind. While achieving good performance on data from direct numerical simulations of turbulence, the agent is limited by an inability to adapt its strategy to the local intermittency level; we show that providing more flexibility improves robustness.
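The clock-state idea in the abstract above can be sketched with ordinary tabular Q-learning. This is a minimal illustration, not the authors' code: the action names, the clock cap `MAX_CLOCK`, the reward of -1 per step, and the learning-rate and discount values are all assumptions made for the example.

```python
import random

# Hypothetical action set and clock cap; the paper's actual choices are not given here.
ACTIONS = ["upwind", "cross_left", "cross_right", "downwind"]
MAX_CLOCK = 20


def next_clock(clock, whiff):
    """Reset the clock on an odor detection; otherwise count up to the cap."""
    return 0 if whiff else min(clock + 1, MAX_CLOCK)


def q_update(q, clock, action, reward, new_clock, alpha=0.1, gamma=0.99):
    """Standard tabular Q-learning update on a table indexed by the clock."""
    td_target = reward + gamma * max(q[new_clock])
    q[clock][action] += alpha * (td_target - q[clock][action])


def epsilon_greedy(q, clock, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    row = q[clock]
    return row.index(max(row))


# One illustrative transition: a whiff resets the clock, then one step without odor.
q_table = [[0.0] * len(ACTIONS) for _ in range(MAX_CLOCK + 1)]
clock = next_clock(3, whiff=True)
new_clock = next_clock(clock, whiff=False)
q_update(q_table, clock, ACTIONS.index("upwind"), reward=-1.0, new_clock=new_clock)
greedy_action = epsilon_greedy(q_table, clock, epsilon=0.0, rng=random.Random(0))
```

Because the only state is the time since the last whiff, the learned table can be read row by row as "what to do after k steps without odor", which is what makes such a policy easy to interpret.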
[LG-93] Unsupervised Domain Shift Detection with Interpretable Subspace Attribution
Link: https://arxiv.org/abs/2605.15920
Authors: Sebastian Springer, Alessandro Laio
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We developed a tool for detecting domain shifts, namely subtle differences in the probability distributions of datasets. We identify these shifts using an algorithm designed to detect localised density anomalies in high-dimensional feature spaces. If an anomaly is present, we then identify the feature subspace in which the anomaly is most pronounced. This allows us to trace the domain shift to a small set of features, making the shift interpretable. Moreover, we provide a protocol for compensating domain shifts by extracting, from two unlabelled datasets, subsets of samples with no detectable residual distributional difference. We validate the framework on controlled 20-dimensional benchmarks with known ground truth, recovering both broad and localized shifts together with their supporting feature subspaces. We then apply it to healthy electrocardiogram (ECG) recordings represented by 782 features. In age- and sex-matched cohort comparisons differing in measurement-device composition, the method detects device-induced shifts, extracts representative subsets enriched in the imbalanced device components, and identifies ECG features associated with the acquisition contrast. These results suggest that density-shift detection and subspace attribution provide a practical framework for uncovering hidden cohort biases before downstream modelling.
[LG-94] Reweighting free energy profiles between universal machine learning interatomic potentials for fast consensus building
Link: https://arxiv.org/abs/2605.15630
Authors: Sauradeep Majumdar, Miguel Steiner, Johannes C. B. Dietschreit, Swagata Roy, Daniel Willimetz, Lukaš Grajciar, Rafael Gómez-Bombarelli
Categories: Chemical Physics (physics.chem-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures, 1 table, SI appended
Abstract:Free energy profiles serve as a fundamental bridge between microscopic atomic fluctuations and macroscopic thermodynamic observables. Estimating the free energy profile along a reaction coordinate, referred to as the potential of mean force (PMF), with density functional theory (DFT) accuracy is computationally expensive. Universal machine learning interatomic potentials (MLIPs) drastically reduce this cost, but their accuracy is strongly determined by their training data and hence can be uncertain for a given system. In this work, we present a systematic and scalable framework for reweighting PMFs, initially sampled with a single ‘source’ MLIP, across a representative suite of target MLIPs. Because traditional direct exponential reweighting fails for large system sizes due to low phase-space overlap between potentials, we deploy robust analytical corrections. Applying this to a complex 601-atom system of Li^+ transport in a nanoconfined electrolyte, we demonstrate that a mean energy-gap approximation effectively bypasses statistical collapse, producing a highly stable PMF matching the target PMF. Using this approach, we recover high-fidelity target thermodynamics across multiple DFT reference levels (PBE+D3, PBE-sol, r^2SCAN, r^2SCAN-D4) at a fraction of the computational cost of full simulations. Furthermore, thermodynamic analysis reveals that the studied MLIPs partition into two distinct clusters driven by their training data. Our reweighting framework successfully recovers target thermodynamic properties (specifically, reaction and activation free energies) even when the phase-space overlap between potentials is critically low. Ultimately, this approach establishes a vital diagnostic protocol to achieve affordable cross-model consensus on materials chemistry properties without redundant, resource-intensive simulations.
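To make the "statistical collapse" of direct exponential reweighting concrete, here is a minimal sketch of the textbook procedure: each frame sampled with the source potential gets a weight proportional to exp(-ΔU/kT), where ΔU is the target-minus-source energy gap. The function names are hypothetical, the inputs are toy numbers, and the paper's own analytical corrections (including its mean energy-gap approximation) are not reproduced.

```python
import math


def reweight_weights(delta_u, kT):
    """Normalized weights proportional to exp(-delta_u / kT), computed stably."""
    logs = [-d / kT for d in delta_u]
    shift = max(logs)  # subtract the maximum before exponentiating
    unnorm = [math.exp(v - shift) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(weights):
    """Kish effective sample size for weights that sum to one."""
    return 1.0 / sum(w * w for w in weights)


def reweighted_pmf(bins, weights, n_bins, kT):
    """PMF per bin, F(s) = -kT ln p(s), shifted so its minimum is zero."""
    p = [0.0] * n_bins
    for b, w in zip(bins, weights):
        p[b] += w
    f = [-kT * math.log(v) if v > 0 else math.inf for v in p]
    f_min = min(f)
    return [v - f_min for v in f]


# Toy data: four frames in two bins along the reaction coordinate (energies in units of kT).
weights = reweight_weights([0.0, 0.0, 1.0, 1.0], kT=1.0)
pmf = reweighted_pmf([0, 0, 1, 1], weights, n_bins=2, kT=1.0)

# Large, uneven gaps concentrate nearly all weight on a single frame.
collapsed = effective_sample_size(reweight_weights([0.0, 50.0, 50.0, 50.0], kT=1.0))
```

When the gaps are spread over many kT, as tends to happen for systems with hundreds of atoms, the effective sample size approaches one and the reweighted PMF is dominated by a handful of frames. That is the failure mode the abstract refers to.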
[LG-95] Pessimistic Risk-Aware Policy Learning in Contextual Bandits
Link: https://arxiv.org/abs/2605.15620
Authors: Yilong Wan, Yuqiang Li, Xianyi Wu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization. In this work, we propose a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a broad class of risk measures encompassing mean-variance, entropic risk, and conditional value-at-risk, among others. By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an \tilde{\mathcal{O}}(1/\sqrt{n}) rate, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization, indicating that optimizing general Lipschitz risk criteria incurs no additional statistical cost relative to the expected-reward.
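The three risk criteria named in the abstract can be written as plain empirical functionals of a reward sample. The sketch below assumes higher rewards are better and uses one common sign convention for each measure; the paper's importance-sampling estimators and its exact definitions may differ.

```python
import math


def mean_variance(rewards, lam):
    """Mean reward minus lam times the population variance."""
    n = len(rewards)
    mu = sum(rewards) / n
    var = sum((r - mu) ** 2 for r in rewards) / n
    return mu - lam * var


def entropic_risk(rewards, theta):
    """Entropic criterion -(1/theta) * log E[exp(-theta * R)] for theta > 0."""
    n = len(rewards)
    shift = max(-theta * r for r in rewards)  # log-sum-exp stabilization
    lse = shift + math.log(sum(math.exp(-theta * r - shift) for r in rewards) / n)
    return -lse / theta


def cvar(rewards, alpha):
    """Lower-tail CVaR: the average of the worst ceil(alpha * n) rewards."""
    k = max(1, math.ceil(alpha * len(rewards)))
    worst = sorted(rewards)[:k]
    return sum(worst) / k


sample = [1.0, 2.0, 3.0, 4.0]
mv = mean_variance(sample, lam=1.0)
er = entropic_risk(sample, theta=1.0)
tail = cvar(sample, alpha=0.5)
```

All three reduce to the plain mean in a limiting case (`lam=0`, `theta` tending to 0, `alpha=1`), which is one way to see them as risk-sensitive generalizations of the expected-reward criterion.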
[LG-96] MaxSketch: Robust Distinct Counting in Streams via Random Projections
Link: https://arxiv.org/abs/2605.15571
Authors: Nikos Tsikouras, Constantine Caramanis, Christos Tzamos
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, so repeated instances of the same object are only approximately similar – for example, different images of the same individual may vary significantly at the pixel level. Classical sketches such as HyperLogLog rely on consistent hash values for identical elements and break down in this regime. Recent work on robust distinct counting in general metric spaces achieves \widetilde{\Theta}(\sqrt{n}) memory, which is tight in the worst case. We show that substantially improved memory guarantees are possible under geometric structure common in learned representations. We introduce MaxSketch, a simple max-linear sketch built from random Gaussian projections, and prove that it succeeds in estimating the number of distinct latent objects. Concretely, we show that under this assumption m = \widetilde{O}(\log n / \varepsilon^2) random projections (and hence \widetilde{O}(\log n / \varepsilon^2) memory) suffice to recover the true distinct count within a (1+\varepsilon) factor. Experiments on image streams confirm that MaxSketch accurately estimates distinct counts and generalizes beyond the training regime. Our results bridge classical streaming algorithms and modern representation learning, showing how geometric structure can fundamentally reduce the complexity of distinct counting.
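A max-linear sketch over random Gaussian projections, as described above, can be illustrated in a few lines: keep m random directions and, for each, the largest projection seen so far. The class below is a hypothetical sketch of that data structure only. The estimator that turns the stored maxima into a distinct count is not given in the abstract and is deliberately left out.

```python
import random


class MaxLinearSketch:
    """Keeps, for each of m Gaussian directions, the running maximum projection."""

    def __init__(self, dim, m, seed=0):
        rng = random.Random(seed)
        self.directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
        self.maxima = [float("-inf")] * m

    def update(self, x):
        """Fold one stream element into the sketch."""
        for j, g in enumerate(self.directions):
            proj = sum(gi * xi for gi, xi in zip(g, x))
            if proj > self.maxima[j]:
                self.maxima[j] = proj

    def merge(self, other):
        """Combine two sketches built with the same seed (same directions)."""
        self.maxima = [max(a, b) for a, b in zip(self.maxima, other.maxima)]


sketch = MaxLinearSketch(dim=3, m=4, seed=1)
sketch.update([1.0, 0.0, 0.0])
snapshot = list(sketch.maxima)
sketch.update([1.0, 0.0, 0.0])  # an exact repeat leaves the sketch unchanged
```

Memory is m numbers regardless of stream length. A near-duplicate x + δ can move each stored maximum by at most the projection of δ onto that direction, so small perturbations cause only small changes, which is the kind of robustness that hash-based sketches lack.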
[LG-97] Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
Link: https://arxiv.org/abs/2605.15522
Authors: Dmitry Kovalev
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates theoretically outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW’s exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.
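As a rough illustration of "AdamW with clipped updates", the sketch below applies the usual AdamW moment estimates and then rescales the preconditioned update so its Euclidean norm does not exceed a threshold. Where and how the paper clips is an assumption here, as are the function name `clipped_adamw_step`, the `state` layout, and every hyperparameter value.

```python
import math


def clipped_adamw_step(x, grad, state, lr=1e-2, beta1=0.9, beta2=0.999,
                       eps=1e-8, weight_decay=0.0, clip=1.0):
    """One AdamW step whose preconditioned update is norm-clipped (assumed form)."""
    state["t"] += 1
    t = state["t"]
    m, v = state["m"], state["v"]
    update = []
    for i, g in enumerate(grad):
        # Exponentially weighted first and second moment estimates.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)  # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        update.append(m_hat / (math.sqrt(v_hat) + eps))
    # Rescale the whole update so its norm is at most `clip`.
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    # Decoupled weight decay, as in AdamW.
    return [xi - lr * (scale * u + weight_decay * xi) for xi, u in zip(x, update)]


state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
x_new = clipped_adamw_step([0.0, 0.0], [3.0, 4.0], state, lr=0.1, clip=1.0)
```

On the first step each coordinate of the unclipped update is close to the sign of its gradient, so the update norm is about √2 and clipping shrinks the step to length `lr * clip`.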
[LG-98] Harnessing Unimodality in Semiparametric Contextual Pricing via Oracle Price Map Learning
Link: https://arxiv.org/abs/2605.15411
Authors: Yingying Fan, Yuxuan Han, Jinchi Lv, Xiaocong Xu, Zhengyuan Zhou
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We study contextual dynamic pricing in a semiparametric scalar-index valuation model where the latent value is v_t = \mu_\ast(\mathsf{c}_t) + \xi_t, with an unknown utility map \mu_\ast and an unknown additive noise distribution. The key decision object is the one-dimensional oracle price map u \mapsto p^\ast(u) induced by the scalar index u = \mu_\ast(\mathsf{c}) and the noise tail. Under the \beta-Hölder smoothness of the tail function for \beta \geq 2 and a revenue-geometry condition that gives a unique, stable, interior maximizer, this oracle map is itself (\beta-1)-smooth. We exploit such structure through \mathsf{ORBIT}, a modular coarse-to-fine policy that takes a scalar pilot index as input, localizes a benchmark price in each active bin, and learns a local polynomial approximation of the oracle map inside a trust region via bandit convex optimization. For the baseline linear utility model \mu_\ast(\mathsf{c}) = \mathsf{c}^\top \theta_\ast, an adaptive elliptical exploration scheme constructs the required scalar pilot online without distributional assumptions on the contexts. The resulting policy achieves regret \widetilde{O}\big(T^{\frac{2\beta-1}{4\beta-3}} + \sqrt{dT}\big). For fixed d, we establish a matching lower bound in the horizon dependence, unveiling that the nonparametric oracle-map learning term is minimax sharp. The same scalar-pilot interface also yields extensions to sparse high-dimensional linear utility and nonparametric Hölder utility.
[LG-99] Quantum Feature Pyramid Gating for Seismic Image Segmentation
Link: https://arxiv.org/abs/2605.15370
Authors: Taha Gharaibeh, Jyotsna Sharma
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:
Abstract:Accurate salt-body delineation is essential for seismic interpretation because salt structures distort wave propagation, complicate velocity-model building, obscure reservoir geometry, and increase uncertainty in exploration and drilling decisions. Although hybrid quantum-classical models have shown competitive performance on small-scale image-classification tasks, their value for dense, pixel-level geophysical prediction remains largely untested. This work introduces quantum feature gating, a hybrid segmentation architecture that embeds a parameterized quantum circuit (PQC) at feature-fusion points within an encoder-decoder pipeline. A 4-qubit, 2-layer PQC with data re-uploading computes a learned convex combination of lateral and top-down features at each Feature Pyramid Network merge point. A global-average-pooling layer maps encoder features to a fixed 4-dimensional quantum input, decoupling the 72-parameter quantum budget from backbone size and image resolution. The method is evaluated on the 2018 TGS Salt Identification Challenge using 4,000 seismic images at 101 x 101 resolution, across two integration topologies, eight circuit variants, and six encoders with 8M to 118M parameters under five-fold cross-validation. In a controlled EfficientNetV2-L ablation at 256 x 256 resolution, replacing the three Quantum FPN Gates with element-wise addition while holding the encoder, loss schedule, splits, and threshold search fixed reduces mean IoU from 0.9389 to 0.8404, a 9.85 percentage-point gap. Inserting the same circuit as skip-connection attention in a custom U-Net improves IoU by 0.88 points over the SolidUNet baseline, showing that the PQC contribution depends on where and what it gates. These results provide controlled evidence that quantum feature fusion can improve dense seismic segmentation.
[LG-100] Stochastic Compositional Optimization via Hybrid Momentum Frank–Wolfe
Link: https://arxiv.org/abs/2605.15350
Authors: El Mahdi Chayti
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Stochastic compositional optimization minimizes objectives of the form \min_{\bm{x} \in \mathcal{X}} F(\bm{f}(\bm{x}), \bm{x}), where \bm{f} is accessible only through noisy stochastic queries. Existing methods for this problem assume that the outer function F is continuously differentiable, which excludes many practically important applications such as robust max-of-losses, Conditional Value-at-Risk, and norm regularizers. We propose the Hybrid Momentum Stochastic Frank–Wolfe algorithm, which drops the smoothness assumption on F. By combining a momentum-based Jacobian tracker with a Taylor-corrected function tracker, the algorithm feeds an entire stochastic linearization – rather than a single gradient – into a generalized linear minimization oracle. We establish an \mathcal{O}(K^{-1/4}) convergence rate in the generalized Frank–Wolfe gap for non-convex objectives with L_F-Lipschitz outer functions, matching the optimal complexity for projection-free single-sample stochastic methods under expected smoothness. The analysis extends to heavy-tailed noise oracles with bounded r-th moments for r \in (1, 2] and recovers the deterministic rates of Vladarean et al. (2023) as the noise vanishes.
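For readers unfamiliar with the terms "linear minimization oracle" and "Frank–Wolfe gap", the sketch below shows the classical deterministic Frank–Wolfe method on the probability simplex with a smooth quadratic objective. It is background only: the hybrid momentum trackers, the stochastic oracle, and the generalized oracle of the paper are not implemented, and the objective and step-size rule are illustrative choices.

```python
def lmo_simplex(grad):
    """Linear minimization oracle: the simplex vertex minimizing <grad, s>."""
    i = min(range(len(grad)), key=lambda j: grad[j])
    return [1.0 if j == i else 0.0 for j in range(len(grad))]


def frank_wolfe(grad_fn, x0, steps):
    """Projection-free minimization over the simplex with step size 2 / (k + 2)."""
    x = list(x0)
    gap = float("inf")
    for k in range(steps):
        g = grad_fn(x)
        s = lmo_simplex(g)
        # Frank-Wolfe gap at the current iterate; nonnegative for feasible x.
        gap = sum(gi * (xi - si) for gi, xi, si in zip(g, x, s))
        gamma = 2.0 / (k + 2)
        # A convex combination keeps the iterate inside the simplex.
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x, gap


# Illustrative objective: squared distance to a point inside the simplex.
target = [0.2, 0.3, 0.5]


def grad_fn(x):
    return [2 * (xi - ti) for xi, ti in zip(x, target)]


x_final, last_gap = frank_wolfe(grad_fn, [1.0, 0.0, 0.0], steps=200)
```

The method never projects: each iterate is a convex combination of simplex vertices, and the gap gives a computable stationarity measure, which is why convergence rates for this family are stated in terms of it.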
[LG-101] On Kernel Eigen-alignments of KRR: Reconstruction and Generalization
Link: https://arxiv.org/abs/2605.15240
Authors: Yang Liu, Ernest Fokoue, Richard Lange, Daniel Krutz
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This paper investigates the critical role of eigen-alignments between the kernel matrix and learning targets in achieving robust generalization in learning problems. We establish a direct connection between generalization performance in kernel methods and the estimation of eigenvectors and eigenvalues of matrices, offering a more intuitive understanding compared to prior work with minimal assumptions. We also show that, since the prediction task in kernel ridge regression (KRR) is essentially a weighted sum of eigenvectors/singular vectors, analyzing how much error perturbations to the kernel matrix can cause lets us derive a bound on the generalization error from the estimation stability of matrix eigenvalues and eigenvectors. Compared with previous work, our analysis concentrates on finite-sample settings and on the generalization error arising from having a suboptimal finite training set. Our findings reveal that in kernel methods, as long as the kernel is of high rank, a near-zero reconstruction error can be obtained trivially, implying that the reconstruction error has limited predictive power for generalization. Finally, we establish a generalization bound from an eigenvalue/eigenvector estimation perspective, showing that strong generalization requires increasing eigenvector alignment, eigenvalue magnitude, or gaps between consecutive eigenvalues.